COMMON RANDOMNESS, EFFICIENCY, AND ACTIONS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Lei Zhao
August 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/bn436fy2758
© 2011 by Lei Zhao. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Thomas Cover, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Itschak Weissman, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Abbas El-Gamal
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Preface
The source coding theorem and the channel coding theorem, first established by Shannon in 1948, are the two pillars of information theory. The insight obtained from Shannon's work greatly changed the way modern communication systems are conceived and built. As Shannon's original ideas were absorbed by researchers, the mathematical tools of information theory were put to great use in statistics, portfolio theory, complexity theory, and probability theory.
In this work, we explore the area of common randomness generation, where remote
nodes use nature’s correlated random resource and communication to generate a
random variable in common. In particular, we investigate the initial efficiency of
common randomness generation as the communication rate goes down to zero, and
the saturation efficiency as the communication exhausts nature’s randomness. We
also consider the setting where some of the nodes can generate action sequences to
influence part of nature’s randomness.
Finally, we consider actions in the framework of source coding. Tools from channel coding and distributed source coding are combined to establish the fundamental limit of compression with actions.
Acknowledgements
The five years I spent at Stanford pursuing my Ph.D. have been a very pleasant and fulfilling journey, and it was my advisor, Thomas Cover, who made it possible. His weekly round-robin group meeting was the best place for research discussion and was also full of interesting puzzles and stories. He revealed the pearls of information theory as well as statistics through beautiful examples, and always encouraged me on every small finding I obtained. It was a privilege to work with him, and I thank him for his support and guidance.
I am also truly grateful to Professor Tsachy Weissman, who taught me amazing
universal schemes in information theory and was always willing to let me do “random
drawing” on his white boards. I really like his way of asking have-we-convinced-
ourselves questions, which often led to surprisingly simple yet insightful discoveries.
Professor Abbas El Gamal has had a great influence on me, and I would like to extend my sincere thanks to him. His broad knowledge of network information theory and his teaching of EE478 were invaluable to my research.
I would like to thank my colleagues at Stanford, especially, Himanshu Asnani,
Bernd Bandemer, Yeow Khiang Chia, Paul Cuff, Shirin Jalali, Gowtham Kumar,
Vinith Misra, Alexandros Manolakos, Taesup Moon, Albert No, Idoia Ochoa, Haim
Permuter, Han-I Su, and Kartik Venkat.
Last but not least, I am grateful to my family. I thank my parents for their
constant support and love. I thank my wife for her love, and for completing my life.
Contents

Preface
Acknowledgements

1 Introduction

2 Hirschfeld-Gebelein-Renyi maximal correlation
  2.1 HGR correlation
  2.2 Examples
    2.2.1 Doubly symmetric binary source
    2.2.2 Z-channel with Bern(1/2) input
    2.2.3 Erasure channel

3 Common randomness generation
  3.1 Common randomness and efficiency
    3.1.1 Common randomness and common information
    3.1.2 Continuity at R = 0
    3.1.3 Initial efficiency (R ↓ 0)
    3.1.4 Efficiency at R ↑ H(X|Y) (saturation efficiency)
  3.2 Examples
    3.2.1 DBSC(p) example
    3.2.2 Gaussian example
    3.2.3 Erasure example
  3.3 Extensions
    3.3.1 CR per unit cost
    3.3.2 Secret key generation
    3.3.3 Non-degenerate V
    3.3.4 Broadcast setting

4 Common randomness generation with actions
  4.1 Common randomness with action
  4.2 Example
  4.3 Efficiency
    4.3.1 Initial efficiency
    4.3.2 Saturation efficiency
  4.4 Extensions

5 Compression with actions
  5.1 Introduction
  5.2 Definitions
    5.2.1 Lossless case
    5.2.2 Lossy case
    5.2.3 Causal observations of state sequence
  5.3 Lossless case
    5.3.1 Lossless, noncausal compression with action
    5.3.2 Lossless, causal compression with action
    5.3.3 Examples
  5.4 Lossy compression with actions

6 Conclusions

A Proofs of Chapter 2
  A.1 Proof of the convexity of ρ(P_X ⊗ P_{Y|X}) in P_{Y|X}

B Proofs of Chapter 3
  B.1 Proof of the continuity of C(R) at R = 0

C Proofs of Chapter 4
  C.1 Converse proof of Theorem 5
  C.2 Proof for initial efficiency with actions
  C.3 Proof of Lemma 5

D Proofs of Chapter 5
  D.1 Proof of Lemma 6
  D.2 Proof of Lemma 7

Bibliography
List of Tables

List of Figures

1.1 Generate common randomness: K = K(X^n), K' = K'(Y^n) satisfying P(K = K') → 1 as n → ∞. What is the maximum common randomness per symbol, i.e., what is sup (1/n)H(K)?
2.1 ρ²(X;Y) as a function of θ for the first P_{Y|X}
2.2 ρ²(X;Y) as a function of θ for the second P_{Y|X}
2.3 X ∼ Bern(1/2)
2.4 ρ(X;Y) = 1 − 2 min{p, 1 − p}
2.5 Z-channel
2.6 ρ(X;Y) = √((1 − p)/(1 + p))
2.7 Erasure channel
3.1 Common randomness capacity: (X_i, Y_i) are i.i.d. Node 1 generates a r.v. K based on the X^n sequence it observes. It also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 generates a r.v. K' based on the Y^n sequence it observes and M. We require that P(K = K') approach 1 as n goes to infinity. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
3.2 The probability structure of U_n
3.3 DBSC example: X ∼ Bern(1/2), p_{Y|X}(x|x) = 1 − p, p_{Y|X}(1 − x|x) = p
3.4 C(R) for p = 0.08
3.5 Gaussian example
3.6 Auxiliary r.v. U in the Gaussian example
3.7 Gaussian example: C(R) for N = 0.5
3.8 Erasure example
3.9 Erasure example: C–R curve
3.10 Common randomness per unit cost
3.11 Secret key generation
3.12 CR broadcast setup
4.1 Common randomness capacity with actions: {X_i} is an i.i.d. source. Node 1 generates a r.v. K based on the X^n sequence it observes. It also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 first generates an action sequence A^n as a function of M and receives a sequence of side information Y^n, where Y^n | (A^n, X^n) ∼ p(y|a, x). Node 2 then generates a r.v. K' based on M and the Y^n sequence it observes. We require P(K = K') to be close to 1. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
4.2 CR with action example
4.3 Correlate A with X
4.4 CR with action example: option one, set A ⊥ X; option two, correlate A with X
5.1 Compression with actions. The action encoder first observes the state sequence S^n and then generates an action sequence A^n. The ith output Y_i is the output of a channel p(y|a, s) when a = A_i and s = S_i. The compressor generates a description M of 2^{nR} bits to describe Y^n. The remote decoder generates a reconstruction of Y^n based on M and its available side information Z^n
5.2 Binary example with side information Z = ∅
5.3 The threshold b* solves (H2(b) − H2(p))/b = dH2(b)/db, b ∈ [0, 1/2]
5.4 Comparison between the non-causal and causal rate-cost functions; the parameter of the Bernoulli noise is set at 0.1
Chapter 1
Introduction
Given a pair of random variables (X, Y ) with joint distribution p(x, y), what do they
have in common? Different quantities can be justified as the right measure of “common” in different settings. For example, in linear estimation, correlation determines the minimum mean square error (MMSE) when we use one random variable to estimate the other, and the MMSE suggests that the larger the absolute value of the correlation, the more “commonness” X and Y have. In information theory, insight
about p(x, y) can often be gained when independent and identically distributed (i.i.d.) copies (X_i, Y_i), i = 1, ..., n, are considered. In source coding with side information, the celebrated Slepian-Wolf theorem [21] shows that when compressing {X_i}_{i=1}^n losslessly, the rate reduction from having side information {Y_i}_{i=1}^n is the mutual information I(X;Y) between X and Y. It makes a lot of sense that a large rate reduction suggests
a lot in common between X and Y , which indicates that I(X ; Y ) is a good measure.
A more direct attempt to address commonness was first considered by Gacs and Korner in [10]. In their setting, illustrated in Fig. 1.1, nature generates (X^n, Y^n) ∼ i.i.d. p(x, y). Node 1 observes X^n, and Node 2 observes Y^n. The task is for the two
nodes to generate common randomness (CR), i.e., a random variable K in common.
The entropy of the common random variable is the number of common bits generated from nature's resource at either node. The supremum of the normalized entropy, (1/n)H(K), is defined as the common information between X and Y. It would be an
extremely interesting measure of commonness if not for the fact that it is zero for a
Figure 1.1: Generate common randomness: K = K(X^n), K' = K'(Y^n) satisfying P(K = K') → 1 as n → ∞. What is the maximum common randomness per symbol, i.e., what is sup (1/n)H(K)?
large class of joint distributions. Witsenhausen [28] used Hirschfeld-Gebelein-Renyi
maximal correlation (HGR correlation) to sharpen the result by Gacs and Korner.
Surprisingly, if the HGR correlation between X and Y is strictly less than 1, not a
single bit in common can be generated by the two nodes.
In this thesis, we investigate the role of HGR correlation in common randomness
generation with a rate-limited communication link between Node 1 and Node 2, with
and without actions. In particular, we link the HGR correlation with initial efficiency,
i.e., the initial rate of common randomness unlocked by communication, thus giving
an operational justification of using HGR correlation as a measure of commonness.
Furthermore, we extend common randomness generation to the setting where one
node can take actions to affect the side information. A single letter expression for
common randomness capacity is obtained, based on which the initial efficiency and
saturation efficiency are derived. The maximum HGR correlation conditioned on a
fixed action determines the initial efficiency.
In the last chapter we consider the problem of compression with actions. While
traditionally in source coding, nature fixes the source distribution, in our setting, we
introduce the idea of using actions to affect nature’s source.
Notation: We use capital letter X to denote a random variable, small letter x
to denote the corresponding realization, calligraphic letter X to denote the alphabet
of X , and |X | to denote the cardinality of the alphabet. The subscripts in joint
distributions are mostly omitted; for example, p_{XY}(x, y) is written as p(x, y). To emphasize the probability structure, we sometimes write the joint distribution as P_X ⊗ P_{Y|X}, where P_X is the marginal of X and P_{Y|X} is the conditional distribution of Y given X. We use X ⊥ Y to indicate that X and Y are independent, and X − Y − Z to indicate that X and Z are conditionally independent given Y. Subscripts and superscripts are used in the standard way: X^n = (X_1, ..., X_n) and X_i^j = (X_i, ..., X_j). Most of the notation follows [8].
Chapter 2
Hirschfeld-Gebelein-Renyi
maximal correlation
2.1 HGR correlation
We focus on random variables with finite alphabet.
Definition 1. The HGR correlation [12, 14, 18] between two random variables (r.v.'s) X and Y, denoted ρ(X;Y), is defined as

ρ(X;Y) = max E[g(X)f(Y)]   (2.1)
subject to E[g(X)] = 0, E[f(Y)] = 0,
E[g²(X)] ≤ 1, E[f²(Y)] ≤ 1.

If neither X nor Y is degenerate, i.e., a constant, then the inequalities in the constraints can be replaced by equalities. An equivalent characterization was proved by Renyi in [18]:

ρ²(X;Y) = sup_{E[g(Y)]=0, E[g²(Y)]≤1} E[E²(g(Y)|X)].   (2.2)
Note that the HGR correlation is a function of the joint distribution p(x, y) alone and does not depend on the particular values that X and Y take. We sometimes use ρ(p(x, y)) or ρ(P_X ⊗ P_{Y|X})
to emphasize the joint probability distribution. The HGR correlation shares quite a few properties with mutual information I(X;Y):
• Positivity [18]: 0 ≤ ρ(X ; Y ) ≤ 1
◦ ρ(X ; Y ) = 0 iff X⊥Y .
◦ ρ(X ; Y ) = 1 iff there exists a non-degenerate random variable V such that
V is both a function of X and a function of Y .
• Data processing inequality: If X, Y, and Z form a Markov chain X − Y − Z, then ρ(X;Y) = ρ(X;Y,Z) ≥ ρ(X;Z).
Proof. Consider any function g such that E[g(X)] = 0 and E[g²(X)] = 1. By the Markov chain X − Y − Z, E[E²(g(X)|Y)] = E[E²(g(X)|Y, Z)]. Thus, using the alternative characterization (2.2), we have

ρ(X;Y) = ρ(X;Y,Z) ≥ ρ(X;Z).
• Convexity: For fixed P_X, ρ²(P_X ⊗ P_{Y|X}) is convex in P_{Y|X}.

Proof. Consider r.v.'s X, Y_1, Y_2. Let 0 < λ < 1, let Θ = 1 w.p. λ and Θ = 2 w.p. 1 − λ, where Θ is independent of (X, Y_1, Y_2), and let Y = Y_Θ. We have

ρ²(X; Y_Θ) ≤ ρ²(X; Y_Θ, Θ) ≤ λ ρ²(X; Y_1) + (1 − λ) ρ²(X; Y_2),

where the last inequality comes from the following lemma.

Lemma 1. Assume X ⊥ Z, where Z has a finite alphabet Z, and let ρ(X; Y, Z) be the Renyi correlation between X and (Y, Z). Then

ρ²(X; Y, Z) ≤ Σ_z P_Z(z) ρ²(X; Y | Z = z),   (2.3)

where ρ(X; Y | Z = z) = ρ(P_X ⊗ P_{Y|X, Z=z}).
Proof. See Appendix A.1.
However, we note here that ρ²(P_X ⊗ P_{Y|X}) is not concave in P_X for fixed P_{Y|X}, which differs from mutual information. We provide a numerical example. Consider P_1 = [1/2, 1/4, 1/4]^T and P_2 = [1/3, 1/3, 1/3]^T, and let P_θ = θP_1 + (1 − θ)P_2. We show plots of ρ²(P_θ ⊗ P_{Y|X}) as a function of θ for two different P_{Y|X} matrices in the following figures:
Figure 2.1: ρ²(X;Y) as a function of θ for the first P_{Y|X}.
Figure 2.2: ρ²(X;Y) as a function of θ for the second P_{Y|X}.

For Figure 2.1,

P_{Y|X} = [ 0.0590 0.4734 0.4677
            0.3252 0.2415 0.4333
            0.1778 0.6230 0.1992 ],

and for Figure 2.2,

P_{Y|X} = [ 0.3162 0.6139 0.0699
            0.6351 0.2702 0.0948
            0.5519 0.3570 0.0911 ].
2.2 Examples
In this section, we calculate the HGR correlation for a few simple examples.
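Before working the examples by hand, note that for finite alphabets ρ(X;Y) admits a simple numerical computation via a standard fact (due to Witsenhausen, not restated in this chapter): ρ(X;Y) equals the second-largest singular value of the matrix B(x, y) = p(x, y)/√(p(x)p(y)), whose largest singular value is always 1. A minimal sketch in Python (the function name `hgr` is ours; numpy assumed available):

```python
import numpy as np

def hgr(P):
    """HGR maximal correlation of a finite joint pmf P[x, y] = p(x, y).

    rho(X;Y) is the second-largest singular value of
    B[x, y] = p(x, y) / sqrt(p(x) p(y)); the largest is always 1.
    """
    P = np.asarray(P, dtype=float)
    px, py = P.sum(axis=1), P.sum(axis=0)         # marginals of X and Y
    B = P / np.sqrt(np.outer(px, py))
    return np.linalg.svd(B, compute_uv=False)[1]  # values sorted descending

# Doubly symmetric binary source (Section 2.2.1): rho = 1 - 2 min{p, 1-p}
p = 0.1
dsbs = 0.5 * np.array([[1 - p, p], [p, 1 - p]])
print(hgr(dsbs))                                  # ≈ 0.8 (= 1 - 2p)

# Z-channel with Bern(1/2) input (Section 2.2.2): rho = sqrt((1-p)/(1+p))
zch = 0.5 * np.array([[1.0, 0.0], [p, 1 - p]])
print(hgr(zch), np.sqrt((1 - p) / (1 + p)))       # the two agree
```

The output matches the closed forms derived below.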
2.2.1 Doubly symmetric binary source
Let X be a Bern(1/2) r.v. and let Y be the output of a binary symmetric channel with crossover probability p and input X, as shown in Fig. 2.3. Since X and Y are binary, the HGR correlation can be easily computed as [9]

ρ(X;Y) = 1 − 2 min{p, 1 − p}.

When p = 1/2, X and Y are independent and ρ = 0; when p = 0 or p = 1, X and Y are essentially identical and ρ achieves its maximum value 1. These values agree with one's intuition about a commonness measure on X and Y.

Figure 2.3: Doubly symmetric binary source: X ∼ Bern(1/2) through a BSC(p).
Figure 2.4: ρ(X;Y) = 1 − 2 min{p, 1 − p} as a function of p.
2.2.2 Z-Channel with Bern(1/2) input
Let X be a Bern(1/2) r.v., and let X and Y be related through the Z-channel with parameter p shown in Fig. 2.5. The HGR correlation can be computed as

ρ(X;Y) = √((1 − p)/(1 + p)).

Note that ρ²(X;Y) = −1 + 2/(1 + p), which is a convex function of p.

Figure 2.5: Z-channel: p_{Y|X}(0|0) = 1, p_{Y|X}(0|1) = p, p_{Y|X}(1|1) = 1 − p.
Figure 2.6: ρ(X;Y) = √((1 − p)/(1 + p)) as a function of p.
2.2.3 Erasure Channel
Let (X, Y) be a pair of random variables with general joint distribution p(x, y), and let Ỹ be an erased version of Y with erasure probability q, as shown in Fig. 2.7.

Figure 2.7: Erasure channel: Ỹ = Y w.p. 1 − q and Ỹ = e w.p. q.

It is known that an erasure erases a fraction q of the information between X and Y, i.e., I(X;Ỹ) = (1 − q)I(X;Y) [4]. Interestingly, a similar property holds for the HGR correlation (the squared HGR correlation, to be precise), as proved in the following lemma:

Lemma 2.

ρ²(X;Ỹ) = (1 − q)ρ²(X;Y).
Proof. If either X or Y is degenerate, the claim is trivial; thus, we only consider the case where neither X nor Y is degenerate. Define Θ = 1{Ỹ = e}, and note that Θ is independent of X. For any g and f such that E[g(X)] = 0, E[g²(X)] = 1, E[f(Ỹ)] = 0, and E[f²(Ỹ)] = 1, we have

E[g(X)f(Ỹ)]
= E_Θ E[g(X)f(Ỹ)|Θ]
= q E[g(X)f(e)|Θ = 1] + (1 − q) E[g(X)f(Y)|Θ = 0]
(a)= q f(e) E[g(X)] + (1 − q) E[g(X)f(Y)]
(b)= (1 − q) E[g(X)f(Y)]
= (1 − q) E[g(X)] E[f(Y)] + (1 − q) E[(g(X) − E[g(X)])(f(Y) − E[f(Y)])]
(c)= (1 − q) E[(g(X) − E[g(X)])(f(Y) − E[f(Y)])]
(d)≤ (1 − q) √(Var(g(X)) Var(f(Y))) ρ(X;Y)
(e)≤ (1 − q) √1 √(1/(1 − q)) ρ(X;Y)
= √(1 − q) ρ(X;Y),

where (a) is due to the independence of Θ and X; (b) and (c) are due to the fact E[g(X)] = 0; (d) comes from the definition of the HGR correlation between X and Y; and (e) is because

1 = E[f²(Ỹ)]
= E_Θ E[f²(Ỹ)|Θ]
= q f²(e) + (1 − q) E[f²(Y)]
≥ (1 − q) E[f²(Y)]
≥ (1 − q) Var(f(Y)).

Equality is achieved by setting g = g* and f(ỹ) = (1/√(1 − q)) f*(ỹ) for ỹ ∈ Y and f(e) = 0, where g* and f* are the functions achieving the HGR correlation between X and Y. Thus ρ²(X;Ỹ) = (1 − q)ρ²(X;Y).
There are some other interesting non-trivial examples in the literature:
• Gaussian [16]: If (X, Y) are jointly Gaussian with correlation coefficient r, then ρ(X;Y) = |r|.

• Partial sums [7]: Let Y_i, i = 1, ..., n, be i.i.d. r.v.'s with finite variance, and let S_k = Σ_{i=1}^k Y_i, k = 1, ..., n, be the partial sums. Then we have

ρ(S_k; S_n) = √(k/n).
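The partial-sums result can likewise be checked numerically for a concrete i.i.d. sequence. The sketch below (our own construction) takes Y_i ∼ Bern(t), builds the exact joint pmf of (S_k, S_n) from the independence of S_k and S_n − S_k, and computes the maximal correlation as the second-largest singular value of p(a, b)/√(p(a)p(b)):

```python
import numpy as np
from math import comb

def hgr(P):
    """Maximal correlation from a joint pmf matrix."""
    px, py = P.sum(axis=1), P.sum(axis=0)
    s = np.linalg.svd(P / np.sqrt(np.outer(px, py)), compute_uv=False)
    return s[1]

def binom_pmf(m, t):
    """pmf of Binomial(m, t) as a length-(m+1) array."""
    return np.array([comb(m, i) * t**i * (1 - t) ** (m - i) for i in range(m + 1)])

k, n, t = 3, 7, 0.4
head, tail = binom_pmf(k, t), binom_pmf(n - k, t)

# P[a, b] = P(S_k = a, S_n = b) = P(S_k = a) P(S_n - S_k = b - a)
P = np.zeros((k + 1, n + 1))
for a in range(k + 1):
    P[a, a : a + n - k + 1] = head[a] * tail

print(hgr(P), np.sqrt(k / n))   # both ≈ 0.6547
```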
Chapter 3
Common randomness generation
3.1 Common randomness and efficiency
If the HGR correlation between X and Y is close to 1, intuitively there is a lot
in common between the two. The obstacle of generating common randomness in
Fig. 1.1 is the lack of communication between the two nodes. It turns out that a
communication link can greatly facilitate common randomness generation.
This setting was first considered by Ahlswede and Csiszar [2]. The system has two
nodes. Node 1 observes Xn and Node 2 observes Y n, where (Xn, Y n) ∼ i.i.d. p(x, y).
An (n, 2^{nR}) scheme, shown in Fig. 3.1, consists of

• a message encoding function f_m: X^n → [1 : 2^{nR}], M = f_m(X^n),
• a CR encoding function at Node 1, f_1: K = f_1(X^n),
• a CR encoding function at Node 2, f_2: K' = f_2(M, Y^n).
Definition 2. A common randomness-rate pair (C,R) is said to be achievable if there
exists a sequence of schemes at rate R such that
• P(K = K') → 1 as n → ∞,
• lim inf_{n→∞} (1/n) H(K) ≥ C,
• lim sup_{n→∞} (1/n) H(K|K') = 0.¹
In words, the entropy of K measures the amount of common randomness those
two nodes can generate.
Figure 3.1: Common randomness capacity: (X_i, Y_i) are i.i.d. Node 1 generates a r.v. K based on the X^n sequence it observes; it also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 generates a r.v. K' based on the Y^n sequence it observes and M. We require that P(K = K') approach 1 as n goes to infinity. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
Definition 3. The supremum of all the common randomness achievable at rate R is
defined as the common randomness capacity at rate R. That is
C(R) = sup{C : (C, R) is achievable}.
Theorem 1. [2] The common randomness capacity at rate R is

C(R) = max_{p(u|x): R ≥ I(X;U) − I(Y;U)} I(X;U).   (3.1)

If private randomness generation is allowed at Node 1, then

C(R) = { max_{p(u|x): I(X;U) − I(Y;U) ≤ R} I(X;U),   R ≤ H(X|Y);
       { R + I(X;Y),                                  R > H(X|Y).   (3.2)
¹This is a technical condition to constrain the cardinality of K. Mathematically, the condition guarantees that the converse proof works out.
The C(R) curve is thus a straight line with slope 1 for R > H(X|Y ). Although we
focus on 0 ≤ R ≤ H(X|Y ) in this thesis, in most of the figures we plot the straight
line part for completeness.
We note here that computing C(R) is closely related to the information bottleneck method developed in [25]. The use of common randomness in generating coordinated actions is discussed in detail in [6].
3.1.1 Common randomness and common information
Let us clarify the relation between common randomness and common information.
Definition 4. [10] The maximum common r.v. V between X and Y satisfies:
• There exist functions g and f such that V = g(X) = f(Y).
• For any V ′ such that V ′ = g′(X) = f ′(Y ) for some deterministic functions f ′
and g′, V ′ is a function of V .
Definition 5. [10] The common information between X and Y is defined as H(V )
where V is the maximum common r.v. between X and Y .
It turns out that common information is equal to common randomness at rate 0,
i.e., H(V ) = C(0) [5].
Lemma 3. It is without loss of optimality to assume that V is a function of U when optimizing max_{p(u|x): R ≥ I(X;U) − I(Y;U)} I(X;U).
Proof. For any U such that I(X;U) − I(Y;U) ≤ R and U − X − Y hold, we can construct a new auxiliary r.v. U' = (U, V). Note that:

• The Markov chain U' − X − Y holds.

• The rate constraint is preserved:

I(X;U') − I(Y;U')
= I(X;U,V) − I(Y;U,V)
= I(X;U) + I(X;V|U) − I(Y;U) − I(Y;V|U)
= I(X;U) + H(V|U) − I(Y;U) − H(V|U)
= I(X;U) − I(Y;U)
≤ R,

where the third equality holds because V is a function of X and also a function of Y, so I(X;V|U) = H(V|U) and I(Y;V|U) = H(V|U).

• The common randomness generated does not decrease: I(X;U') ≥ I(X;U).

Thus using U' as the new auxiliary r.v. preserves the rate and does not decrease the common randomness.
3.1.2 Continuity at R = 0
The C(R) curve is concave for R ≥ 0 and thus continuous for R > 0. The following theorem establishes the continuity at R = 0.
Theorem 2. The common randomness capacity as a function of the communication rate R is continuous at R = 0, i.e., lim_{R↓0} C(R) = C(0).

Proof. See Appendix B.1.
The value of C(0) is equal to the common information defined in [10]. We note
here that C(0) > 0 if and only if ρ(X ; Y ) = 1.
3.1.3 Initial Efficiency (R ↓ 0)
If the commonness between X and Y is large, then it is natural to expect that the
first few bits of communication should be able to unlock a huge amount of common
randomness. It is indeed the case as shown in the following theorem. Furthermore, the
HGR correlation ρ plays the key role in the characterization of the initial efficiency.
Theorem 3. The initial efficiency of common randomness generation is characterized as

lim_{R↓0} C(R)/R = 1/(1 − ρ²(X;Y)).
In words, the initial efficiency is the number of bits of common randomness unlocked per bit of communication as the rate goes to zero.
Comments:
• Since ρ(X ; Y ) = ρ(Y ;X), the slope is symmetric in X and Y . Thus if we reverse
the direction of the communication link, i.e., the message is sent from Node 2
to Node 1 in Fig. 3.1, the initial efficiency remains the same.
• The initial efficiency increases with the HGR correlation ρ between X and Y .
Without communication, as long as ρ < 1, the common randomness capacity is
0. But with communication, the first few bits can “unlock” a huge amount of
common randomness if ρ(X ; Y ) is close to 1.
Proof. If ρ(X;Y) = 1, then C(0) > 0, which yields an infinite slope. For the case ρ(X;Y) < 1, we have

lim_{R↓0} C(R)/R
(a)= sup_{p(u|x)} I(X;U) / (I(X;U) − I(Y;U))   (3.3)
(b)= 1 / (1 − sup_{p(u|x)} I(Y;U)/I(X;U))
(c)= 1 / (1 − ρ²(X;Y)),

where (a) comes from the fact that C(R) is a concave function; (b) holds because 1/(1 − x) is monotonically increasing for x ∈ [0, 1); and (c) comes from the following lemma [9].

Lemma 4. [9] sup_{p(u|x)} I(Y;U)/I(X;U) = ρ²(X;Y).
3.1.4 Efficiency at R ↑ H(X|Y ) (saturation efficiency)
At R = H(X|Y), C(R) reaches its maximum value H(X).² That is the point where X^n is losslessly known at Node 2; in other words, nature's resource is exhausted by the system. It is of interest to check the slope of C(R) as R goes up to H(X|Y).
A natural guess is 1, since one pure random bit (which is independent of nature’s
(Xn, Y n)) sent over the communication link can yield 1 bit in common between the
two nodes. As shown in the erasure example in the next section, this guess is not
correct in general. Here, we provide a sufficient condition for the saturation slope to
be 1.
Theorem 4. The efficiency of common randomness generation at R = H(X|Y) is 1 if there exist x1, x2 ∈ X such that for all y ∈ Y, p(x1, y) > 0 implies p(x2, y) > 0.³
Proof. We have

lim_{R↑H(X|Y)} (C(H(X|Y)) − C(R)) / (H(X|Y) − R)   (3.4)
(a)= inf_{p(u|x)} (C(H(X|Y)) − I(X;U)) / (H(X|Y) − (I(X;U) − I(Y;U)))
(b)= inf_{p(u|x)} (H(X) − I(X;U)) / (H(X|Y) − (I(X;U) − I(Y;U)))
= inf_{p(u|x)} H(X|U) / (H(X|U) − (H(Y|U) − H(Y|X)))
= inf_{p(u|x)} 1 / (1 − (H(Y|U) − H(Y|X))/H(X|U))
(c)= 1 / (1 − inf_{p(u|x)} (H(Y|U) − H(Y|X))/H(X|U)),

where (a) comes from the concavity of C(R); (b) holds because C(H(X|Y)) = H(X); and (c) follows from the monotonicity of 1/(1 − x) for x ∈ [0, 1).
²If private randomness is allowed, C(R) is a straight line with slope 1 for R > H(X|Y) [2]. The result in this section thus gives a sufficient condition for the slope at R = H(X|Y) to be continuous.
³If two input letters have the same conditional distribution p(y|x), we view them as one letter; letters with zero probability are discarded.
The next step is to show that inf_{p(u|x)} (H(Y|U) − H(Y|X))/H(X|U) = 0 under the condition given in the theorem. First note that H(Y|U) − H(Y|X) ≥ 0 because of U − X − Y, so the infimum is at least 0. Without loss of generality, assume X = {1, 2, ..., M}, Y = {1, 2, ..., N}, and that p(x = 1, y) > 0 implies p(x = 2, y) > 0.

Choose a sequence of positive numbers ε_n converging to 0, and construct a sequence of r.v.'s U_n with alphabet {1, ..., M} such that

• P_{U_n}(1) = P_X(1) − (ε_n/(1 − ε_n)) P_X(2),
• P_{U_n}(2) = (1/(1 − ε_n)) P_X(2),
• P_{U_n}(u) = P_X(u), u = 3, ..., M,

with the channel from U_n to X illustrated in Fig. 3.2: U_n = 1 maps to X = 1 with probability 1, U_n = 2 maps to X = 1 with probability ε_n and to X = 2 with probability 1 − ε_n, and U_n = u maps to X = u for u ≥ 3. These are valid joint distributions because the marginal distribution of X is preserved.
Figure 3.2: The probability structure of U_n.
As n goes to infinity, it can be shown that

• the denominator vanishes as H(X|U_n) = Θ(−ε_n log ε_n);
• the numerator vanishes linearly: H(Y|U_n) − H(Y|X) = Θ(ε_n).

Thus lim_{n→∞} (H(Y|U_n) − H(Y|X)) / H(X|U_n) = 0, which completes the proof.
For convenience, we introduce saturation efficiency in the following way:
Definition 6. The slope of C(R) when R approaches Rm from below is defined as
the saturation efficiency, where Rm is the threshold such that C(Rm) = H(X).
3.2 Examples
3.2.1 DBSC(p) example
Let X be a Bernoulli(1/2) random variable and let Y be the output of a BSC with crossover probability p < 1/2 and input X, as shown in Fig. 3.3.

Figure 3.3: DBSC example: X ∼ Bern(1/2), p_{Y|X}(x|x) = 1 − p, p_{Y|X}(1 − x|x) = p.
Write H(X|U) = H2(α) for some α ∈ [0, 1/2]. Mrs. Gerber's Lemma [29] provides the following lower bound on H(Y|U):

H(Y|U) ≥ H2(H2⁻¹(H(X|U)) ∗ p) = H2(α ∗ p),   (3.5)

where α ∗ p = α(1 − p) + (1 − α)p. Thus

I(X;U) = H(X) − H(X|U) = 1 − H2(α),
I(X;U) − I(Y;U) = H(X) − H(X|U) − H(Y) + H(Y|U)
= H(Y|U) − H(X|U)
≥ H2(α ∗ p) − H2(α).

Equality is achieved by taking U to be the output of a BSC(α) with input X, i.e., U = X w.p. 1 − α and U = 1 − X w.p. α.
We can write C(R) in parametric form:
C = 1−H2(α) (3.6)
R = H2(α ∗ p)−H2(α), (3.7)
for α ∈ [0, 1/2]. Fig. 3.4 shows C(R) for p = 0.08.
Figure 3.4: C(R) for p = 0.08.
The initial efficiency is

lim_{R↓0} C(R)/R = lim_{α↑1/2} (1 − H2(α)) / (H2(α ∗ p) − H2(α)) (3.8)

= lim_{α↑1/2} −log2((1−α)/α) / [(1 − 2p) log2((1 − (1−2p)α − p)/((1−2p)α + p)) − log2((1−α)/α)]

= lim_{α↑1/2} (1/(1−α) + 1/α) / [(1 − 2p)(−(1−2p)/(1 − (1−2p)α − p) − (1−2p)/((1−2p)α + p)) + (1/(1−α) + 1/α)]

= 1/(1 − (1 − 2p)²),

where the last two equalities follow from L'Hôpital's rule. Note that the HGR correlation between X and Y is 1 − 2p, so this is consistent with the expression 1/(1 − ρ²(X;Y)).
The saturation efficiency, C′(R−) as R approaches H(X|Y), is

lim_{R↑H(X|Y)} (C(H(X|Y)) − C(R)) / (H(X|Y) − R)

= lim_{α↓0} (1 − (1 − H2(α))) / (H2(p) − (H2(α ∗ p) − H2(α)))

= lim_{α↓0} H2(α) / (H2(p) − H2(α ∗ p) + H2(α))

= lim_{α↓0} log2((1−α)/α) / [−(1 − 2p) log2((1 − (1−2p)α − p)/((1−2p)α + p)) + log2((1−α)/α)]

= lim_{α↓0} (log2 α)/(log2 α) = 1.
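Both limits can be checked numerically from the parametric form (3.6)–(3.7). The sketch below evaluates the slope of C(R) near the two endpoints; the sample values of α and the tolerances are arbitrary choices.

```python
import numpy as np

def H2(a):
    """Binary entropy in bits."""
    a = np.asarray(a, dtype=float)
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

def conv(a, p):
    """Binary convolution a * p = a(1-p) + (1-a)p."""
    return a * (1 - p) + (1 - a) * p

p = 0.08

# Initial efficiency: alpha near 1/2 gives R near 0.
alpha = 0.5 - 1e-4
init_eff = (1 - H2(alpha)) / (H2(conv(alpha, p)) - H2(alpha))
print(init_eff, 1 / (1 - (1 - 2 * p) ** 2))   # both ~ 3.3967

# Saturation efficiency: slope of C(R) as R approaches H(X|Y) = H2(p) from below.
a1, a2 = 1e-4, 2e-4   # alpha near 0; convergence to the limit 1 is only logarithmic
C = lambda a: 1 - H2(a)
R = lambda a: H2(conv(a, p)) - H2(a)
slope = (C(a2) - C(a1)) / (R(a2) - R(a1))
print(slope)          # > 1, approaching 1 slowly as alpha -> 0
```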
3.2.2 Gaussian example
Although we mainly consider discrete random variables with finite alphabet, the
results can be extended to continuous random variables as well. In this section, we
consider a Gaussian example. Let Y = X + Z, where X ∼ N (0, 1), Z ∼ N (0, N),
and X and Z are independent, illustrated in Fig. 3.5.
Figure 3.5: Gaussian example: Y = X + Z, with X ∼ N(0, 1) and Z ∼ N(0, N) independent.

Let h(X|U) = (1/2) log2(2πeα) for some 0 < α ≤ 1. The entropy power inequality [19]
gives the following lower bound on h(Y|U):

h(Y|U) ≥ (1/2) log2(2^{2h(X|U)} + 2^{2h(Z|U)}) (3.9)
= (1/2) log2(2πeα + 2πeN)
= (1/2) log2(2πe(α + N)).
Equality can be achieved by X = U + U′, where U ⊥ U′, U ∼ N(0, 1 − α) and U′ ∼ N(0, α), as shown in Fig. 3.6.

Figure 3.6: Auxiliary r.v. U in the Gaussian example.
We write C(R) in parametric form:

C = −(1/2) log2 α, (3.10)
R = (1/2) log2((α + N)/((1 + N)α)), (3.11)

for α ∈ (0, 1]. Fig. 3.7 shows the case N = 0.5.
Figure 3.7: Gaussian example: C(R) for N = 0.5.

The initial efficiency is calculated in the following way:

lim_{R↓0} C(R)/R = lim_{α↑1} [−(1/2) log2 α] / [(1/2) log2((α + N)/((1 + N)α))] (3.12)
= lim_{α↑1} [−1/(2α)] / [1/(2(α + N)) − 1/(2α)]
= 1 + 1/N.
Note that the ordinary correlation between X and Y is 1/√(1 + N). For a pair of jointly
Gaussian random variables, the HGR correlation is equal to the ordinary correlation
[16]. One can use Theorem 3 to obtain the same expression.
The asymptotic saturation efficiency is

lim_{R↑∞} dC(R)/dR = lim_{α↓0} (dC/dα)/(dR/dα)
= lim_{α↓0} [−1/(2α)] / [1/(2(α + N)) − 1/(2α)]
= 1.
For continuous r.v.’s, nature’s randomness is not exhausted at any finite R. It is
always more efficient to generate common randomness from nature’s resources than
from communicating private randomness generated locally.
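As in the DBSC case, both Gaussian limits can be checked numerically from the parametric form (3.10)–(3.11); the sketch below uses N = 0.5 as in Fig. 3.7, with arbitrary sample points.

```python
import numpy as np

N = 0.5

def C_of(alpha):
    """C = -(1/2) log2(alpha), alpha in (0, 1]."""
    return -0.5 * np.log2(alpha)

def R_of(alpha):
    """R = (1/2) log2((alpha + N) / ((1 + N) alpha))."""
    return 0.5 * np.log2((alpha + N) / ((1 + N) * alpha))

# Initial efficiency: alpha -> 1 gives R -> 0, and the slope tends to 1 + 1/N.
a = 1 - 1e-6
print(C_of(a) / R_of(a), 1 + 1 / N)                    # both ~ 3.0

# Asymptotic saturation efficiency: dC/dR -> 1 as alpha -> 0 (R -> infinity).
a1, a2 = 1e-9, 1.001e-9
print((C_of(a2) - C_of(a1)) / (R_of(a2) - R_of(a1)))   # ~ 1.0
```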
3.2.3 Erasure example
Let Y be a randomly erased version of X, i.e.,

Y = { X, w.p. 1 − q; e, w.p. q, }

as shown in Fig. 3.8.

Figure 3.8: Erasure example.

For any U such that U − X − Y holds, I(Y;U) = (1 − q)I(X;U) and I(X;U) − I(Y;U) = qI(X;U). Thus C(R) = R/q for 0 ≤ R ≤ H(X|Y), where H(X|Y) = q log2 |X|, as shown in Fig. 3.9.
Figure 3.9: Erasure example: C − R curve (C reaches H(X) at R = qH(X)).
The initial efficiency is therefore 1/q. Since ρ(X;X) = 1 and Y is an erased version
of X, we have ρ(X;Y) = √(1 − q). Note that 1/q = 1/(1 − ρ²(X;Y)).
The saturation efficiency is lim_{R↑H(X|Y)} C(R)/R = 1/q, which is not equal to 1.
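The value ρ(X;Y) = √(1 − q) can be verified numerically using Witsenhausen's singular-value characterization of the HGR maximal correlation (a standard fact not stated in the text): ρ(X;Y) is the second-largest singular value of the matrix Q[x, y] = p(x, y)/√(p(x)p(y)). Taking X uniform on an alphabet of size 4 is an arbitrary choice.

```python
import numpy as np

def hgr(pXY):
    """HGR maximal correlation via Witsenhausen's SVD characterization:
    the second-largest singular value of Q[x,y] = p(x,y)/sqrt(p(x)p(y))."""
    pX = pXY.sum(axis=1)
    pY = pXY.sum(axis=0)
    Q = pXY / np.sqrt(np.outer(pX, pY))
    s = np.linalg.svd(Q, compute_uv=False)   # s[0] = 1 always
    return s[1]

q = 0.3                       # erasure probability
M = 4                         # |X| = 4, X uniform (arbitrary choice)
pXY = np.zeros((M, M + 1))    # last column of Y is the erasure symbol e
for x in range(M):
    pXY[x, x] = (1 - q) / M   # Y = X  w.p. 1 - q
    pXY[x, M] = q / M         # Y = e  w.p. q

print(hgr(pXY), np.sqrt(1 - q))       # both ~ 0.8367
# Consistency with the initial efficiency: 1/q = 1/(1 - rho^2).
print(1 / q, 1 / (1 - hgr(pXY) ** 2))
```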
3.3 Extensions
3.3.1 CR per unit cost
The communication link between Node 1 and Node 2 in Fig. 3.1 is a bit pipe, which
is essentially a noiseless channel. It turns out that the common randomness capacity
remains unchanged when we replace the bit pipe with a noisy channel with the same
capacity [2]. More interestingly, one may consider the case where the channel inputs
are subject to a cost constraint β. The initial efficiency of the channel capacity C as
a function of β, i.e., the capacity per unit cost, was found in the seminal paper [26]. The
initial efficiency of the overall system, illustrated in Fig. 3.10, is thus the product of the
initial efficiency of common randomness generation and the capacity per unit cost of the channel.
Figure 3.10: Common randomness per unit cost.
Corollary 1. The initial efficiency of the system in Fig. 3.10 (common randomness per unit cost)
is equal to

(1/(1 − ρ²(X;Y))) · lim_{β↓0} C(β)/β,

where C(β) is the capacity of the channel under input cost constraint β. We refer to [26] for the calculation of lim_{β↓0} C(β)/β.
3.3.2 Secret key generation
Common randomness generation is closely related to secret key generation [1]. Suppose
there is an eavesdropper listening to the communication link (Fig. 3.11). We
would like the common randomness generated by Node 1 and Node 2 to be kept secret
from the eavesdropper. One commonly used secrecy constraint is

lim sup_{n→∞} (1/n) I(M;K) = 0,

where M ∈ [1 : 2^{nR}] is the message Node 1 sends to Node 2.
Figure 3.11: Secret key generation: the eavesdropper observes the message M, and we require (1/n)I(M;K) ≤ ǫ.
The secret key capacity is shown in [1] to be C(R) = max_{p(u|x): I(X;U)−I(Y;U) ≤ R} I(Y;U).
We can calculate the initial efficiency of the secret key capacity in the following way:

lim_{R↓0} C(R)/R = sup I(Y;U) / (I(X;U) − I(Y;U))
= sup I(X;U)/(I(X;U) − I(Y;U)) − 1
= 1/(1 − ρ²(X;Y)) − 1,

which is the initial efficiency without the secrecy constraint minus one. This makes
sense, because the eavesdropper observes every bit Node 1 communicates to Node 2.
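For the DBSC(p) example of Section 3.2.1, this expression can be illustrated numerically. Plugging the BSC(α) auxiliary of that section into the secret key formula (an illustrative choice, not a claim of optimality) gives I(Y;U) = 1 − H2(α ∗ p) and R = H2(α ∗ p) − H2(α), and the ratio near α = 1/2 approaches 1/(1 − (1 − 2p)²) − 1.

```python
import numpy as np

def H2(a):
    """Binary entropy in bits."""
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

p = 0.08
rho2 = (1 - 2 * p) ** 2                     # squared HGR correlation of the DSBS
alpha = 0.5 - 1e-5                          # alpha near 1/2, so R near 0
ap = alpha * (1 - p) + (1 - alpha) * p      # binary convolution alpha * p
sk_eff = (1 - H2(ap)) / (H2(ap) - H2(alpha))
print(sk_eff, 1 / (1 - rho2) - 1)           # both ~ 2.3967
```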
3.3.3 Non-degenerate V
If the maximum common r.v. V is not a constant, then C(0) = H(V) > 0, and the slope of C(R) as R ↓ 0
(which differs from lim_{R↓0} C(R)/R) can be calculated in the following way:

lim_{R↓0} (C(R) − C(0))/R = sup (I(X;U) − H(V)) / (I(X;U) − I(Y;U))
(a)= sup (I(X;U,V) − H(V)) / (I(X;U,V) − I(Y;U,V))
= sup I(X;U|V) / (I(X;U|V) − I(Y;U|V))
= 1/(1 − sup I(Y;U|V)/I(X;U|V))
= 1/(1 − max_v ρ²(X;Y|V = v)),

where (a) is due to Lemma 3, and the third equality uses I(X;V) = I(Y;V) = H(V), which holds since V is a common function of X and of Y.
3.3.4 Broadcast setting
The common randomness generation setup can be generalized to multiple nodes.
A broadcast setting was considered in [2], shown in Fig. 3.12. The goal is for all
three nodes to generate a random variable K in common. The common randomness
capacity is proved in [2] to be

C(R) = max_{p(u|x): I(X;U) − I(Yi;U) ≤ R, i = 1, 2} I(X;U).

Figure 3.12: CR broadcast setup.
We provide a conjecture that deals with the initial efficiency in the broadcast setting:

Conjecture 1. The initial efficiency of the setup in Fig. 3.12 is

lim_{R↓0} C(R)/R = 1/(1 − Ψ²(X;Y,Z)),

where Ψ(X;Y,Z) is a modified HGR correlation between X and (Y,Z), defined in the
following way:

Ψ(X;Y,Z) = max min{E g(X)f(Y), E g(X)h(Z)},

where the maximization is among all functions g, f and h such that E g(X) = 0, E f(Y) = 0, E h(Z) = 0, E g²(X) ≤ 1, E f²(Y) ≤ 1, E h²(Z) ≤ 1.
Proof. Achievability: Similar to the HGR correlation, there is an alternative characterization of Ψ(X;Y,Z):

Ψ²(X;Y,Z) = max min{E(E[g(X)|Y])², E(E[g(X)|Z])²},

where the maximization is among all functions g such that E g(X) = 0, E g²(X) ≤ 1.
Applying the maximizer g∗(·) in the achievability scheme in [9], one can show that
the initial efficiency 1/(1 − Ψ²(X;Y,Z)) is achievable.
Chapter 4
Common randomness generation
with actions
4.1 Common randomness with action
Recently, in the line of work by Weissman et al. [27], action was introduced as a
feature that one node can exploit to boost the performance of lossy compression.
We adopt their setting but consider common randomness generation. The setup is
shown in Fig. 4.1. Compared with the no-action case, the key difference is that after
receiving the message M, Node 2 first generates an action sequence An based on M,
i.e., An = fa(M). It then gets the side information Y n according to p(y|x, a), i.e.,
Y n|(An, Xn) ∼ ∏_{i=1}^{n} p(yi|ai, xi). One scenario where this setting applies is that Node
2 requests side information from some data center through actions: the ith action
determines the type of the side information correlated with Xi that the data center
sends back to Node 2. Node 2 then generates K′ based on both the Y n sequence
it observes and the message M(Xn) ∈ [1 : 2^{nR}], i.e., K′ = f2(Y n, M). The common
randomness capacity at rate R is defined in the same way as in the no-action case.
CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 29
Figure 4.1: Common randomness generation with actions: {Xi}i=1,... is an i.i.d. source. Node 1 generates a r.v. K based on the Xn sequence it observes. It also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 first generates an action sequence An as a function of M and receives a sequence of side information Y n, where Y n|(An, Xn) ∼ p(y|a, x). Node 2 then generates a r.v. K′ based on both M and the Y n sequence it observes. We require P(K = K′) to be close to 1. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
Theorem 5. The common randomness capacity with actions at rate R is

C(R) = max I(X;A,U),

where the joint distribution is of the form p(x)p(a, u|x)p(y|a, x), and the maximization
is among all p(a, u|x) such that

I(X;A) + I(X;U|A) − I(Y;U|A) ≤ R.

The cardinality of U can be bounded by |U| ≤ |X||A| + 1.
Setting A = ∅, we recover the no-action result.
Achievability proof
Codebook generation
• Generate 2^{n(I(X;A)+ǫ)} sequences An(l1) according to ∏_{i=1}^{n} pA(ai), l1 ∈ [1 : 2^{n(I(X;A)+ǫ)}].
• For each An(l1) sequence, generate 2^{n(I(X;U|A)+ǫ)} sequences Un(l1, l2) according to ∏_{i=1}^{n} pU|A(ui|ai), l2 ∈ [1 : 2^{n(I(X;U|A)+ǫ)}].
• For each An(l1) sequence, partition the set of l2 indices into 2^{n(I(X;U|A,Y)+2ǫ)} equal-sized bins B(l3).
Encoding
For simplicity, we will assume that the encoder is allowed to randomize, but the
randomization can be readily absorbed into the codebook generation stage, and hence,
does not use up the encoder’s private randomization.
• Given xn, the encoder selects the index LA ∈ [1 : 2^{n(I(X;A)+ǫ)}] of the an(LA) sequence such that (xn, an(LA)) ∈ T(n)ǫ. If there is none, it selects an index uniformly at random from [1 : 2^{n(I(X;A)+ǫ)}]. If there is more than one such index, it selects an index uniformly at random from the set of indices l such that (xn, an(l)) ∈ T(n)ǫ.
• Given xn and the selected an(LA), the encoder then selects an index LU ∈ [1 : 2^{n(I(X;U|A)+ǫ)}] such that (xn, an(LA), un(LA, LU)) ∈ T(n)ǫ.
• The encoder sends out LA and the bin index LB ∈ [1 : 2^{n(I(X;U|A,Y)+2ǫ)}] such that LU ∈ B(LB).
Decoding
The decoder first takes actions based on the transmitted An(LA) sequence. Therefore,
Y n is generated according to Y n ∼ ∏_{i=1}^{n} p(yi|xi, ai(LA)). Given an and side information
yn, the decoder then tries to decode the LU index; that is, it looks for the unique LU
index in bin B(LB) such that (yn, an(LA), un(LA, LU)) ∈ T(n)ǫ. Finally, the decoder
declares (LA, LU) as the common indices.
Analysis of probability of error
The analysis of the probability of error follows standard arguments. An error occurs if
either of the following two events occurs.
1. (an(LA), un(LA, LU), Xn, Y n) /∈ T(n)ǫ.
2. There exists more than one LU ∈ B(LB) such that (Y n, an(LA), un(LA, LU)) ∈ T(n)ǫ.
The probability of the first error goes to zero as n goes to infinity since we generated
enough sequences to cover Xn in the codebook generation stage. The fact that the
probability of error for the second error event goes to zero as n → ∞ follows from
standard Wyner-Ziv analysis.
Analysis of common randomness rate
We analyze the common randomness rate averaged over codebooks.
H(LA, LU|C) = H(LA, LU, Xn|C) − H(Xn|C, LA, LU)
≥ H(Xn|C) − H(Xn|C, LA, LU, An(LA), Un(LA, LU))
≥ nH(X) − H(Xn|Un, An). (4.1)

The second step follows from the fact that Xn is independent of the codebook, and
the third step follows since conditioning reduces entropy. We now proceed to upper bound
H(Xn|Un, An). Define E := 1 if (Xn, Un, An) /∈ T(n)ǫ and E := 0 otherwise.
H(Xn|Un, An) ≤ H(Xn, E|Un, An)
≤ H(E) + H(Xn|E, Un, An)
≤ 1 + P(E = 0)H(Xn|E = 0, Un, An) + P(E = 1)H(Xn|E = 1, Un, An)
(a)≤ 1 + n(H(X|U,A) + δ(ǫ)) + nP(E = 1) log |X|
= n(H(X|U,A) + δ′(ǫ)). (4.2)
Here (a) follows from the fact that when E = 0, (Un, An, Xn) ∈ T(n)ǫ; hence, there are at
most 2^{n(H(X|U,A)+δ(ǫ))} possible Xn sequences. The last step follows from P(E = 1) → 0
as n → ∞, which in turn follows from the encoding scheme. Combining (4.1) with
(4.2) then gives the desired lower bound on the achievable common randomness rate:

(1/n) H(LA, LU|C) ≥ H(X) − H(X|U,A) − δ′(ǫ) = I(X;U,A) − δ′(ǫ).
Converse: See Appendix C.1.
4.2 Example
By correlating the action sequence An with Xn and communicating the action se-
quence with Node 2, we incur a communication rate cost I(X ;A). That only gen-
erates I(X ;A) in the rate of CR generation. Using 1 bit of communication to get
1 bit common randomness is of course sub-optimal, but the benefit comes in the
second stage where conditioned on the An sequence, Un is sent to Node 2. The com-
munication rate required is I(X ;U |A) − I(Y ;U |A) and the rate of CR generated is
I(X ;U |A).One greedy scheme is to simply fix the action and just repeat it (so there is no
need to communicate An). We use the following example to show explicitly that in
general this kind of scheme is suboptimal.
Let X be a r.v. uniformly distributed over the set {1, 2, 3, 4}. There are two
actions A = 1 and A = 2. The probability structure conditioned on each sequence is
shown in 4.2.
Lemma 5. For the setup in Fig. 4.2,

• setting A ⊥ X, the optimal achievable (C,R) pair is given by

C(R) = 3/2 + R/p, R ∈ [0, p/2];
Figure 4.2: CR with Action example: the channels p(y|x, a) for A = 1 and A = 2, each mapping X ∈ {1, 2, 3, 4} to Y ∈ {1, 2, 3, 4, e} with transition probabilities p and 1 − p.
• correlating A with X as shown in Fig. 4.3, the following (C,R) pair is achievable:

C(α) = 2 − α, R(α) = 1 − H2(α), α ∈ [0, 1/2].
Proof. See Appendix.

Figure 4.3: Correlate A with X: A indicates whether X ∈ {1, 2} or X ∈ {3, 4} through a BSC with crossover probability α.
It can be shown that the (C,R) pair achieved by setting A ⊥ X cannot be the
optimal one for all R in general. We illustrate this by a numerical example with p = 0.6,
with the results plotted in Fig. 4.4.
Figure 4.4: CR with action example: option one sets A ⊥ X; option two correlates A with X.
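The comparison in Fig. 4.4 can be reproduced directly from the two expressions in Lemma 5; the sketch below evaluates both options at two sample rates (arbitrary choices) for p = 0.6 and shows that neither dominates the other.

```python
import numpy as np

def H2(a):
    """Binary entropy in bits (clipped away from 0 and 1)."""
    a = np.clip(np.asarray(a, dtype=float), 1e-12, 1 - 1e-12)
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

p = 0.6

def C_opt1(R):
    """Option one (A independent of X): C = 3/2 + R/p, valid for R in [0, p/2]."""
    return 1.5 + R / p

def C_opt2(R):
    """Option two (A correlated with X): invert R = 1 - H2(alpha)
    numerically over alpha in [0, 1/2], then C = 2 - alpha."""
    grid = np.linspace(0.0, 0.5, 200001)
    alpha = grid[np.argmin(np.abs((1 - H2(grid)) - R))]
    return 2 - alpha

print(C_opt1(0.05), C_opt2(0.05))   # option two is better at small R
print(C_opt1(0.20), C_opt2(0.20))   # option one is better here
```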
4.3 Efficiency
4.3.1 Initial Efficiency
For simplicity, we assume that ρ(PX ⊗ PY|X,A=a) < 1 for all a ∈ A. Then

lim_{R↓0} C(R)/R = sup_{p(a,u|x)} I(X;A,U) / (I(X;A) + I(X;U|A) − I(Y;U|A)) (4.3)
= 1 / (1 − sup_{p(a,u|x)} I(Y;U|A)/(I(X;A) + I(X;U|A)))
= 1 / (1 − max_{a∈A} ρ²(X;Y|A = a)),

where ρ(X;Y|A = a) = ρ(PX ⊗ PY|X,A=a) and the last step is proved in Appendix C.2.
4.3.2 Saturation efficiency
Similar to the no-action case, when the communication rate reaches the threshold at which
Xn can be losslessly reconstructed at Node 2, nature's randomness Xn is exhausted
by the system, and the maximum CR H(X) (without private randomness) is achieved.
This threshold Rm can be computed as [27]:

Rm = min_{p(a|x)} I(X;A) + H(X|A, Y).
The following theorem considers the slope of CR generation as R ↑ Rm.

Theorem 6. If there exists a p(a|x) such that

• I(X;A) + H(X|A,Y) = Rm;
• for each action a with P(A = a) > 0, there exist x1, x2 ∈ X with P(X = x1|A = a) > 0 and P(X = x2|A = a) > 0 such that p(y, x1|A = a) > 0 implies p(y, x2|A = a) > 0 for all y ∈ Y;

then

lim_{R↑Rm} dC(R)/dR = 1.

Essentially, we require the condition of the no-action setting to hold for each active
action when R ↑ Rm.
4.4 Extensions
Theorem 5 extends to the case where there is a cost function Λ and a cost constraint
Γ on the action sequence, i.e., Λ(An) = (1/n) Σ_{i=1}^{n} Λ(Ai) ≤ Γ.

Corollary 2. The common randomness capacity with rate constraint R and cost
constraint Γ is

C(R,Γ) = max_{p(a,u|x): EΛ(A) ≤ Γ, R ≥ I(X;A) + I(X;U|A,Y)} I(X;A,U).
Proof. Simply note that the achievability proof and converse carry over to this setting
directly.
Theorem 5 also extends naturally to the case where there are multiple receivers
with different side information, as illustrated below.

(Figure: broadcast version of the action setting: Node 1 observes Xn and broadcasts a rate-R message; Nodes 2 and 3 take actions An and observe Y n1 and Y n2, respectively; the three nodes generate K1, K2 and K3.)
Corollary 3. The common randomness capacity with rate constraint R, two
receivers, and side information structure (Y n1, Y n2)|(Xn, An) ∼ i.i.d. p(y1, y2|x, a) is
given by

C(R) = max_{p(a,u|x): EΛ(A) ≤ Γ, R ≥ I(X;A) + I(X;U|A,Yi), i=1,2} I(X;A,U).

Because the action sequence of each node is a function of the same message that both
receive, one node knows the action sequence of the other node. Therefore, we do
not lose optimality by setting Ai = (A1i, A2i), where A1i and A2i are the individual
actions.
Proof. We may simply repeat the achievability proof for each receiver, and recognize
that the auxiliary random variable UQ in the converse proof of Appendix C.1 works for both
receivers.
Chapter 5
Compression with actions
5.1 Introduction
Consider an independent, identically distributed (i.i.d.) binary sequence Sn with S ∼ Bern(1/2). From standard source coding theory [19], we need at least one bit per
source symbol to describe the sequence for lossless compression. But suppose now
that we are allowed to make some modifications, subject to cost constraints, to the
sequence before compressing it, and we are only interested in describing the modified
sequence losslessly. The problem then becomes one of choosing the modifications so
that the rate required to describe the modified sequence is reduced, while staying
within our cost constraints. More concretely, for the binary sequence Sn, if we are
allowed to flip more than n/2 ones to zero, then the rate required to describe the
modified sequence is essentially zero. But what happens when we are allowed to flip
fewer than n/2 ones?
As a potentially more practical example, imagine we have a number of robots
working on a factory floor and the positions of all the robots need to be reported to
a remote location. Letting S represent the positions of the robots, we would expect
to send H(S) bits to the remote location. However, this ignores the fact that the
robots can also take actions to change their positions. A local command center can
first “take a picture” of the position sequence and then send out action commands
to the robots based on the picture so that they move in a cooperative way such that
CHAPTER 5. COMPRESSION WITH ACTIONS 38
the final position sequence requires fewer bits to describe. The command center may
face two issues in general: cost constraints and uncertainty. A cost constraint occurs
because each robot should save its power and not move too far away from its current
location. The uncertainty is a result of the robots not moving exactly as instructed
by the local command center.
Motivated by the preceding examples, we consider the problem illustrated in
Fig. 5.1 (Formal definitions will be given in the next section). Sn is our observed
state sequence. We model the constraint as a general cost function Λ(·, ·, ·) and the
uncertainty in the final output Y by a channel p(y|a, s).
Figure 5.1: Compression with actions. The action encoder first observes the state sequence Sn and then generates an action sequence An. The ith output Yi is the output of a channel p(y|a, s) with a = Ai and s = Si. The compressor generates a description M ∈ [1 : 2^{nR}] of Y n. The remote decoder generates Ŷ n, a reconstruction of Y n, based on M and its available side information Zn.
Our problem setup is closely related to the channel coding problem when the
state information is available at the encoder. The case where the state information
is causally available was first solved by Shannon in [20]. When the state information
is non-causally known at the encoder, the channel capacity result was derived in
[11] and [13]. Various interesting extensions can be found in [15, 17, 22–24]. The
difference in our approach described here is that we make the output of the channel
as compressible as possible. We give formal definitions for our problem are given in
the next section. Our main results when the decoder requires lossless reconstruction
are given in section 5.3, where we characterize the rate-cost tradeoff function for the
setting in Fig. 5.1. We also characterize the rate-cost function when Sn is only
causally known at the action encoder. In section 5.4, we extend the setting to the
lossy case where the decoder requires a lossy version of Y n.
5.2 Definitions
We give formal definitions for the setups under consideration in this section. We will
follow the notation of [8]. Sources (Sn, Zn) are assumed to be i.i.d.; i.e., (Sn, Zn) ∼ ∏_{i=1}^{n} pS,Z(si, zi).
5.2.1 Lossless case
Referring to Figure 5.1, a (n, 2nR) code for this setup consists of
• an action encoding function fe : Sn → An;
• a compression function fc : Yn → M ∈ [1 : 2nR];
• a decoding function fd : [1 : 2nR]× Zn → Y n.
The average cost of the system is EΛ(An, Sn, Y n) ≜ (1/n) Σ_{i=1}^{n} EΛ(Ai, Si, Yi). A rate-cost
tuple (R,B) is said to be achievable if there exists a sequence of codes such that

lim sup_{n→∞} Pr(Y n ≠ fd(fc(Y n), Zn)) = 0, (5.1)
lim sup_{n→∞} EΛ(An, Sn, Y n) ≤ B, (5.2)

where Λ(An, Sn, Y n) = Σ_{i=1}^{n} Λ(Ai, Si, Yi)/n. Given cost B, the rate-cost function,
R(B), is then the infimum of rates R such that (R,B) is achievable.
5.2.2 Lossy case
We also consider the setup where the decoder requires a lossy version of Y n. The
definitions remain largely the same, with the exception that the probability of error
constraint, inequality (5.1), is replaced by the following distortion constraint:
lim sup_{n→∞} E d(Y n, Ŷ n) = lim sup_{n→∞} (1/n) Σ_{i=1}^{n} E d(Yi, Ŷi) ≤ D. (5.3)
A rate R is said to be achievable if there exists a sequence of (n, 2nR) codes satisfying
both the cost constraint (inequality 5.2) and the distortion constraint (inequality 5.3).
Given cost B and distortion D, the rate-cost-distortion function, R(B,D), is then the
infimum of rates R such that the tuple (R,B,D) is achievable.
5.2.3 Causal observations of state sequence
In both the lossless and lossy cases, we will also consider the setup where the state
sequence is only causally known at the action encoder. The definitions remain the
same, except for the action encoding function, which is now restricted to the following
form: for each i ∈ [1 : n], fe,i : S^i → A.
5.3 Lossless case
In this section, we present our main results for the lossless case. Theorem 7 gives
the rate-cost function when the state sequence is noncausally available at the action
encoder, while Theorem 8 gives the rate-cost function when the state sequence is
causally available.
5.3.1 Lossless, noncausal compression with action
Theorem 7 (Rate-cost function for lossless, noncausal case). The rate-cost function
for the compression with action setup, when the state sequence Sn is noncausally available
at the action encoder, is given by

R(B) = min_{p(v|s), a=f(s,v): EΛ(S,A,Y) ≤ B} I(V;S|Z) + H(Y|V,Z), (5.4)

where the joint distribution is of the form p(s, v, a, y) = p(s)p(v|s)1{f(s,v)=a}p(y|a, s).
The cardinality of the auxiliary random variable V is upper bounded by |V| ≤ |S| + 2.
Remarks

• Replacing a = f(s, v) by a general distribution p(a|s, v) does not decrease the
minimum in (5.4). For any joint distribution p(s)p(v|s)p(a|s, v), we can always
find a random variable W and a function f such that W is independent of (S, V)
and A = f(V, W, S). Consider V′ = (V, W). The Markov condition
V′ − (A, S) − (Y, Z) still holds. Thus H(Y|V′, Z) + I(V′;S|Z) is achievable.
Furthermore,

I(V′;S|Z) + H(Y|V′,Z) = I(V,W;S|Z) + H(Y|V,W,Z)
≤ I(V,W;S|Z) + H(Y|V,Z)
= I(V;S|Z) + H(Y|V,Z).
• R(B) is a convex function of B.
• For each cost function Λ(s, a, y), we can replace it with a new cost function
involving only s and a by defining Λ′(s, a) = E[Λ(S,A, Y )|S = s, A = a]. Note
that Y is distributed as p(y|s, a) given S = s, A = a.
Achievability of Theorem 7 involves an interesting observation in the decoding operation,
but before proving the theorem, we first state a corollary of Theorem 7, the
case when side information is absent (Z = ∅). We will also sketch an alternative
achievability proof for the corollary, which will serve as a contrast to the achievability
scheme for Theorem 7.

Corollary 4 (Side information is absent). If Z = ∅, then the rate-cost function is given
by

R(B) = min_{p(v|s), a=f(s,v): EΛ(S,A,Y) ≤ B} I(V;S) + H(Y|V)

for some p(s, v, a, y) = p(s)p(v|s)1{f(s,v)=a}p(y|a, s).
Achievability for Corollary 4

Codebook generation: Fix p(v|s), f(s, v) and ǫ > 0.

• Generate 2^{n(I(V;S)+ǫ)} sequences vn(l) independently, l ∈ [1 : 2^{n(I(V;S)+ǫ)}], each according to ∏ pV(vi), to cover Sn.
• For each V n sequence, the Y n sequences that are jointly typical with V n are indexed by 2^{n(H(Y|V)+ǫ)} numbers.
Encoding and Decoding:

• The action encoder looks for a V n in the codebook that is jointly typical with Sn and generates Ai = f(Si, Vi), i = 1, ..., n.
• The compressor looks for a V n in the codebook that is jointly typical with the channel output Y n and sends the index of that V n sequence to the decoder. The compressor then sends the index of Y n as described in the second part of codebook generation.
• The decoder simply uses both indices from the compressor to reconstruct Y n.
Using standard typicality arguments, we can show that the encoding succeeds
with high probability and the probability of error can be made arbitrarily small.
Remark: Note that the V n codeword chosen by the compressor is not necessarily the
same as the V n codeword chosen by the action encoder. But this is not an error event,
since we still recover the same Y n even if a different V n codeword was used.
This scheme, however, does not extend to the case when side information is available
at the decoder. The term H(Y|Z, V) in Theorem 7 requires us to bin the set of
Y n sequences according to the side information available at the decoder. If we were
to extend the above achievability scheme, we would bin the set of Y n sequences into
2^{n(H(Y|Z,V)+ǫ)} bins. The compressor would find a V̂ n sequence that is jointly typical
with Y n, send its index to the decoder using a rate of I(V;S|Z) + ǫ, and then send
the index of the bin which contains Y n. The decoder would then look for the unique
Y n sequence in the bin that is jointly typical with V̂ n and Zn. Unfortunately, while
the V̂ n codeword is jointly typical with Y n with high probability, it is not necessarily
jointly typical with Zn, since V̂ n may not be equal to V n (V n is jointly typical with
Zn with high probability, as V n is jointly typical with Sn with high probability and
V − S − Z holds). One could try to overcome this problem by insisting that the compressor
find the same V n sequence as the action encoder, but this requirement imposes
additional constraints on the achievable rate.
Instead of requiring the compressor to find a jointly typical V n sequence, we use an
alternative approach to prove Theorem 7. We simply bin the set of all Y n sequences into
2^{n(I(V;S|Z)+H(Y|Z,V)+ǫ)} bins and send the bin index to the decoder. The decoder looks
for the unique Y n sequence in bin M such that (V n(l), Y n, Zn) is jointly typical for
some l ∈ [1 : 2^{n(I(V;S)+ǫ)}]. Note that there can be more than one V n(l) sequence which is
jointly typical with (Y n, Zn), but this is not an error event as long as the Y n sequence
in bin M is unique. We now give the details of this achievability scheme.
Proof of achievability for Theorem 7
Codebook generation
• Generate 2^{n(I(V;S)+δ(ǫ))} V n codewords according to ∏_{i=1}^{n} p(vi).
• Bin the entire set of possible Y n sequences uniformly at random into 2^{nR} bins B(M), where R > I(V;S) − I(V;Z) + H(Y|Z,V).
Encoding
• Given sn, the encoder looks for a vn sequence in the codebook such that (vn, sn) ∈ T(n)ǫ. If there is more than one, it randomly picks one from the set of typical sequences. If there is none, it picks a random index from [1 : 2^{n(I(V;S)+δ(ǫ))}].
• It then generates an according to ai = f(vi, si) for i ∈ [1 : n].
• The second encoder (the compressor) takes the output sequence yn and sends out the bin index M such that yn ∈ B(M).
Decoding
• The decoder looks for the unique yn sequence such that (vn(l), yn, zn) ∈ T(n)ǫ for some l ∈ [1 : 2^{n(I(V;S)+δ(ǫ))}] and yn ∈ B(M). If there is none, or more than one, it declares an error.
Analysis of probability of error
Define the following error events:

E0 := {(V n(L), Zn, Y n) /∈ T(n)ǫ},
El := {(V n(l), Zn, Ỹ n) ∈ T(n)ǫ for some Ỹ n ≠ Y n, Ỹ n ∈ B(M)}.

By symmetry of the codebook generation, it suffices to consider M = 1. The
probability of error is upper bounded by

P(E) ≤ P(E0) + Σ_{l=1}^{2^{n(I(V;S)+δ(ǫ))}} P(El).

P(E0) → 0 as n → ∞, following the standard analysis of the probability of error. It remains
to analyze the second error term. Consider P(El) and define

El(V n, Zn) := {(V n(l), Zn, Ỹ n) ∈ T(n)ǫ for some Ỹ n ≠ Y n, Ỹ n ∈ B(1)}.

We have
P(El) = P(El(V n, Zn))
= Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn, Zn = zn) P(El(vn, zn)|vn, zn)
= Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn, Zn = zn) Σ_{yn} P(Y n = yn|vn, zn) P(El(vn, zn)|vn, zn, yn)
(a)≤ Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn, Zn = zn) Σ_{yn} P(Y n = yn|vn, zn) 2^{n(H(Y|Z,V)+δ(ǫ)−R)}
(b)= Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn) P(Zn = zn) 2^{n(H(Y|Z,V)+δ(ǫ)−R)}
≤ 2^{n(H(V,Z)+δ(ǫ))} 2^{−n(H(V)−δ(ǫ))} 2^{−n(H(Z)−δ(ǫ))} 2^{n(H(Y|Z,V)+δ(ǫ)−R)}
= 2^{n(H(Y|V,Z)−I(V;Z)−R+4δ(ǫ))}.

Here (a) follows since the Y n sequences are binned uniformly at random, independently
of the other Y n sequences, and since there are at most 2^{n(H(Y|Z,V)+δ(ǫ))} Ỹ n
sequences which are jointly typical with a given typical (vn, zn). (b) follows from the
fact that the codebook generation is independent of (Sn, Zn); therefore, for any fixed
l, V n(l) is independent of Zn. Hence, if R ≥ I(V;S) − I(V;Z) + H(Y|Z,V) + 6δ(ǫ),

Σ_{l=1}^{2^{n(I(V;S)+δ(ǫ))}} P(El) ≤ 2^{−nδ(ǫ)} → 0

as n → ∞.
We now turn to the proof of the converse for Theorem 7.
Proof of converse for Theorem 7
Given a (n, 2^{nR}) code for which the probability of error goes to zero with n and
which satisfies the cost constraint, define Vi = (Z^{n\i}, S_{i+1}^n, Y^{i−1}). We have
nR ≥ H(M|Zn)
= H(M, Y n|Zn) − H(Y n|M, Zn)
(a)= H(M, Y n|Zn) − nǫn
= H(Y n|Zn) − nǫn
= Σ_{i=1}^{n} H(Yi|Y^{i−1}, Zn) − nǫn
= Σ_{i=1}^{n} [H(Yi|Y^{i−1}, S_{i+1}^n, Zn) + I(Yi; S_{i+1}^n|Y^{i−1}, Zn)] − nǫn
(b)= Σ_{i=1}^{n} H(Yi|Y^{i−1}, S_{i+1}^n, Zn) + Σ_{i=1}^{n} I(Y^{i−1}; Si|S_{i+1}^n, Zn) − nǫn
(c)= Σ_{i=1}^{n} H(Yi|Y^{i−1}, S_{i+1}^n, Zn) + Σ_{i=1}^{n} I(Y^{i−1}, S_{i+1}^n, Z^{n\i}; Si|Zi) − nǫn
(d)= Σ_{i=1}^{n} H(Yi|Vi, Zi) + Σ_{i=1}^{n} I(Vi; Si|Zi) − nǫn
= nH(YQ|VQ, Q, ZQ) + nI(VQ; SQ|Q, ZQ) − nǫn,
where (a) is due to Fano’s inequality. (b) follows from Csiszar sum identity. (c) holds
because (Sn, Zn) is an i.i.d source. Note that the Markov conditions, Vi−(Si, Ai)−Yi
and Vi − Si − Zi hold. Finally, we introduce Q as the time sharing random variable,
i.e., Q ∼ Unif[1, ..., n], and set V = (VQ, Q), Y = YQ and S = SQ, which completes
the proof.
5.3.2 Lossless, causal compression with action
Our next result gives the rate-cost function for the case of lossless, causal compression
with action.
Theorem 8 (Rate-cost function for lossless, causal case). The rate-cost function for
compression with action, when the state information is causally available at the action
encoder, is given by

R(B) = min_{p(v), a=f(s,v): EΛ(S,A,Y) ≤ B} H(Y|V,Z), (5.5)

where the joint distribution is of the form p(s, v, a, y) = p(s)p(v)1{f(s,v)=a}p(y|a, s).
Achievability sketch: Here V simply serves as a time-sharing random
variable. Fix a p(v) and f(s, v). We first generate a V n sequence and reveal it to
the action encoder, the compressor and the decoder. The encoder generates Ai =
f(Si, Vi). The compressor simply bins the set of Y n sequences into 2^{n(H(Y|V,Z)+ǫ)} bins
and sends the index of the bin which contains Y n. The decoder recovers Y n by finding
the unique Y n sequence in bin M such that (V n, Zn, Y n) is jointly typical.
Remark: Just as the achievability in the non-causal case is closely related to the channel coding strategy in [11], the achievability in this section uses the "Shannon strategy" of [20]. In both cases, the optimal channel coding strategy yields the most compressible output as the message rate goes to zero.
Proof of Converse: Given an (n, 2^{nR}) code that satisfies the constraints, define V_i = (S^{i−1}, Z^{n\i}). We have
nR ≥ H(M|Z^n)
   = H(M, Y^n|Z^n) − H(Y^n|M, Z^n)
 (a)= H(M, Y^n|Z^n) − nε_n
   = H(Y^n|Z^n) − nε_n
   = Σ_{i=1}^n H(Y_i|Y^{i−1}, Z_i, Z^{n\i}) − nε_n
   ≥ Σ_{i=1}^n H(Y_i|Y^{i−1}, A^{i−1}, S^{i−1}, Z_i, Z^{n\i}) − nε_n
 (b)= Σ_{i=1}^n H(Y_i|A^{i−1}, S^{i−1}, Z_i, Z^{n\i}) − nε_n
 (c)= Σ_{i=1}^n H(Y_i|V_i, Z_i) − nε_n
 (d)= n H(Y_Q|V_Q, Q, Z_Q) − nε_n
where (a) is due to Fano's inequality; (b) follows from the Markov chain Y_i − (S^{i−1}, A^{i−1}, Z^n) − Y^{i−1}; (c) follows since A^{i−1} is a function of S^{i−1}. Note that A_i is now a function of (S_i, V_i). Finally, in (d) we introduce the time-sharing random variable Q ∼ Unif[1, ..., n]. Setting V = (V_Q, Q) and Y = Y_Q completes the proof.
5.3.3 Examples
In this subsection, we consider an example with state sequence Sn ∼ i.i.d. Bern(1/2)
and Z = ∅. We have two actions available, A = 0 and A = 1. The cost constraint is
on the frequency of action A = 1, EA ≤ B. The channel output Yi = Si ⊕ Ai ⊕ SNi
where ⊕ is the modulo 2 sum and {SNi} are i.i.d. Bern(p) noise, p < 1/2. The
example is illustrated in Fig. 5.2.
[Figure: block diagram of the binary example. The action encoder observes S^n ∼ i.i.d. Bern(1/2) and produces A^n under the cost constraint E A ≤ B; the output Y^n = S^n ⊕ A^n ⊕ S_N^n, with S_N^n ∼ i.i.d. Bern(p), is fed to the compressor, which sends M ∈ {1, ..., 2^{nR}} to the decoder.]
Figure 5.2: Binary example with side information Z = ∅.
We use the following lemma to simplify the optimization problem in Eq. (5.4)
applied to the binary example.
Lemma 6. For the binary example, it is without loss of optimality to have the fol-
lowing constraints when solving the optimization problem of Eq. (5.4):
• V = {0, 1, 2}, P(V = 0) = P(V = 1) = θ/2, for some θ ∈ [0, 1].
• The function a = f(s, v) is of the form: f(s, 0) = s, f(s, 1) = 1 − s and
f(s, 2) = 0.
• P(S = 0|V = 1) = P(S = 1|V = 0) = ∆ and P(S = 0|V = 2) = 1/2.
• ∆θ ≤ B.
Note that the constraints guarantee that P(S = 0) = P(S = 1) = 1/2.
Proof. See Appendix D.1.
Using Lemma 6, we can simplify the objective function in Eq. (5.4) in the following
way:
H(Y|V) + I(V;S)
 = H(Y|V) − H(S|V) + H(S)
 = H(S ⊕ A ⊕ S_N|V) − H(S|V) + 1
 = (θ/2)[H(0 ⊕ S_N|V = 0) − H_2(∆)] + (θ/2)[H(1 ⊕ S_N|V = 1) − H_2(∆)] + (1 − θ)[H(S ⊕ S_N|V = 2) − 1] + 1
 = θ(H_2(p) − H_2(∆)) + 1

where H_2(·) is the binary entropy function, i.e., H_2(δ) = −δ log δ − (1 − δ) log(1 − δ). Hence

R(B) = min_{θ∈[2B,1], θ∆≤B} θ(H_2(p) − H_2(∆)) + 1
     = 1 + min_{∆∈[B,1/2]} (B/∆)(H_2(p) − H_2(∆))
     = 1 − B max_{∆∈[B,1/2]} (H_2(∆) − H_2(p))/∆
     = { 1 − B (H_2(b*) − H_2(p))/b*,  if 0 ≤ B < b*;
         1 − H_2(B) + H_2(p),          if b* ≤ B ≤ 1/2 }        (5.6)
where b* is the solution of the equation

    (H_2(b) − H_2(p))/b = dH_2(b)/db,   b ∈ [0, 1/2],        (5.7)
which is illustrated in Fig. 5.3.
[Figure: plot of H_2(b) versus b for b ∈ [0, 1/2], marking the point (p, H_2(p)) and the threshold b* at which (H_2(b*) − H_2(p))/b* = dH_2/db |_{b=b*}.]
Figure 5.3: The threshold b* solves (H_2(b) − H_2(p))/b = dH_2/db, b ∈ [0, 1/2].
Now let us shift our attention to the causal case of the binary example, i.e., Si is
only causally available at the action encoder.
Lemma 7. For the causal case of the binary example, it is without loss of optimality
to have the following constraints when solving the optimization problem in Eq. (5.5):
• V = {0, 1}, P(V = 0) = θ, for some θ ∈ [0, 1].
• The function a = f(s, v) is of the form: f(s, 0) = s, f(s, 1) = 0.
• θ/2 ≤ B.
Proof. See Appendix D.2.
Using Lemma 7, we can simplify the objective function in Eq. (5.5) in the following
way:
R(B) = min H(Y|V)
     = min_{θ∈[0,1], θ/2≤B} θ H(Y|V = 0) + (1 − θ) H(Y|V = 1)
     = min_{θ∈[0,1], θ/2≤B} θ H(S_N|V = 0) + (1 − θ) H(S ⊕ S_N|V = 1)
     = min_{θ∈[0,1], θ/2≤B} θ H_2(p) + (1 − θ)
     = { 2B H_2(p) + (1 − 2B),  0 ≤ B ≤ 1/2;
         H_2(p),                1/2 ≤ B. }
For the binary example with p = 0.1, we plot the rate-cost function R(B) for both
cases in the following figure.
[Figure: plot of R(B) versus the cost constraint B ∈ [0, 1/2] for p = 0.1, showing the causal and non-causal rate-cost functions; both curves decrease from 1 at B = 0 to H_2(0.1) at B = 1/2, with the non-causal curve below the causal one.]
Figure 5.4: Comparison between the non-causal and causal rate-cost functions. The parameter of the Bernoulli noise is set at 0.1.
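The two closed-form expressions can be compared directly. A self-contained sketch (p = 0.1 as in the figure); since non-causal state information can only help, the non-causal curve should never exceed the causal one:

```python
import math

def H2(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def b_star(p):
    # root of (H2(b) - H2(p))/b = log2((1-b)/b), found by bisection
    g = lambda b: H2(b) - H2(p) - b * math.log2((1 - b) / b)
    lo, hi = p + 1e-9, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return lo

def R_noncausal(B, p):
    # piecewise expression (5.6)
    bs = b_star(p)
    if B < bs:
        return 1 - B * (H2(bs) - H2(p)) / bs
    return 1 - H2(B) + H2(p)

def R_causal(B, p):
    # causal expression: 2B*H2(p) + (1 - 2B) for B <= 1/2
    return 2 * B * H2(p) + (1 - 2 * B) if B <= 0.5 else H2(p)

p = 0.1
gaps = [R_causal(B, p) - R_noncausal(B, p) for B in [i / 100 for i in range(51)]]
print(min(gaps))
```

The minimum gap is zero (attained at B = 0 and B = 1/2, where the two curves meet) and never negative.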
5.4 Lossy compression with actions
In this section, we extend our setup to the lossy case. We give an achievable rate-
cost-distortion region when Sn is available noncausally at the action encoder and
characterize the rate-cost-distortion function when Sn is available causally at the
encoder and Z = ∅.
Theorem 9. An upper bound on the rate-cost-distortion function for the case with non-causal
state information is given by

    R(B, D) ≤ min_{E Λ(S,A,Y)≤B, E d(Y,Ŷ)≤D} I(V;S|Z) + I(Y; Ŷ|V, Z)        (5.8)

where the joint distribution is of the form

    p(s, v, a, y, ŷ, z) = p(s, z) p(v|s) 1_{a=f(s,v)} p(y|a, s) p(ŷ|y, v).
Sketch of achievability: The codebook generation and the encoding at the action encoder are largely the same as in the lossless case. We generate 2^{n(I(V;S)+ε)} V^n sequences according to Π_{i=1}^n p_V(v_i), and for each v^n, generate 2^{n(I(Y;Ŷ|V)+ε)} ŷ^n sequences according to Π_{i=1}^n p(ŷ_i|v_i). The set of v^n sequences is partitioned into 2^{n(I(V;S)−I(V;Z)+2ε)} equal-sized bins B(m_0), and for each m_0, the set of ŷ^n sequences is partitioned into 2^{n(I(Y;Ŷ|V)−I(Ŷ;Z|V)+2ε)} equal-sized bins B(m_0, m_1). Given a sequence s^n, the action encoder finds the v^n sequence that is jointly typical with s^n and takes actions A_i = f(s_i, v_i) for i ∈ [1 : n]. At the compressor, we first find a v^n that is jointly typical with Y^n, and then a ŷ^n such that (v^n, Y^n, ŷ^n) ∈ T_ε^{(n)}. The compressor then sends the indices (M_0, M_1) such that the selected v^n ∈ B(M_0) and ŷ^n ∈ B(M_0, M_1). The decoder first recovers v^n by looking for the unique v^n ∈ B(m_0) such that (v^n, z^n) ∈ T_ε^{(n)}. Next, it recovers ŷ^n by looking for the unique ŷ^n ∈ B(m_0, m_1) such that (v^n, z^n, ŷ^n) ∈ T_ε^{(n)}. From the rates given, it is easy to see that all encoding and decoding steps succeed with high probability as n → ∞.
We now turn to the case when sn is causally known at the action encoder. In
this case, we are able to characterize the rate-cost-distortion function when no side
information is available, Z = ∅.
Theorem 10. The rate-cost-distortion function for the case with causal state information and no side information is given by

    R(B, D) = min_{p(v), a=f(s,v): E Λ(S,A,Y)≤B, E d(Y,Ŷ)≤D} I(Y; Ŷ|V)        (5.9)

where the joint distribution is of the form p(s, v, a, y, ŷ) = p(s) p(v) 1_{a=f(s,v)} p(y|a, s) p(ŷ|y, v).
The achievability is straightforward, with V as a time-sharing random variable known to all parties, and follows an analysis similar to that of Theorem 8.
Converse: Given an (n, 2^{nR}) code satisfying the cost and distortion constraints, we have

nR ≥ H(M)
   ≥ I(M; Ŷ^n)
   = Σ_{i=1}^n I(M; Ŷ_i|Ŷ^{i−1})
 (a)= Σ_{i=1}^n I(M; Ŷ_i|V_i)
 (b)≥ Σ_{i=1}^n I(Y_i; Ŷ_i|V_i)
 (c)= n I(Y_Q; Ŷ_Q|V_Q, Q)
where in (a) we set V_i = Ŷ^{i−1}; (b) holds because Ŷ^n is a function of M. Note that V_i is independent of S_i. In (c) we introduce the time-sharing random variable Q ∼ Unif[1, ..., n]. Thus, by setting V = (V_Q, Q) and Y = Y_Q, we have shown that R(B, D) ≥ I(Y; Ŷ|V) where V ⊥ S. This is equivalent to the expression in the theorem because of the following:

• Replacing p(ŷ|y, v) by a general distribution p(ŷ|a, y, v, s) does not decrease the minimum in (5.9), since the mutual information term I(Y; Ŷ|V) depends only on the marginal distribution p(y, ŷ, v).

• Replacing a = f(s, v) by a general distribution p(a|s, v) does not decrease the minimum in (5.9), because for any joint distribution p(s)p(v)p(a|s, v)p(y|a, s)p(ŷ|y, v), I(Y; Ŷ|V = v) is a concave function of p(y|v), which is a linear function of p(a|s, v).
Chapter 6
Conclusions
In this thesis, we first revisited Gács and Körner's definition of common information.
It is equal to the common randomness that two remote nodes, with access to X and
Y respectively, can generate without communication. The fact that this quantity is
degenerate in most cases motivated us to investigate the initial efficiency of common
randomness generation as the communication rate goes to zero. It turned out that
the initial efficiency is equal to 1/(1 − ρ²(X; Y)), where ρ is the Hirschfeld-Gebelein-Rényi
maximal correlation between X and Y. This result gave the Hirschfeld-Gebelein-Rényi
maximal correlation an operational justification as a measure of commonness between
two random variables, and it also indicated that communication is the key to
unlocking common randomness. We then turned to the saturation efficiency as the
communication exhausts nature's randomness. We provided a sufficient condition for
the saturation efficiency to be 1, which implies continuity of the slope of the common
randomness-rate function at that point. An example was given to show that the slope
is not continuous in general.
In the next part of the thesis, we introduced common randomness generation with
actions, in which a node can take actions to influence the random variables received
from nature. A single-letter expression for the common randomness-rate function was
obtained. We showed through an example that the greedy approach of fixing the
"best" action is not optimal in general when the communication rate is strictly positive.
But as the rate goes to zero, the initial efficiency in the action setting was proved
to be 1/(1 − max_{a∈A} ρ²(X, Y|A = a)), i.e., the reciprocal of one minus the square of the Hirschfeld-Gebelein-Rényi maximal correlation conditioned on the best action. The saturation efficiency with actions was analyzed similarly to the no-action setting.
In the last part of the thesis, we kept the action feature but shifted our focus to
source coding. The idea that one could modify a source, subject to a cost constraint,
before compression was formulated in an information-theoretic setting. Techniques
from both channel coding and source coding were combined to obtain a single-letter
expression for the rate-cost function. In our achievability scheme, modification of the
source sequence is essentially equivalent to setting up cloud centers for the source
sequence, and compression of the modified sequence is carried out via a classic binning
approach. Interestingly, this approach does not require correct decoding of the cloud
center.
Appendix A
Proofs of Chapter 2
A.1 Proof of the convexity of ρ(P_X ⊗ P_{Y|X}) in P_{Y|X}
The inequality holds trivially if any one of the r.v.'s is degenerate (i.e., equal to a constant with probability 1). We exclude this case in the following proof.
Fix arbitrary functions f and g such that E g(X) = E f(Y, Z) = 0, E g²(X) = 1, and E f²(Y, Z) = 1. Without loss of generality we can assume E g(X)f(Y, Z) ≥ 0 (otherwise we consider −f instead of f). Define μ(z) = E[f(Y, Z)|Z = z].
E g(X)f(Y, Z)        (A.1)
 = E g(X)(f(Y, Z) − μ(Z) + μ(Z))
 = E g(X)(f(Y, Z) − μ(Z)) + E g(X)μ(Z)
 (a)= E g(X)(f(Y, Z) − μ(Z)) + E g(X) E μ(Z)
 (b)= E g(X)(f(Y, Z) − μ(Z)),

where (a) is because X ⊥ Z and (b) is due to E g(X) = 0. Define η = sqrt( E f²(Y, Z) / E(f(Y, Z) − μ(Z))² ).
Note that

E(f(Y, Z) − μ(Z))²        (A.2)
 = E_Z E[(f(Y, Z) − μ(Z))² | Z]
 ≤ E_Z E[f²(Y, Z)|Z]
 = E f²(Y, Z).
Thus η ≥ 1. Consider a new function f′ = η[f(Y, Z) − μ(Z)]. Note that E f′(Y, Z) = η[E f(Y, Z) − E μ(Z)] = 0 and E(f′(Y, Z))² = 1. Furthermore, E g(X)f′(Y, Z) = η E g(X)f(Y, Z) ≥ E g(X)f(Y, Z). Thus it is sufficient to consider f with the property that E[f(Y, Z)|Z = z] = 0, which enables us to write the optimization problem for solving ρ(X; Y, Z) in the following equivalent form:

max E g(X)f(Y, Z)        (A.3)
subject to E g(X) = 0,
           E_{Y|Z=z} f(Y, z) = 0 ∀z,
           E g²(X) = E f²(Y, Z) = 1.
Define s_z = sqrt(E[f²(Y, Z)|Z = z]). To simplify the notation, let p_z = P_Z(z) and ρ_z = ρ(P_X ⊗ P_{Y|X,Z=z}). We have the constraint

    Σ_z p_z s_z² = 1.        (A.4)
Note that

max E g(X)f(Y, Z)        (A.5)
 = max E_Z [E[g(X)f(Y, Z)|Z]]
 = max Σ_z p_z E[g(X)f(Y, Z)|Z = z]
 (a)≤ max Σ_z p_z s_z ρ(P_X ⊗ P_{Y|X,Z=z})
 = max Σ_z p_z s_z ρ_z
 (b)≤ sqrt(Σ_z p_z ρ_z²),
where (a) is due to the fact that given Z = z, (X, Y) has joint distribution P_X ⊗ P_{Y|X,Z=z}, and (b) is based on the following argument. Consider the optimization problem with variables s_z, z ∈ Z:

max Σ_z p_z ρ_z s_z
subject to Σ_z p_z s_z² = 1, s_z ≥ 0 ∀z.

Using the method of Lagrange multipliers, we construct L(s, λ) = Σ_z p_z ρ_z s_z − λ Σ_z p_z s_z². Solving ∂L/∂s_z = p_z ρ_z − 2λ p_z s_z = 0, we obtain s_z = ρ_z/(2λ), ∀z ∈ Z. Using the constraint Σ_z p_z s_z² = 1, we have λ = sqrt(Σ_z p_z ρ_z²)/2, which yields sqrt(Σ_z p_z ρ_z²) as the maximum.
This completes the proof of Lemma 1.
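The Lagrange solution above is easy to spot-check numerically: the closed-form optimum sqrt(Σ_z p_z ρ_z²) should dominate every feasible point. A small sketch with arbitrary, made-up values of p_z and ρ_z:

```python
import math
import random

random.seed(0)

p = [0.2, 0.5, 0.3]            # hypothetical p_z
rho = [0.9, 0.4, 0.7]          # hypothetical rho_z values in [0, 1]

# closed-form optimum from the Lagrange argument
opt = math.sqrt(sum(pz * rz**2 for pz, rz in zip(p, rho)))

# compare against random feasible points: s_z >= 0 with sum_z p_z s_z^2 = 1
best_random = 0.0
for _ in range(10_000):
    s = [random.random() for _ in p]
    norm = math.sqrt(sum(pz * sz**2 for pz, sz in zip(p, s)))
    s = [sz / norm for sz in s]                    # project onto the constraint
    best_random = max(best_random, sum(pz * rz * sz for pz, rz, sz in zip(p, rho, s)))

print(opt, best_random)   # best_random approaches opt from below
```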
Appendix B
Proofs of Chapter 3
B.1 Proof of the continuity of C(R) at R = 0
Fix an arbitrary ε > 0. Then

ε ≥ I(X; U) − I(Y; U)        (B.1)
 (a)= I(X; U|Y)
 = Σ_y p(y) I(X; U|Y = y)
 = Σ_y p(y) D(p(x, u|Y = y) || p(x|Y = y) p(u|Y = y)).
Thus D(p(x, u|y) || p(x|y) p(u|y)) ≤ ε / min_{y∈Y} p(y) for all y ∈ Y. Via Pinsker's inequality, Σ_x |p(x) − q(x)| ≤ sqrt(2 ln 2 · D(p||q)), we obtain

Σ_{x,u} |p(x, u|y) − p(x|y) p(u|y)| ≤ ε′        (B.2)
⇒ Σ_{x,u} p(x|y) |p(u|x, y) − p(u|y)| ≤ ε′
⇒ Σ_{x,u} p(x|y) |p(u|x) − p(u|y)| ≤ ε′,

where ε′ = sqrt(2 ln 2 · ε / min_{y∈Y} p(y)) and the last step is due to the Markov chain U − X − Y. Thus for each (x, y) pair such that p(x, y) > 0, we have |p(u|x) − p(u|y)| ≤ δ, where δ = ε′ / min_{(x,y): p(x,y)>0} p(x, y).
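Pinsker's inequality in the form used here (L1 distance versus divergence measured in bits) can be spot-checked numerically on random distribution pairs:

```python
import math
import random

random.seed(0)

def rand_dist(k):
    w = [random.random() + 1e-6 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

violations = 0
for _ in range(1000):
    p, q = rand_dist(5), rand_dist(5)
    l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
    # Pinsker: sum_x |p(x) - q(x)| <= sqrt(2 ln 2 * D(p||q)), with D in bits
    violations += l1 > math.sqrt(2 * math.log(2) * kl_bits(p, q)) + 1e-12
print(violations)   # expect 0
```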
Let V be the maximum common r.v. of p(x, y). There exist deterministic functions g and f such that V = g(X) = f(Y). For each v, pick an arbitrary y* from the y's such that f(y) = v. Thus we create a mapping y* = y*(v).
We claim that if x and y are in the same block, then |p(u|x) − p(u|y)| ≤ δ′, where δ′ = (2|X| + 1)δ. This is due to the fact that if x and y satisfy g(x) = f(y), then there exists a sequence (x, y_1), (x_1, y_1), (x_1, y_2), ..., (x_n, y) such that the probability of each pair is strictly positive [10]. Using the triangle inequality:

|p(u|x) − p(u|y)|
 ≤ |p(u|x) − p(u|y_1)| + |p(u|y_1) − p(u|y)|
 ≤ δ + |p(u|y_1) − p(u|y)|
 ≤ δ + |p(u|x_1) − p(u|y_1)| + |p(u|x_1) − p(u|y)|
 ≤ 2δ + |p(u|x_1) − p(u|y)|
 ...
 ≤ (2n + 1)δ
 ≤ (2|X| + 1)δ
 = δ′.
Consider a new distribution p*(x, y, u) = p(x, y) p(u|y*(f(y))). Note that ||p − p*||_1 goes to zero as ε goes to 0. Therefore lim_{ε→0} |I(X; U|V) − I*(X; U|V)| = 0. Furthermore, under the distribution p*, the Markov chain X − V − U holds. Thus

lim_{ε→0} I(X; U) = lim_{ε→0} I(X; U, V)
 = I(X; V) + lim_{ε→0} I*(X; U|V)
 = I(X; V)
 = H(V).
Appendix C
Proofs of Chapter 4
C.1 Converse proof of Theorem 5
We bound the rate R as follows:
nR ≥ H(M)
   = I(X^n; M)
   = I(X^n; M, Y^n) − I(X^n; Y^n|M)
   = Σ_{i=1}^n I(X_i; M, Y^n|X^{i−1}) − I(X^n; Y^n|M)
 (a)= Σ_{i=1}^n I(X_i; M, Y^n, X^{i−1}) − I(X^n; Y^n|M)
 (b)= Σ_{i=1}^n I(X_i; M, A_i, Y^n, X^{i−1}) − I(X^n; Y^n|M)
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i) + Σ_{i=1}^n I(X_i; Y_i|A_i) − I(X^n; Y^n|M)
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i) + Σ_{i=1}^n [I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M)]
 (c)≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i)
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − Σ_{i=1}^n I(X_i; K|M, Y^n, X^{i−1}, A_i)
 (d)= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − Σ_{i=1}^n I(X_i; K|M, Y^n, X^{i−1})
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − I(X^n; K|M, Y^n)
   ≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − H(K|M, Y^n)
 (e)≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − H(K|K′)
   ≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, X^{i−1}|A_i, Y_i) − H(K|K′)
 (f)= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; U_i|A_i, Y_i) − H(K|K′)
 (g)= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; U_i|A_i) − Σ_{i=1}^n I(Y_i; U_i|A_i) − H(K|K′)
   = Σ_{i=1}^n I(X_i; A_i, U_i) − Σ_{i=1}^n I(Y_i; U_i|A_i) − H(K|K′)
 (h)= n(I(X_Q; A_Q, U_Q|Q) − I(Y_Q; U_Q|A_Q, Q) − H(K|K′)/n)
   = n(I(X_Q; A_Q, U_Q, Q) − I(Y_Q; U_Q|A_Q, Q) − H(K|K′)/n)
   ≥ n(I(X_Q; A_Q, U_Q, Q) − I(Y_Q; U_Q, Q|A_Q) − H(K|K′)/n)
where (a) is because the X_i are i.i.d.; (b) and (d) are due to the fact that A^n is a function of M; and (c) comes from the following chain of inequalities:

I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M)        (C.1)
 = I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M, A^n)
 = H(Y_i|A_i) − H(Y_i|Y^{i−1}, M, A^n) − H(Y_i|X_i, A_i) + H(Y_i|X^n, Y^{i−1}, M, A^n)
 = H(Y_i|A_i) − H(Y_i|Y^{i−1}, M, A^n) − H(Y_i|X_i, A_i) + H(Y_i|X_i, A_i)
 ≥ 0,

where the third equality comes from the Markov chain Y_i − (X_i, A_i) − (X^{n\i}, Y^{i−1}, M, A^{n\i}); (e) is because K′ is a function of M and Y^n; in (f), we set U_i = (K, M, X^{i−1}). Note that U_i − (X_i, A_i) − Y_i, which justifies (g); in (h), we introduce a time-sharing random variable Q, uniformly distributed on {1, ..., n} and independent of (X^n, K, M, Y^n).
We bound the entropy of K as follows:

H(K) (a)= I(X^n; K)        (C.2)
 = Σ_{i=1}^n I(X_i; K|X^{i−1})
 (b)= Σ_{i=1}^n I(X_i; K, X^{i−1})
 ≤ Σ_{i=1}^n I(X_i; U_i)
 = n I(X_Q; U_Q|Q)
 = n I(X_Q; U_Q, Q),

where (a) is due to the fact that K is a function of X^n and (b) holds because the X_i are i.i.d. Set X = X_Q, Y = Y_Q and U = (U_Q, Q), which finishes the proof.
C.2 Proof for initial efficiency with actions
The goal is to prove that

sup_{p(a,u|x)} I(Y; U|A) / (I(X; A) + I(X; U|A)) = max_{a∈A} ρ_m²(X, Y|A = a).

Define

∆_1(P_A) = {p(a, u|x) : Σ_x p(a|x) p_X(x) = P_A(a), ∀a ∈ A},
∆_2(δ) = {p(a, u|x) : I(X; A) + I(X; U|A) ≤ δ}.
That is, ∆_1(P_A) is the set of conditional distributions p(a, u|x) such that the induced marginal distribution of A is P_A, and ∆_2(δ) is the set of conditional distributions p(a, u|x) such that I(X; A) + I(X; U|A) does not exceed δ.
sup_{p(a,u|x)} I(Y; U|A) / (I(X; A) + I(X; U|A))
 = sup_{P_A} sup_{δ≥0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A))
 = sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A)),

where the last step can be proved by the following concavity lemma:
Lemma 8. Fixing an arbitrary marginal distribution P_A, define

f(δ) = sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A).

Then f(δ) is concave in δ.

Proof. Fixing the marginal distribution P_A, consider any p(a_1, u_1|x) ∈ ∆_1(P_A) ∩ ∆_2(δ_1) and p(a_2, u_2|x) ∈ ∆_1(P_A) ∩ ∆_2(δ_2). Construct p(a, u|x) = λ p(a_1, u_1|x) + (1 − λ) p(a_2, u_2|x). Introduce a time-sharing r.v. Q which equals 1 w.p. λ and 2 w.p. 1 − λ. We have

I(X; A, U, Q) = I(X; A, U|Q)
 = λ I(X; A_1, U_1) + (1 − λ) I(X; A_2, U_2)
 ≤ λδ_1 + (1 − λ)δ_2

and

I(Y; U, Q|A) ≥ I(Y; U|A, Q)
 = λ I(Y; U_1|A_1) + (1 − λ) I(Y; U_2|A_2).

Note that (A, U′) is a valid choice, where U′ = [U, Q]. Thus

f(λδ_1 + (1 − λ)δ_2) ≥ λ f(δ_1) + (1 − λ) f(δ_2),

which completes the concavity proof of f.
Note that

I(Y; U|A) / (I(X; A) + I(X; U|A)) ≤ max_{a: P_A(a)>0} I(Y; U|A = a) / I(X; U|A = a)
 ≤ max_{a: P_A(a)>0} ρ²(P_{X|A=a} ⊗ P_{Y|X,A=a}),

where the first inequality is a consequence of [4, Lemma 16.7.1] and the last inequality comes from Lemma 4.
Therefore

sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A))
 ≤ sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} max_{a: P_A(a)>0} ρ²(P_{X|A=a} ⊗ P_{Y|X,A=a})
 (a)= sup_{P_A} max_{a∈A: P_A(a)>0} ρ²(P_X ⊗ P_{Y|X,A=a})
 = max_{a∈A} ρ²(P_X ⊗ P_{Y|X,A=a}),

where (a) can be proved by observing that, for a fixed marginal distribution P_A, δ ↓ 0 implies ||P_X − P_{X|A=a}||_{ℓ1} ↓ 0 for every a ∈ A with P_A(a) > 0, and ρ(P′_X ⊗ P_{Y|X,A=a}) as a function of P′_X is uniformly continuous around P′_X = P_X. This upper bound is actually achievable: we can simply fix the action a that achieves max_{a∈A} ρ²(P_X ⊗ P_{Y|X,A=a}) and use Lemma 4 to complete the proof.
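For finite alphabets, the maximal correlation appearing above can be computed numerically: by Witsenhausen's characterization [28], ρ_m(X; Y) is the second-largest singular value of the matrix Q(x, y) = p(x, y)/sqrt(p(x) p(y)). A self-contained sketch (pure-Python power iteration; the doubly symmetric binary source with crossover q serves as a test case, for which ρ_m = 1 − 2q):

```python
import math
import random

random.seed(0)

def maximal_correlation(joint):
    """HGR maximal correlation of a finite joint pmf, via Witsenhausen's
    characterization: the 2nd singular value of Q(x,y) = p(x,y)/sqrt(p(x)p(y))."""
    nx, ny = len(joint), len(joint[0])
    px = [sum(row) for row in joint]
    py = [sum(joint[x][y] for x in range(nx)) for y in range(ny)]
    # Deflate the known top singular pair (sqrt(px), sqrt(py)), sigma = 1;
    # the largest remaining singular value is rho_m.
    Q = [[joint[x][y] / math.sqrt(px[x] * py[y]) - math.sqrt(px[x] * py[y])
          for y in range(ny)] for x in range(nx)]
    v = [random.random() for _ in range(ny)]
    for _ in range(500):                      # power iteration on Q^T Q
        u = [sum(Q[x][y] * v[y] for y in range(ny)) for x in range(nx)]
        w = [sum(Q[x][y] * u[x] for x in range(nx)) for y in range(ny)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0:
            return 0.0
        v = [t / norm for t in w]
    u = [sum(Q[x][y] * v[y] for y in range(ny)) for x in range(nx)]
    return math.sqrt(sum(t * t for t in u))

# doubly symmetric binary source: X ~ Bern(1/2), Y = X xor N, N ~ Bern(q)
q = 0.1
joint = [[(1 - q) / 2, q / 2], [q / 2, (1 - q) / 2]]
rho = maximal_correlation(joint)
print(rho)   # expect 1 - 2q = 0.8
```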
C.3 Proof of Lemma 5
Set A ⊥ X

By symmetry, it is without loss of optimality to set A = 1. The maximum common r.v. V between X and Y has the following form:

V = 1 if X = 1;  V = 2 if X = 2;  V = 3 if X ∈ {3, 4}.

For any U such that U − X − Y:
I(U; X) − I(U; Y)
 (a)= I(U, V; X) − I(U, V; Y)
 = I(V; X) − I(V; Y) + I(U; X|V) − I(U; Y|V)
 (b)= I(U; X|V) − I(U; Y|V)
 (c)= (1/2)[I(U; X|V = 3) − I(U; Y|V = 3)]
 (d)= (1/2)[I(U; X|V = 3) − (1 − p) I(U; X|V = 3)]
 = (p/2) I(U; X|V = 3),

where (a) is due to Lemma 3; (b) is because V is a deterministic function of X and a deterministic function of Y; (c) is due to the fact that conditioned on V = 1 or V = 2, X = Y; and (d) is because, conditioned on V = 3, Y is an erased version of X.
On the other hand,

I(U; X) = I(U, V; X)
 = I(V; X) + I(U; X|V)
 = H(V) + I(U; X|V)
 = H(V) + (1/2) I(U; X|V = 3)
 = 3/2 + (1/2) I(U; X|V = 3).

Thus the achievable (C, R) pairs when A ⊥ X are of the form C = 3/2 + R/p. Note that 0 ≤ I(U; X|V = 3) ≤ 1, and thus 0 ≤ R ≤ p/2.
Correlate A with X through Fig. 4.3

We construct a r.v. V in the following way to facilitate the proof: V has support set {1, 2, 3} and is a deterministic function of (X, A):

• If A = 1: V = 1 if X = 1; V = 2 if X = 2; V = 3 if X ∈ {3, 4}.
• If A = 2: V = 1 if X = 3; V = 2 if X = 4; V = 3 if X ∈ {1, 2}.

Note that conditioned on A, V is the maximum common r.v. between X and Y. We simply set U = V (this is not optimal in general, but good enough to beat the A ⊥ X choice for some R). The communication rate is

I(X; A) + I(V; X|A) − I(V; Y|A) = I(X; A) + H(V|A) − H(V|A)
 = I(X; A)
 = 1 − H_2(α).

On the other hand, the common randomness generated is

I(X; A, V) = I(X; A) + I(X; V|A)
 = 1 − H_2(α) + H(V|A)
 = 1 − H_2(α) + H(α, (1 − α)/2, (1 − α)/2)
 = 2 − α.
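The last simplification uses the identity H(α, (1−α)/2, (1−α)/2) = H_2(α) + (1 − α), which gives 1 − H_2(α) + H(α, (1−α)/2, (1−α)/2) = 2 − α. A quick numerical check of the identity:

```python
import math

def H(*probs):
    """Shannon entropy in bits of a pmf given as positional probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def H2(a):
    return H(a, 1 - a)

for alpha in [0.1, 0.25, 0.5, 0.9]:
    cr = 1 - H2(alpha) + H(alpha, (1 - alpha) / 2, (1 - alpha) / 2)
    assert abs(cr - (2 - alpha)) < 1e-12
print("identity verified")
```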
Appendix D
Proofs of Chapter 5
D.1 Proof of Lemma 6
Fixing v, the function a = f(s, v) has only four possible forms: a = s, a = 1 − s, a = 0 and a = 1. Thus, we can divide V into four groups:

V_0 = {v : f(s, v) = s},
V_1 = {v : f(s, v) = 1 − s},
V_2 = {v : f(s, v) = 0},
V_3 = {v : f(s, v) = 1}.        (D.1)
First, it is without loss of optimality to set V3 = ∅. That is because for each v ∈ V3,
we can change the function to f(s, v) = 0. The rate I(V ;S) + H(Y |V ) does not
change and the cost EA only decreases.
Rewrite the objective function in the following way:

I(V; S) + H(Y|V) = H(Y|V) − H(S|V) + H(S)
 = H(S ⊕ A ⊕ S_N|V) − H(S|V) + H(S)
 = Σ_{v∈V_0} (H_2(p) − H(S|V = v)) p(v)
 + Σ_{v∈V_1} (H_2(p) − H(S|V = v)) p(v)
 + Σ_{v∈V_2} (H(S ⊕ S_N|V = v) − H(S|V = v)) p(v),        (D.2)

where the last step is obtained by plugging in the actual form of a = f(s, v) for each group of v.
Second, it is sufficient to have |V_0| = 1 and |V_1| = 1. To prove this, let v_1, v_2 ∈ V_0. Note that H(S|V = v) is a concave function of p(s|V = v). Thus if we replace v_1, v_2 by a single v_3 with p(v_3) = p(v_1) + p(v_2) and

p(s|V = v_3) = p(v_1)/(p(v_1) + p(v_2)) · p(s|V = v_1) + p(v_2)/(p(v_1) + p(v_2)) · p(s|V = v_2),

we preserve the distribution of S and the cost E A, but we reduce the first term, Σ_{v∈V_0}(H_2(p) − H(S|V = v)) p(v), in Eq. (D.2). Therefore, we can set V_0 = {0} and V_1 = {1}.

Third, note that for each v ∈ V_2,

H(Y|V = v) − H(S|V = v) = H(S ⊕ A ⊕ S_N|V = v) − H(S|V = v)
 = H(S ⊕ S_N|V = v) − H(S|V = v)
 ≥ 0.
Last, if P(S = 0|V = 0) ≠ P(S = 1|V = 1), consider a new auxiliary random variable V′ with the following distribution:

• V′ ∈ {0, 1, 2}, P(V′ = 0) = P(V′ = 1) = (P(V = 0) + P(V = 1))/2.
• The function a = f(s, v′) is of the form f(s, 0) = s, f(s, 1) = 1 − s and f(s, 2) = 0.
• P(S = 0|V′ = 2) = 1/2 and

P(S = 1|V′ = 0) = P(S = 0|V′ = 1) = [P(S = 1|V = 0) P(V = 0) + P(S = 0|V = 1) P(V = 1)] / [P(V = 0) + P(V = 1)].

Comparing (S, V′) with (S, V), we can check that the cost E A and the distribution of S are preserved, while the objective function is reduced, which completes the proof.
D.2 Proof of Lemma 7
Similar to the proof of Lemma 6, we divide V into V_0, V_1, V_2, V_3. Using the same argument, we can set V_3 = ∅. Rewrite the objective function H(Y|V) in the following way:

H(Y|V) = H(S ⊕ A ⊕ S_N|V)        (D.3)
 = Σ_{v∈V_0} H_2(p) p(v) + Σ_{v∈V_1} H_2(p) p(v) + Σ_{v∈V_2} H(S ⊕ S_N|V = v) p(v)
 = H_2(p) Σ_{v∈V_0 ∪ V_1} p(v) + Σ_{v∈V_2} p(v),

where the last equality holds because V is independent of S, so H(S ⊕ S_N|V = v) = 1. This implies that it is sufficient to consider the case |V_0| = 1, V_1 = ∅ and |V_2| = 1, which completes the proof.
Bibliography
[1] R. Ahlswede and I. Csiszár, "Common Randomness in Information Theory and
Cryptography – Part I: Secret Sharing", IEEE Trans. Inf. Theory, vol. 39, no. 4,
pp. 1121–1132, July 1993.
[2] R. Ahlswede and I. Csiszár, "Common Randomness in Information Theory and
Cryptography – Part II: CR Capacity", IEEE Trans. Inf. Theory, vol. 44, no. 1,
pp. 225–240, January 1998.
[3] R. F. Ahlswede, and J. Korner, “Source coding with side information and a
converse for degraded broadcast channels,” IEEE Trans. Inf. Theory, vol. 21,
no. 6, pp. 629-637, 1975.
[4] T. Cover and J. Thomas, “Elements of Information Theory”, John Wiley&Sons,
2nd Edition, 2006.
[5] I Csiszar and P. Narayan, “Common Randomness and Secret Key Generation
with a Helper”, IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 344–366, March,
2000.
[6] P. Cuff, T. Cover, and H. Permuter, “Coordination capacity,” IEEE Trans. Inf.
Theory, vol. 56, no. 9, pp. 4181–4206, September 2010.
[7] A. Dembo, A. Kagan, and L. A. Shepp, “Remarks on the maximum correlation
coefficient”, Bernoulli, no. 2, pp. 343–350, April 2001.
[8] A. El Gamal, and Y. H. Kim, “Lectures on Network Information Theory,” 2010,
available online at ArXiv: http://arxiv.org/abs/1001.3404.
[9] E. Erkip and T. Cover, “The Effciency of Investment Information”, IEEE Trans.
Inf. Theory, vol. 44, no. 3, pp. 1026–1040, May 1998.
[10] P. Gacs and J. Korner, “Common information is far less than mutual informa-
tion”, Problems of Control and Information Theory, vol. 2, no. 2, pp. 119-162,
1972
[11] S. I. Gelfand and M. S. Pinsker, "Coding for Channel with Random Parameters,"
Probl. Contr. and Inform. Theory, vol. 9, no. 1, pp. 19–31, 1980.
[12] H. Gebelein, “Das statistische problem der Korrelation als variationsund Eigen-
wertproblem und sein Zusammenhang mit der Ausgleichungsrechnung,” Z. fur
angewandte Math. und Mech., vol. 21, pp. 364-379, 1941.
[13] C. Heegard and A. El Gamal, "On the Capacity of Computer Memory with De-
fects," IEEE Trans. Inform. Theory, vol. 29, no. 5, pp. 731–739, September 1983.
[14] H. O. Hirschfeld, “A connection between correlation and contingency,” Proc.
Cambridge Philosophical Soc., vol. 31, pp. 520-524, 1935
[15] Y. H. Kim, A. Sutivong, and T. M. Cover, "State amplification," IEEE Trans.
Inform. Theory, vol. 54, no. 5, pp. 1850–1859, May 2008.
[16] H. O. Lancaster, “ Some properties of the bivariate normal distribution consid-
ered in the form of a contingency table,” Biometrika, 44, pp. 289–292, 1957
[17] Pulkit Grover, Aaron Wagner, and Anant Sahai, “Information Embedding meets
Distributed Control”, IEEE Information Theory Workshop, January 2010 in
Cairo, Egypt.
[18] A. Renyi, “On measures of dependence”, Acta Mathematica Hungarica, vol. 10,
no.3-4, pp.441–451, 1959.
[19] C. Shannon, “A Mathematical Theory of Communication,” Bell System Techni-
cal Journal, Vol. 27, pp. 379-423, 623-656, 1948.
[20] C. Shannon, “Channels with side information at the transmitter,” IBM J. Res.
Develop., Vol. 2, pp. 289-293, 1958.
[21] D. Slepian and J. Wolf, "Noiseless coding of correlated information sources", IEEE
Trans. Inf. Theory, vol. 19, no. 4, pp. 471–480, July 1973.
[22] S. Sigurjonsson, and Y. H. Kim, “On multiple user channels with causal state in-
formation at the transmitters,” in Proceedings of IEEE International Symposium
on Information Theory, Adelaide, Australia, Sep. 2005
[23] A. Sutivong and T. Cover, "Rate vs. Distortion Trade-off for Channels with
State Information", in Proceedings of the 2002 IEEE International Symposium on
Information Theory, Lausanne, Switzerland, June 2002.
[24] O. Sumszyk, and Y. Steinberg, “Information embedding with reversible stego-
text”, in Proceedings of the 2009 IEEE Symposium on Information Theory, Seoul,
Korea, Jun. 2009
[25] N. Tishby, F.C. Pereira, and W. Bialek, “The Information Bottleneck method,”
The 37th annual Allerton Conference on Communication, Control, and Comput-
ing, Sept. 1999, pp. 368-377
[26] S. Verdu, “On channel capacity per unit cost,” IEEE Trans. Inf. Theory, vol.
36, no. 5, pp. 1019–1030, September 1990.
[27] T. Weissman and H. Permuter, “Source Coding with a Side Information ‘Vending
Machine’ ”, IEEE Trans. Inf. Theory, submitted 2009.
[28] H. S. Witsenhausen, “ On sequences of pairs of dependent random variables.”,
SIAM J. APPL. Math. vol. 28, no. 1, January 1975.
[29] A. Wyner, and J. Ziv, “A theorem on the entropy of certain binary sequences
and applications-I,” IEEE Trans. Inf. Theory, vol. 19, no. 6, pp. 769-772, 1973.
[30] A. Wyner and J. Ziv, “The rate distortion function for source coding with side
information at the receiver”, IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 1–10,
1976