COMMON RANDOMNESS, EFFICIENCY, AND ACTIONS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Lei Zhao
August 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/bn436fy2758
© 2011 by Lei Zhao. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Thomas Cover, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Itschak Weissman, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Abbas El-Gamal
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Preface
The source coding theorem and the channel coding theorem, first established by Shannon in 1948, are the two pillars of information theory. The insight obtained from Shannon's work greatly changed the way modern communication systems are conceived and built. As Shannon's original ideas were absorbed by researchers, the mathematical tools of information theory were put to great use in statistics, portfolio theory, complexity theory, and probability theory.
In this work, we explore the area of common randomness generation, where remote
nodes use nature’s correlated random resource and communication to generate a
random variable in common. In particular, we investigate the initial efficiency of
common randomness generation as the communication rate goes down to zero, and
the saturation efficiency as the communication exhausts nature’s randomness. We
also consider the setting where some of the nodes can generate action sequences to
influence part of nature’s randomness.
Finally, we consider actions in the framework of source coding. Tools from channel coding and distributed source coding are combined to establish the fundamental limit of compression with actions.
Acknowledgements
The five years I spent at Stanford pursuing my Ph.D. have been a very pleasant and fulfilling journey, and it was my advisor, Thomas Cover, who made it possible. His weekly round-robin group meeting was the best place for research discussion and was also full of interesting puzzles and stories. He revealed the pearls of information theory as well as statistics through beautiful examples, and always encouraged me on every small finding I obtained. It was a privilege to work with him, and I thank him for his support and guidance.
I am also truly grateful to Professor Tsachy Weissman, who taught me amazing
universal schemes in information theory and was always willing to let me do “random
drawing” on his white boards. I really like his way of asking have-we-convinced-
ourselves questions, which often led to surprisingly simple yet insightful discoveries.
Professor Abbas El Gamal has had a great influence on me, and I would like to extend my sincere thanks to him. His broad knowledge of network information theory and his teaching of EE478 were invaluable to my research.
I would like to thank my colleagues at Stanford, especially, Himanshu Asnani,
Bernd Bandemer, Yeow Khiang Chia, Paul Cuff, Shirin Jalali, Gowtham Kumar,
Vinith Misra, Alexandros Manolakos, Taesup Moon, Albert No, Idoia Ochoa, Haim
Permuter, Han-I Su, and Kartik Venkat.
Last but not least, I am grateful to my family. I thank my parents for their
constant support and love. I thank my wife for her love, and for completing my life.
Contents

Preface
Acknowledgements

1 Introduction

2 Hirschfeld-Gebelein-Renyi maximal correlation
  2.1 HGR correlation
  2.2 Examples
    2.2.1 Doubly symmetric binary source
    2.2.2 Z-channel with Bern(1/2) input
    2.2.3 Erasure channel

3 Common randomness generation
  3.1 Common randomness and efficiency
    3.1.1 Common randomness and common information
    3.1.2 Continuity at R = 0
    3.1.3 Initial efficiency (R ↓ 0)
    3.1.4 Efficiency at R ↑ H(X|Y) (saturation efficiency)
  3.2 Examples
    3.2.1 DBSC(p) example
    3.2.2 Gaussian example
    3.2.3 Erasure example
  3.3 Extensions
    3.3.1 CR per unit cost
    3.3.2 Secret key generation
    3.3.3 Non-degenerate V
    3.3.4 Broadcast setting

4 Common randomness generation with actions
  4.1 Common randomness with action
  4.2 Example
  4.3 Efficiency
    4.3.1 Initial efficiency
    4.3.2 Saturation efficiency
  4.4 Extensions

5 Compression with actions
  5.1 Introduction
  5.2 Definitions
    5.2.1 Lossless case
    5.2.2 Lossy case
    5.2.3 Causal observations of state sequence
  5.3 Lossless case
    5.3.1 Lossless, noncausal compression with action
    5.3.2 Lossless, causal compression with action
    5.3.3 Examples
  5.4 Lossy compression with actions

6 Conclusions

A Proofs of Chapter 2
  A.1 Proof of the convexity of ρ(P_X ⊗ P_{Y|X}) in P_{Y|X}

B Proofs of Chapter 3
  B.1 Proof of the continuity of C(R) at R = 0

C Proofs of Chapter 4
  C.1 Converse proof of Theorem 5
  C.2 Proof for initial efficiency with actions
  C.3 Proof of Lemma 5

D Proofs of Chapter 5
  D.1 Proof of Lemma 6
  D.2 Proof of Lemma 7

Bibliography
List of Tables

List of Figures

1.1 Generate common randomness: K = K(X^n), K' = K'(Y^n) satisfying P(K = K') → 1 as n → ∞. What is the maximum common randomness per symbol, i.e., what is sup (1/n)H(K)?
2.1 ρ²(X;Y) as a function of θ for the first P_{Y|X}
2.2 ρ²(X;Y) as a function of θ for the second P_{Y|X}
2.3 X ∼ Bern(1/2)
2.4 ρ(X;Y) = 1 − 2 min{p, 1 − p}
2.5 Z-channel
2.6 ρ(X;Y) = √((1 − p)/(1 + p))
2.7 Erasure channel
3.1 Common randomness capacity: (X_i, Y_i) are i.i.d. Node 1 generates a r.v. K based on the X^n sequence it observes. It also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 generates a r.v. K' based on the Y^n sequence it observes and M. We require that P(K = K') approach 1 as n goes to infinity. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
3.2 The probability structure of U_n
3.3 DBSC example: X ∼ Bern(1/2), p_{Y|X}(x|x) = 1 − p, p_{Y|X}(1 − x|x) = p
3.4 C(R) for p = 0.08
3.5 Gaussian example
3.6 Auxiliary r.v. U in the Gaussian example
3.7 Gaussian example: C(R) for N = 0.5
3.8 Erasure example
3.9 Erasure example: C–R curve
3.10 Common randomness per unit cost
3.11 Secret key generation
3.12 CR broadcast setup
4.1 Common randomness capacity with actions: {X_i} is an i.i.d. source. Node 1 generates a r.v. K based on the X^n sequence it observes. It also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 first generates an action sequence A^n as a function of M and receives a sequence of side information Y^n, where Y^n | (A^n, X^n) ∼ p(y|a, x). Node 2 then generates a r.v. K' based on M and the Y^n sequence it observes. We require P(K = K') to be close to 1. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
4.2 CR with action example
4.3 Correlate A with X
4.4 CR with action example: option one, set A ⊥ X; option two, correlate A with X
5.1 Compression with actions. The action encoder first observes the state sequence S^n and then generates an action sequence A^n. The ith output Y_i is the output of a channel p(y|a, s) when a = A_i and s = S_i. The compressor generates a description M of 2^{nR} bits to describe Y^n. The remote decoder generates a reconstruction of Y^n based on M and its available side information Z^n
5.2 Binary example with side information Z = ∅
5.3 The threshold b* solves (H2(b) − H2(p))/b = dH2(b)/db, b ∈ [0, 1/2]
5.4 Comparison between the non-causal and causal rate-cost functions; the parameter of the Bernoulli noise is set at 0.1
Chapter 1
Introduction
Given a pair of random variables (X, Y ) with joint distribution p(x, y), what do they
have in common? Different quantities can be justified as the right measure of “common” in different settings. For example, in linear estimation, correlation determines the minimum mean square error (MMSE) when we use one random variable to estimate the other, and the MMSE suggests that the larger the absolute value of the correlation, the more “commonness” X and Y have. In information theory, insight
about p(x, y) can often be gained when independent and identically distributed (i.i.d.) copies (X_i, Y_i), i = 1, ..., n, are considered. In source coding with side information, the celebrated Slepian-Wolf theorem [21] shows that when compressing {X_i}_{i=1}^n losslessly, the rate reduction from having side information {Y_i}_{i=1}^n is the mutual information I(X;Y) between X and Y. It makes a lot of sense that a large rate reduction suggests
a lot in common between X and Y , which indicates that I(X ; Y ) is a good measure.
A more direct attempt to address commonness was first considered by Gacs and Korner in [10]. In their setting, illustrated in Fig. 1.1, nature generates (X^n, Y^n) ∼ i.i.d. p(x, y). Node 1 observes X^n, and Node 2 observes Y^n. The task is for the two
nodes to generate common randomness (CR), i.e., a random variable K in common.
The entropy of the common random variable is the number of common bits generated from nature's resource at either node. The supremum of the normalized entropy, (1/n)H(K), is defined as the common information between X and Y. It would be an
extremely interesting measure of commonness if not for the fact that it is zero for a
Figure 1.1: Generate common randomness: K = K(X^n), K' = K'(Y^n) satisfying P(K = K') → 1 as n → ∞. What is the maximum common randomness per symbol, i.e., what is sup (1/n)H(K)?
large class of joint distributions. Witsenhausen [28] used Hirschfeld-Gebelein-Renyi
maximal correlation (HGR correlation) to sharpen the result by Gacs and Korner.
Surprisingly, if the HGR correlation between X and Y is strictly less than 1, not a
single bit in common can be generated by the two nodes.
In this thesis, we investigate the role of HGR correlation in common randomness
generation with a rate-limited communication link between Node 1 and Node 2, with
and without actions. In particular, we link the HGR correlation with initial efficiency,
i.e., the initial rate of common randomness unlocked by communication, thus giving
an operational justification of using HGR correlation as a measure of commonness.
Furthermore, we extend common randomness generation to the setting where one
node can take actions to affect the side information. A single letter expression for
common randomness capacity is obtained, based on which the initial efficiency and
saturation efficiency are derived. The maximum HGR correlation conditioned on a
fixed action determines the initial efficiency.
In the last chapter we consider the problem of compression with actions. While
traditionally in source coding, nature fixes the source distribution, in our setting, we
introduce the idea of using actions to affect nature’s source.
Notation: We use capital letter X to denote a random variable, small letter x
to denote the corresponding realization, calligraphic letter X to denote the alphabet
of X , and |X | to denote the cardinality of the alphabet. The subscripts in joint
distributions are mostly omitted; for example, p_{XY}(x, y) is written as p(x, y). To emphasize the probability structure, we sometimes write the joint distribution as P_X ⊗ P_{Y|X}, where P_X is the marginal of X and P_{Y|X} is the conditional distribution of Y given X. We use X ⊥ Y to indicate that X and Y are independent, and X − Y − Z to indicate that X and Z are conditionally independent given Y. Subscripts and superscripts are used in the standard way: X^n = (X_1, ..., X_n) and X_i^j = (X_i, ..., X_j). Most of the notation follows [8].
Chapter 2
Hirschfeld-Gebelein-Renyi
maximal correlation
2.1 HGR correlation
We focus on random variables with finite alphabet.
Definition 1. The HGR correlation [12, 14, 18] between two random variables (r.v.'s) X and Y, denoted ρ(X;Y), is defined as

ρ(X;Y) = max E[g(X)f(Y)]   (2.1)
subject to E[g(X)] = 0, E[f(Y)] = 0,
E[g²(X)] ≤ 1, E[f²(Y)] ≤ 1.

If neither X nor Y is degenerate, i.e., a constant, then the inequalities in the constraints can be replaced by equalities. An equivalent characterization was proved by Renyi in [18]:

ρ²(X;Y) = sup_{E[g(Y)]=0, E[g²(Y)]≤1} E[E²(g(Y)|X)].   (2.2)
Note that the HGR correlation is a function of the joint distribution p(x, y) alone and does not depend on the particular values that X and Y take. We sometimes use ρ(p(x, y)) or ρ(P_X ⊗ P_{Y|X})
to emphasize the joint probability distribution. The HGR correlation shares quite a few properties with mutual information I(X;Y):
• Positivity [18]: 0 ≤ ρ(X ; Y ) ≤ 1
◦ ρ(X ; Y ) = 0 iff X⊥Y .
◦ ρ(X ; Y ) = 1 iff there exists a non-degenerate random variable V such that
V is both a function of X and a function of Y .
• Data processing inequality: If X, Y, and Z form a Markov chain X − Y − Z, then ρ(X;Y) = ρ(X;Y,Z) ≥ ρ(X;Z).
Proof. Consider any function g such that E[g(X)] = 0 and E[g²(X)] = 1. By the Markov chain X − Y − Z, E[E²(g(X)|Y)] = E[E²(g(X)|Y, Z)]. Thus, using the alternative characterization (2.2), we have

ρ(X;Y) = ρ(X;Y,Z) ≥ ρ(X;Z).
• Convexity: For fixed P_X, ρ²(P_X ⊗ P_{Y|X}) is convex in P_{Y|X}.

Proof. Consider r.v.'s X, Y_1, Y_2. Let 0 < λ < 1, let Θ = 1 w.p. λ and Θ = 2 w.p. 1 − λ, where Θ is independent of (X, Y_1, Y_2), and let Y = Y_Θ. We have

ρ²(X; Y_Θ) ≤ ρ²(X; Y_Θ, Θ) ≤ λ ρ²(X; Y_1) + (1 − λ) ρ²(X; Y_2),

where the last inequality comes from the following lemma.

Lemma 1. Assume X ⊥ Z, where Z has a finite alphabet Z, and let ρ(X; Y, Z) be the Renyi correlation between X and (Y, Z). Then

ρ²(X; Y, Z) ≤ Σ_z P_Z(z) ρ²(X; Y | Z = z),   (2.3)

where ρ(X; Y | Z = z) = ρ(P_X ⊗ P_{Y|X, Z=z}).
Proof. See Appendix A.1.
However, we note here that ρ²(P_X ⊗ P_{Y|X}) is not concave in P_X for fixed P_{Y|X}, which differs from mutual information. We provide a numerical example. Consider P_1 = [1/2, 1/4, 1/4]^T and P_2 = [1/3, 1/3, 1/3]^T, and let P_θ = θP_1 + (1 − θ)P_2. We show plots of ρ²(P_θ ⊗ P_{Y|X}) as a function of θ for two different P_{Y|X} matrices in the following figures:
Figure 2.1: ρ²(X;Y) as a function of θ for the first P_{Y|X}.
Figure 2.2: ρ²(X;Y) as a function of θ for the second P_{Y|X}.

For Figure 2.1,

P_{Y|X} = [ 0.0590 0.4734 0.4677
            0.3252 0.2415 0.4333
            0.1778 0.6230 0.1992 ],

and for Figure 2.2,

P_{Y|X} = [ 0.3162 0.6139 0.0699
            0.6351 0.2702 0.0948
            0.5519 0.3570 0.0911 ].
2.2 Examples
In this section, we calculate the HGR correlation for a few simple examples.
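Before working the examples by hand, note that for finite alphabets ρ(X;Y) admits a simple numerical computation via a standard fact (due to Witsenhausen, not restated in this chapter): ρ(X;Y) equals the second-largest singular value of the matrix B(x, y) = p(x, y)/√(p(x)p(y)), whose largest singular value is always 1. A minimal sketch in Python (the function name `hgr` is ours; numpy assumed available):

```python
import numpy as np

def hgr(P):
    """HGR maximal correlation of a finite joint pmf P[x, y] = p(x, y).

    rho(X;Y) is the second-largest singular value of
    B[x, y] = p(x, y) / sqrt(p(x) p(y)); the largest is always 1.
    """
    P = np.asarray(P, dtype=float)
    px, py = P.sum(axis=1), P.sum(axis=0)         # marginals of X and Y
    B = P / np.sqrt(np.outer(px, py))
    return np.linalg.svd(B, compute_uv=False)[1]  # values sorted descending

# Doubly symmetric binary source (Section 2.2.1): rho = 1 - 2 min{p, 1-p}
p = 0.1
dsbs = 0.5 * np.array([[1 - p, p], [p, 1 - p]])
print(hgr(dsbs))                                  # ≈ 0.8 (= 1 - 2p)

# Z-channel with Bern(1/2) input (Section 2.2.2): rho = sqrt((1-p)/(1+p))
zch = 0.5 * np.array([[1.0, 0.0], [p, 1 - p]])
print(hgr(zch), np.sqrt((1 - p) / (1 + p)))       # the two agree
```

The output matches the closed forms derived below.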
2.2.1 Doubly symmetric binary source
Let X be a Bern(1/2) r.v. and let Y be the output of a binary symmetric channel with crossover probability p and input X, as shown in Fig. 2.3. Since X and Y are binary, the HGR correlation can be easily computed as [9]

ρ(X;Y) = 1 − 2 min{p, 1 − p}.

When p = 1/2, X and Y are independent and ρ = 0; when p = 0 or p = 1, X and Y are essentially identical and ρ achieves its maximum value 1. These values agree with one's intuition about a commonness measure on X and Y.

Figure 2.3: Doubly symmetric binary source: X ∼ Bern(1/2) through a BSC(p).
Figure 2.4: ρ(X;Y) = 1 − 2 min{p, 1 − p} as a function of p.
2.2.2 Z-Channel with Bern(1/2) input
Let X be a Bern(1/2) r.v., and let X and Y be related through the Z-channel with parameter p shown in Fig. 2.5. The HGR correlation can be computed as

ρ(X;Y) = √((1 − p)/(1 + p)).

Note that ρ²(X;Y) = −1 + 2/(1 + p), which is a convex function of p.

Figure 2.5: Z-channel: p_{Y|X}(0|0) = 1, p_{Y|X}(0|1) = p, p_{Y|X}(1|1) = 1 − p.
Figure 2.6: ρ(X;Y) = √((1 − p)/(1 + p)) as a function of p.
2.2.3 Erasure Channel
Let (X, Y) be a pair of random variables with general joint distribution p(x, y), and let Ỹ be an erased version of Y with erasure probability q, as shown in Fig. 2.7.

Figure 2.7: Erasure channel: Ỹ = Y w.p. 1 − q and Ỹ = e w.p. q.

It is known that an erasure erases a fraction q of the information between X and Y, i.e., I(X;Ỹ) = (1 − q)I(X;Y) [4]. Interestingly, a similar property holds for the HGR correlation (the squared HGR correlation, to be precise), as proved in the following lemma:

Lemma 2.

ρ²(X;Ỹ) = (1 − q)ρ²(X;Y).
Proof. If either X or Y is degenerate, the claim is trivial; thus, we only consider the case where neither X nor Y is degenerate. Define Θ = 1{Ỹ = e}, and note that Θ is independent of X. For any g and f such that E[g(X)] = 0, E[g²(X)] = 1, E[f(Ỹ)] = 0, and E[f²(Ỹ)] = 1, we have

E[g(X)f(Ỹ)]
= E_Θ E[g(X)f(Ỹ)|Θ]
= q E[g(X)f(e)|Θ = 1] + (1 − q) E[g(X)f(Y)|Θ = 0]
(a)= q f(e) E[g(X)] + (1 − q) E[g(X)f(Y)]
(b)= (1 − q) E[g(X)f(Y)]
= (1 − q) E[g(X)] E[f(Y)] + (1 − q) E[(g(X) − E[g(X)])(f(Y) − E[f(Y)])]
(c)= (1 − q) E[(g(X) − E[g(X)])(f(Y) − E[f(Y)])]
(d)≤ (1 − q) √(Var(g(X)) Var(f(Y))) ρ(X;Y)
(e)≤ (1 − q) √1 √(1/(1 − q)) ρ(X;Y)
= √(1 − q) ρ(X;Y),

where (a) is due to the independence of Θ and X; (b) and (c) are due to the fact E[g(X)] = 0; (d) comes from the definition of the HGR correlation between X and Y; and (e) is because

1 = E[f²(Ỹ)]
= E_Θ E[f²(Ỹ)|Θ]
= q f²(e) + (1 − q) E[f²(Y)]
≥ (1 − q) E[f²(Y)]
≥ (1 − q) Var(f(Y)).

Equality is achieved by setting g = g* and f(ỹ) = (1/√(1 − q)) f*(ỹ) for ỹ ∈ Y and f(e) = 0, where g* and f* are the functions achieving the HGR correlation between X and Y. Thus ρ²(X;Ỹ) = (1 − q)ρ²(X;Y).
There are some other interesting non-trivial examples in the literature:
• Gaussian [16]: If (X, Y) are jointly Gaussian with correlation coefficient r, then ρ(X;Y) = |r|.

• Partial sums [7]: Let Y_i, i = 1, ..., n, be i.i.d. r.v.'s with finite variance, and let S_k = Σ_{i=1}^k Y_i, k = 1, ..., n, be the partial sums. Then we have

ρ(S_k; S_n) = √(k/n).
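The partial-sums result can likewise be checked numerically for a concrete i.i.d. sequence. The sketch below (our own construction) takes Y_i ∼ Bern(t), builds the exact joint pmf of (S_k, S_n) from the independence of S_k and S_n − S_k, and computes the maximal correlation as the second-largest singular value of p(a, b)/√(p(a)p(b)):

```python
import numpy as np
from math import comb

def hgr(P):
    """Maximal correlation from a joint pmf matrix."""
    px, py = P.sum(axis=1), P.sum(axis=0)
    s = np.linalg.svd(P / np.sqrt(np.outer(px, py)), compute_uv=False)
    return s[1]

def binom_pmf(m, t):
    """pmf of Binomial(m, t) as a length-(m+1) array."""
    return np.array([comb(m, i) * t**i * (1 - t) ** (m - i) for i in range(m + 1)])

k, n, t = 3, 7, 0.4
head, tail = binom_pmf(k, t), binom_pmf(n - k, t)

# P[a, b] = P(S_k = a, S_n = b) = P(S_k = a) P(S_n - S_k = b - a)
P = np.zeros((k + 1, n + 1))
for a in range(k + 1):
    P[a, a : a + n - k + 1] = head[a] * tail

print(hgr(P), np.sqrt(k / n))   # both ≈ 0.6547
```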
Chapter 3
Common randomness generation
3.1 Common randomness and efficiency
If the HGR correlation between X and Y is close to 1, intuitively there is a lot
in common between the two. The obstacle of generating common randomness in
Fig. 1.1 is the lack of communication between the two nodes. It turns out that a
communication link can greatly facilitate common randomness generation.
This setting was first considered by Ahlswede and Csiszar [2]. The system has two
nodes. Node 1 observes Xn and Node 2 observes Y n, where (Xn, Y n) ∼ i.i.d. p(x, y).
An (n, 2^{nR}) scheme, shown in Fig. 3.1, consists of

• a message encoding function f_m: X^n → [1 : 2^{nR}], M = f_m(X^n),
• a CR encoding function at Node 1, f_1: K = f_1(X^n),
• a CR encoding function at Node 2, f_2: K' = f_2(M, Y^n).
Definition 2. A common randomness-rate pair (C,R) is said to be achievable if there
exists a sequence of schemes at rate R such that
• P(K = K') → 1 as n → ∞,
• lim inf_{n→∞} (1/n) H(K) ≥ C,
• lim sup_{n→∞} (1/n) H(K|K') = 0.¹
In words, the entropy of K measures the amount of common randomness those
two nodes can generate.
Figure 3.1: Common randomness capacity: (X_i, Y_i) are i.i.d. Node 1 generates a r.v. K based on the X^n sequence it observes; it also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 generates a r.v. K' based on the Y^n sequence it observes and M. We require that P(K = K') approach 1 as n goes to infinity. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
Definition 3. The supremum of all the common randomness achievable at rate R is
defined as the common randomness capacity at rate R. That is
C(R) = sup{C : (C, R) is achievable}.
Theorem 1. [2] The common randomness capacity at rate R is

C(R) = max_{p(u|x): R ≥ I(X;U) − I(Y;U)} I(X;U).   (3.1)

If private randomness generation is allowed at Node 1, then

C(R) = { max_{p(u|x): I(X;U) − I(Y;U) ≤ R} I(X;U),   R ≤ H(X|Y);
       { R + I(X;Y),                                  R > H(X|Y).   (3.2)
¹This is a technical condition to constrain the cardinality of K. Mathematically, the condition guarantees that the converse proof works out.
The C(R) curve is thus a straight line with slope 1 for R > H(X|Y ). Although we
focus on 0 ≤ R ≤ H(X|Y ) in this thesis, in most of the figures we plot the straight
line part for completeness.
We note here that computing C(R) is closely related to the information bottleneck method developed in [25]. The use of common randomness in generating coordinated actions is discussed in detail in [6].
3.1.1 Common randomness and common information
Let us clarify the relation between common randomness and common information.
Definition 4. [10] The maximum common r.v. V between X and Y satisfies:
• There exist functions g and f such that V = g(X) = f(Y).
• For any V ′ such that V ′ = g′(X) = f ′(Y ) for some deterministic functions f ′
and g′, V ′ is a function of V .
Definition 5. [10] The common information between X and Y is defined as H(V )
where V is the maximum common r.v. between X and Y .
It turns out that common information is equal to common randomness at rate 0,
i.e., H(V ) = C(0) [5].
Lemma 3. It is without loss of optimality to assume that V is a function of U when optimizing max_{p(u|x): R ≥ I(X;U) − I(Y;U)} I(X;U).
Proof. For any U such that I(X;U) − I(Y;U) ≤ R and U − X − Y hold, we can construct a new auxiliary r.v. U' = (U, V). Note that:

• The Markov chain U' − X − Y holds.

• The rate constraint is preserved:

I(X;U') − I(Y;U')
= I(X;U,V) − I(Y;U,V)
= I(X;U) + I(X;V|U) − I(Y;U) − I(Y;V|U)
= I(X;U) + H(V|U) − I(Y;U) − H(V|U)
= I(X;U) − I(Y;U)
≤ R,

where the third equality holds because V is a function of X and also a function of Y, so I(X;V|U) = H(V|U) and I(Y;V|U) = H(V|U).

• The common randomness generated does not decrease: I(X;U') ≥ I(X;U).

Thus using U' as the new auxiliary r.v. preserves the rate and does not decrease the common randomness.
3.1.2 Continuity at R = 0
The C(R) curve is concave for R ≥ 0 and thus continuous for R > 0. The following theorem establishes the continuity at R = 0.
Theorem 2. The common randomness capacity as a function of the communication rate R is continuous at R = 0, i.e., lim_{R↓0} C(R) = C(0).

Proof. See Appendix B.1.
The value of C(0) is equal to the common information defined in [10]. We note
here that C(0) > 0 if and only if ρ(X ; Y ) = 1.
3.1.3 Initial Efficiency (R ↓ 0)
If the commonness between X and Y is large, then it is natural to expect that the
first few bits of communication should be able to unlock a huge amount of common
randomness. It is indeed the case as shown in the following theorem. Furthermore, the
HGR correlation ρ plays the key role in the characterization of the initial efficiency.
Theorem 3. The initial efficiency of common randomness generation is characterized as

lim_{R↓0} C(R)/R = 1/(1 − ρ²(X;Y)).
In words, the initial efficiency is the number of bits of common randomness unlocked per bit of communication as the rate goes to zero.
Comments:
• Since ρ(X ; Y ) = ρ(Y ;X), the slope is symmetric in X and Y . Thus if we reverse
the direction of the communication link, i.e., the message is sent from Node 2
to Node 1 in Fig. 3.1, the initial efficiency remains the same.
• The initial efficiency increases with the HGR correlation ρ between X and Y .
Without communication, as long as ρ < 1, the common randomness capacity is
0. But with communication, the first few bits can “unlock” a huge amount of
common randomness if ρ(X ; Y ) is close to 1.
Proof. If ρ(X;Y) = 1, then C(0) > 0, which yields an infinite slope. For the case ρ(X;Y) < 1, we have

lim_{R↓0} C(R)/R
(a)= sup_{p(u|x)} I(X;U) / (I(X;U) − I(Y;U))   (3.3)
(b)= 1 / (1 − sup_{p(u|x)} I(Y;U)/I(X;U))
(c)= 1 / (1 − ρ²(X;Y)),

where (a) comes from the fact that C(R) is a concave function; (b) holds because 1/(1 − x) is monotonically increasing for x ∈ [0, 1); and (c) comes from the following lemma [9].

Lemma 4. [9] sup_{p(u|x)} I(Y;U)/I(X;U) = ρ²(X;Y).
3.1.4 Efficiency at R ↑ H(X|Y ) (saturation efficiency)
At R = H(X|Y), C(R) reaches its maximum value H(X).² That is the point where X^n is losslessly known at Node 2; in other words, nature's resource is exhausted by the system. It is of interest to check the slope of C(R) as R goes up to H(X|Y).
A natural guess is 1, since one pure random bit (which is independent of nature’s
(Xn, Y n)) sent over the communication link can yield 1 bit in common between the
two nodes. As shown in the erasure example in the next section, this guess is not
correct in general. Here, we provide a sufficient condition for the saturation slope to
be 1.
Theorem 4. The efficiency of common randomness generation at R = H(X|Y) is 1 if there exist x1, x2 ∈ X such that for all y ∈ Y, p(x1, y) > 0 implies p(x2, y) > 0.³
Proof. We have

lim_{R↑H(X|Y)} (C(H(X|Y)) − C(R)) / (H(X|Y) − R)   (3.4)
(a)= inf_{p(u|x)} (C(H(X|Y)) − I(X;U)) / (H(X|Y) − (I(X;U) − I(Y;U)))
(b)= inf_{p(u|x)} (H(X) − I(X;U)) / (H(X|Y) − (I(X;U) − I(Y;U)))
= inf_{p(u|x)} H(X|U) / (H(X|U) − (H(Y|U) − H(Y|X)))
= inf_{p(u|x)} 1 / (1 − (H(Y|U) − H(Y|X))/H(X|U))
(c)= 1 / (1 − inf_{p(u|x)} (H(Y|U) − H(Y|X))/H(X|U)),

where (a) comes from the concavity of C(R); (b) holds because C(H(X|Y)) = H(X); and (c) follows from the monotonicity of 1/(1 − x) for x ∈ [0, 1).
²If private randomness is allowed, C(R) is a straight line with slope 1 for R > H(X|Y) [2]. The result in this section thus gives a sufficient condition for the slope at R = H(X|Y) to be continuous.
³If two input letters have the same conditional distribution p(y|x), we view them as one letter; letters with zero probability are discarded.
The next step is to show that inf_{p(u|x)} (H(Y|U) − H(Y|X))/H(X|U) = 0 under the condition given in the theorem. First note that H(Y|U) − H(Y|X) ≥ 0 because of U − X − Y, so the infimum is at least 0. Without loss of generality, assume X = {1, 2, ..., M}, Y = {1, 2, ..., N}, and that p(x = 1, y) > 0 implies p(x = 2, y) > 0.

Choose a sequence of positive numbers ε_n converging to 0, and construct a sequence of r.v.'s U_n with alphabet {1, ..., M} such that

• P_{U_n}(1) = P_X(1) − (ε_n/(1 − ε_n)) P_X(2),
• P_{U_n}(2) = (1/(1 − ε_n)) P_X(2),
• P_{U_n}(u) = P_X(u), u = 3, ..., M,

with the channel from U_n to X illustrated in Fig. 3.2: U_n = 1 maps to X = 1 with probability 1, U_n = 2 maps to X = 1 with probability ε_n and to X = 2 with probability 1 − ε_n, and U_n = u maps to X = u for u ≥ 3. These are valid joint distributions because the marginal distribution of X is preserved.
Figure 3.2: The probability structure of U_n.
As n goes to infinity, it can be shown that

• the denominator vanishes as H(X|U_n) = Θ(−ε_n log ε_n);
• the numerator vanishes linearly: H(Y|U_n) − H(Y|X) = Θ(ε_n).

Thus lim_{n→∞} (H(Y|U_n) − H(Y|X)) / H(X|U_n) = 0, which completes the proof.
For convenience, we introduce saturation efficiency in the following way:
Definition 6. The slope of C(R) when R approaches Rm from below is defined as
the saturation efficiency, where Rm is the threshold such that C(Rm) = H(X).
3.2 Examples
3.2.1 DBSC(p) example
Let X be a Bernoulli(1/2) random variable and let Y be the output of a BSC with crossover probability p < 1/2 and input X, as shown in Fig. 3.3.

Figure 3.3: DBSC example: X ∼ Bern(1/2), p_{Y|X}(x|x) = 1 − p, p_{Y|X}(1 − x|x) = p.
Write H(X|U) = H2(α) for some α ∈ [0, 1/2]. Mrs. Gerber's Lemma [29] provides the following lower bound on H(Y|U):

H(Y|U) ≥ H2(H2⁻¹(H(X|U)) ∗ p) = H2(α ∗ p),   (3.5)

where α ∗ p = α(1 − p) + (1 − α)p. Thus

I(X;U) = H(X) − H(X|U) = 1 − H2(α),
I(X;U) − I(Y;U) = H(X) − H(X|U) − H(Y) + H(Y|U)
= H(Y|U) − H(X|U)
≥ H2(α ∗ p) − H2(α).

Equality is achieved by taking U to be the output of a BSC(α) with input X, i.e., U = X w.p. 1 − α and U = 1 − X w.p. α.
We can write C(R) in parametric form:
C = 1−H2(α) (3.6)
R = H2(α ∗ p)−H2(α), (3.7)
for α ∈ [0, 1/2]. Fig. 3.4 shows C(R) for p = 0.08.
Figure 3.4: C(R) for p = 0.08.
The initial efficiency is

lim_{R↓0} C(R)/R = lim_{α↑1/2} (1 − H2(α)) / (H2(α ∗ p) − H2(α)) (3.8)

= lim_{α↑1/2} −log2((1−α)/α) / [(1 − 2p) log2((1 − (1−2p)α − p)/((1−2p)α + p)) − log2((1−α)/α)]

= lim_{α↑1/2} (1/(1−α) + 1/α) / [(1 − 2p)(−(1−2p)/(1 − (1−2p)α − p) − (1−2p)/((1−2p)α + p)) + (1/(1−α) + 1/α)]

= 1/(1 − (1 − 2p)²),

where the last two equalities follow from L'Hôpital's rule. Note that the HGR correlation between X and Y is 1 − 2p, so this is consistent with the expression 1/(1 − ρ²(X;Y)).
The saturation efficiency, C′(R−) as R approaches H(X|Y), is

lim_{R↑H(X|Y)} (C(H(X|Y)) − C(R)) / (H(X|Y) − R)

= lim_{α↓0} (1 − (1 − H2(α))) / (H2(p) − (H2(α ∗ p) − H2(α)))

= lim_{α↓0} H2(α) / (H2(p) − H2(α ∗ p) + H2(α))

= lim_{α↓0} log2((1−α)/α) / [−(1 − 2p) log2((1 − (1−2p)α − p)/((1−2p)α + p)) + log2((1−α)/α)]

= lim_{α↓0} (log2 α)/(log2 α) = 1.
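Both limits can be checked numerically from the parametric form (3.6)–(3.7). The sketch below evaluates the slope of C(R) near the two endpoints; the sample values of α and the tolerances are arbitrary choices.

```python
import numpy as np

def H2(a):
    """Binary entropy in bits."""
    a = np.asarray(a, dtype=float)
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

def conv(a, p):
    """Binary convolution a * p = a(1-p) + (1-a)p."""
    return a * (1 - p) + (1 - a) * p

p = 0.08

# Initial efficiency: alpha near 1/2 gives R near 0.
alpha = 0.5 - 1e-4
init_eff = (1 - H2(alpha)) / (H2(conv(alpha, p)) - H2(alpha))
print(init_eff, 1 / (1 - (1 - 2 * p) ** 2))   # both ~ 3.3967

# Saturation efficiency: slope of C(R) as R approaches H(X|Y) = H2(p) from below.
a1, a2 = 1e-4, 2e-4   # alpha near 0; convergence to the limit 1 is only logarithmic
C = lambda a: 1 - H2(a)
R = lambda a: H2(conv(a, p)) - H2(a)
slope = (C(a2) - C(a1)) / (R(a2) - R(a1))
print(slope)          # > 1, approaching 1 slowly as alpha -> 0
```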
3.2.2 Gaussian example
Although we mainly consider discrete random variables with finite alphabet, the
results can be extended to continuous random variables as well. In this section, we
consider a Gaussian example. Let Y = X + Z, where X ∼ N (0, 1), Z ∼ N (0, N),
and X and Z are independent, illustrated in Fig. 3.5.
Figure 3.5: Gaussian example: Y = X + Z, with X ∼ N(0, 1) and Z ∼ N(0, N) independent.

Let h(X|U) = (1/2) log2(2πeα) for some 0 < α ≤ 1. The entropy power inequality [19]
gives the following lower bound on h(Y|U):

h(Y|U) ≥ (1/2) log2(2^{2h(X|U)} + 2^{2h(Z|U)}) (3.9)
= (1/2) log2(2πeα + 2πeN)
= (1/2) log2(2πe(α + N)).
Equality can be achieved by X = U + U′, where U ⊥ U′, U ∼ N(0, 1 − α) and U′ ∼ N(0, α), as shown in Fig. 3.6.

Figure 3.6: Auxiliary r.v. U in the Gaussian example.
We write C(R) in parametric form:

C = −(1/2) log2 α, (3.10)
R = (1/2) log2((α + N)/((1 + N)α)), (3.11)

for α ∈ (0, 1]. Fig. 3.7 shows the case N = 0.5.
Figure 3.7: Gaussian example: C(R) for N = 0.5.

The initial efficiency is calculated in the following way:

lim_{R↓0} C(R)/R = lim_{α↑1} [−(1/2) log2 α] / [(1/2) log2((α + N)/((1 + N)α))] (3.12)
= lim_{α↑1} [−1/(2α)] / [1/(2(α + N)) − 1/(2α)]
= 1 + 1/N.
Note that the ordinary correlation between X and Y is 1/√(1 + N). For a pair of jointly
Gaussian random variables, the HGR correlation is equal to the ordinary correlation
[16]. One can use Theorem 3 to obtain the same expression.
The asymptotic saturation efficiency is

lim_{R↑∞} dC(R)/dR = lim_{α↓0} (dC/dα)/(dR/dα)
= lim_{α↓0} [−1/(2α)] / [1/(2(α + N)) − 1/(2α)]
= 1.
For continuous r.v.’s, nature’s randomness is not exhausted at any finite R. It is
always more efficient to generate common randomness from nature’s resources than
from communicating private randomness generated locally.
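As in the DBSC case, both Gaussian limits can be checked numerically from the parametric form (3.10)–(3.11); the sketch below uses N = 0.5 as in Fig. 3.7, with arbitrary sample points.

```python
import numpy as np

N = 0.5

def C_of(alpha):
    """C = -(1/2) log2(alpha), alpha in (0, 1]."""
    return -0.5 * np.log2(alpha)

def R_of(alpha):
    """R = (1/2) log2((alpha + N) / ((1 + N) alpha))."""
    return 0.5 * np.log2((alpha + N) / ((1 + N) * alpha))

# Initial efficiency: alpha -> 1 gives R -> 0, and the slope tends to 1 + 1/N.
a = 1 - 1e-6
print(C_of(a) / R_of(a), 1 + 1 / N)                    # both ~ 3.0

# Asymptotic saturation efficiency: dC/dR -> 1 as alpha -> 0 (R -> infinity).
a1, a2 = 1e-9, 1.001e-9
print((C_of(a2) - C_of(a1)) / (R_of(a2) - R_of(a1)))   # ~ 1.0
```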
3.2.3 Erasure example
Let Y be a randomly erased version of X, i.e.,

Y = { X, w.p. 1 − q; e, w.p. q, }

as shown in Fig. 3.8.

Figure 3.8: Erasure example.

For any U such that U − X − Y holds, I(Y;U) = (1 − q)I(X;U) and I(X;U) − I(Y;U) = qI(X;U). Thus C(R) = R/q for 0 ≤ R ≤ H(X|Y), where H(X|Y) = q log2 |X|, as shown in Fig. 3.9.
Figure 3.9: Erasure example: C − R curve (C reaches H(X) at R = qH(X)).
The initial efficiency is therefore 1/q. Since ρ(X;X) = 1 and Y is an erased version
of X, we have ρ(X;Y) = √(1 − q). Note that 1/q = 1/(1 − ρ²(X;Y)).
The saturation efficiency is lim_{R↑H(X|Y)} C(R)/R = 1/q, which is not equal to 1.
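The value ρ(X;Y) = √(1 − q) can be verified numerically using Witsenhausen's singular-value characterization of the HGR maximal correlation (a standard fact not stated in the text): ρ(X;Y) is the second-largest singular value of the matrix Q[x, y] = p(x, y)/√(p(x)p(y)). Taking X uniform on an alphabet of size 4 is an arbitrary choice.

```python
import numpy as np

def hgr(pXY):
    """HGR maximal correlation via Witsenhausen's SVD characterization:
    the second-largest singular value of Q[x,y] = p(x,y)/sqrt(p(x)p(y))."""
    pX = pXY.sum(axis=1)
    pY = pXY.sum(axis=0)
    Q = pXY / np.sqrt(np.outer(pX, pY))
    s = np.linalg.svd(Q, compute_uv=False)   # s[0] = 1 always
    return s[1]

q = 0.3                       # erasure probability
M = 4                         # |X| = 4, X uniform (arbitrary choice)
pXY = np.zeros((M, M + 1))    # last column of Y is the erasure symbol e
for x in range(M):
    pXY[x, x] = (1 - q) / M   # Y = X  w.p. 1 - q
    pXY[x, M] = q / M         # Y = e  w.p. q

print(hgr(pXY), np.sqrt(1 - q))       # both ~ 0.8367
# Consistency with the initial efficiency: 1/q = 1/(1 - rho^2).
print(1 / q, 1 / (1 - hgr(pXY) ** 2))
```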
3.3 Extensions
3.3.1 CR per unit cost
The communication link between Node 1 and Node 2 in Fig. 3.1 is a bit pipe, which
is essentially a noiseless channel. It turns out that the common randomness capacity
remains unchanged when we replace the bit pipe with a noisy channel with the same
capacity [2]. More interestingly, one may consider the case where the channel inputs
are subject to a cost constraint β. The initial efficiency of the channel capacity C as
a function of β, i.e., the capacity per unit cost, was found in the seminal paper [26]. The
initial efficiency of the overall system, illustrated in Fig. 3.10, is thus the product of the
initial efficiency of common randomness generation and the capacity per unit cost of the channel.
Figure 3.10: Common randomness per unit cost.
Corollary 1. The initial efficiency of the system in Fig. 3.10 (common randomness per unit cost)
is equal to

(1/(1 − ρ²(X;Y))) · lim_{β↓0} C(β)/β,

where C(β) is the capacity of the channel under input cost constraint β. We refer to [26] for the calculation of lim_{β↓0} C(β)/β.
3.3.2 Secret key generation
Common randomness generation is closely related to secret key generation [1]. Suppose
there is an eavesdropper listening to the communication link (Fig. 3.11). We
would like the common randomness generated by Node 1 and Node 2 to be kept secret
from the eavesdropper. One commonly used secrecy constraint is

lim sup_{n→∞} (1/n) I(M;K) = 0,

where M ∈ [1 : 2^{nR}] is the message Node 1 sends to Node 2.
Figure 3.11: Secret key generation: the eavesdropper observes the message M, and we require (1/n)I(M;K) ≤ ǫ.
The secret key capacity is shown in [1] to be C(R) = max_{p(u|x): I(X;U)−I(Y;U) ≤ R} I(Y;U).
We can calculate the initial efficiency of the secret key capacity in the following way:

lim_{R↓0} C(R)/R = sup I(Y;U) / (I(X;U) − I(Y;U))
= sup I(X;U)/(I(X;U) − I(Y;U)) − 1
= 1/(1 − ρ²(X;Y)) − 1,

which is the initial efficiency without the secrecy constraint minus one. This makes
sense, because the eavesdropper observes every bit Node 1 communicates to Node 2.
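For the DBSC(p) example of Section 3.2.1, this expression can be illustrated numerically. Plugging the BSC(α) auxiliary of that section into the secret key formula (an illustrative choice, not a claim of optimality) gives I(Y;U) = 1 − H2(α ∗ p) and R = H2(α ∗ p) − H2(α), and the ratio near α = 1/2 approaches 1/(1 − (1 − 2p)²) − 1.

```python
import numpy as np

def H2(a):
    """Binary entropy in bits."""
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

p = 0.08
rho2 = (1 - 2 * p) ** 2                     # squared HGR correlation of the DSBS
alpha = 0.5 - 1e-5                          # alpha near 1/2, so R near 0
ap = alpha * (1 - p) + (1 - alpha) * p      # binary convolution alpha * p
sk_eff = (1 - H2(ap)) / (H2(ap) - H2(alpha))
print(sk_eff, 1 / (1 - rho2) - 1)           # both ~ 2.3967
```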
3.3.3 Non-degenerate V
If the maximum common r.v. V is not a constant, then C(0) = H(V) > 0, and the slope of C(R) as R ↓ 0
(which differs from lim_{R↓0} C(R)/R) can be calculated in the following way:

lim_{R↓0} (C(R) − C(0))/R = sup (I(X;U) − H(V)) / (I(X;U) − I(Y;U))
(a)= sup (I(X;U,V) − H(V)) / (I(X;U,V) − I(Y;U,V))
= sup I(X;U|V) / (I(X;U|V) − I(Y;U|V))
= 1/(1 − sup I(Y;U|V)/I(X;U|V))
= 1/(1 − max_v ρ²(X;Y|V = v)),

where (a) is due to Lemma 3, and the third equality uses I(X;V) = I(Y;V) = H(V), which holds since V is a common function of X and of Y.
3.3.4 Broadcast setting
The common randomness generation setup can be generalized to multiple nodes.
A broadcast setting was considered in [2], shown in Fig. 3.12. The goal is for all
three nodes to generate a random variable K in common. The common randomness
capacity is proved in [2] to be

C(R) = max_{p(u|x): I(X;U) − I(Yi;U) ≤ R, i = 1, 2} I(X;U).

Figure 3.12: CR broadcast setup.
We provide a conjecture that deals with the initial efficiency in the broadcast setting:

Conjecture 1. The initial efficiency of the setup in Fig. 3.12 is

lim_{R↓0} C(R)/R = 1/(1 − Ψ²(X;Y,Z)),

where Ψ(X;Y,Z) is a modified HGR correlation between X and (Y,Z), defined in the
following way:

Ψ(X;Y,Z) = max min{E g(X)f(Y), E g(X)h(Z)},

where the maximization is among all functions g, f and h such that E g(X) = 0, E f(Y) = 0, E h(Z) = 0, E g²(X) ≤ 1, E f²(Y) ≤ 1, E h²(Z) ≤ 1.
Proof. Achievability: Similar to the HGR correlation, there is an alternative characterization of Ψ(X;Y,Z):

Ψ²(X;Y,Z) = max min{E(E[g(X)|Y])², E(E[g(X)|Z])²},

where the maximization is among all functions g such that E g(X) = 0, E g²(X) ≤ 1.
Applying the maximizer g∗(·) in the achievability scheme in [9], one can show that
the initial efficiency 1/(1 − Ψ²(X;Y,Z)) is achievable.
Chapter 4
Common randomness generation
with actions
4.1 Common randomness with action
Recently, in the line of work by Weissman et al. [27], action was introduced as a
feature that one node can exploit to boost the performance of lossy compression.
We adopt their setting but consider common randomness generation. The setup is
shown in Fig. 4.1. Compared with the no-action case, the key difference is that after
receiving the message M, Node 2 first generates an action sequence An based on M,
i.e., An = fa(M). It then gets the side information Y n according to p(y|x, a), i.e.,
Y n|(An, Xn) ∼ ∏_{i=1}^{n} p(yi|ai, xi). One scenario where this setting applies is that Node
2 requests side information from some data center through actions: the ith action
determines the type of the side information correlated with Xi that the data center
sends back to Node 2. Node 2 then generates K′ based on both the Y n sequence
it observes and the message M(Xn) ∈ [1 : 2^{nR}], i.e., K′ = f2(Y n, M). The common
randomness capacity at rate R is defined in the same way as in the no-action case.
CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 29
Figure 4.1: Common randomness generation with actions: {Xi}i=1,... is an i.i.d. source. Node 1 generates a r.v. K based on the Xn sequence it observes. It also generates a message M and transmits it to Node 2 under rate constraint R. Node 2 first generates an action sequence An as a function of M and receives a sequence of side information Y n, where Y n|(An, Xn) ∼ p(y|a, x). Node 2 then generates a r.v. K′ based on both M and the Y n sequence it observes. We require P(K = K′) to be close to 1. The entropy of K measures the amount of common randomness the two nodes can generate. What is the maximum entropy of K?
Theorem 5. The common randomness capacity with actions at rate R is

C(R) = max I(X;A,U),

where the joint distribution is of the form p(x)p(a, u|x)p(y|a, x), and the maximization
is among all p(a, u|x) such that

I(X;A) + I(X;U|A) − I(Y;U|A) ≤ R.

The cardinality of U can be bounded by |U| ≤ |X||A| + 1.
Setting A = ∅, we recover the no-action result.
Achievability proof
Codebook generation
• Generate 2^{n(I(X;A)+ǫ)} sequences An(l1) according to ∏_{i=1}^{n} pA(ai), l1 ∈ [1 : 2^{n(I(X;A)+ǫ)}].
• For each An(l1) sequence, generate 2^{n(I(X;U|A)+ǫ)} sequences Un(l1, l2) according to ∏_{i=1}^{n} pU|A(ui|ai), l2 ∈ [1 : 2^{n(I(X;U|A)+ǫ)}].
• For each An(l1) sequence, partition the set of l2 indices into 2^{n(I(X;U|A,Y)+2ǫ)} equal-sized bins B(l3).
Encoding
For simplicity, we will assume that the encoder is allowed to randomize, but the
randomization can be readily absorbed into the codebook generation stage, and hence,
does not use up the encoder’s private randomization.
• Given xn, the encoder selects the index LA ∈ [1 : 2^{n(I(X;A)+ǫ)}] of the an(LA) sequence such that (xn, an(LA)) ∈ T(n)ǫ. If there is none, it selects an index uniformly at random from [1 : 2^{n(I(X;A)+ǫ)}]. If there is more than one such index, it selects an index uniformly at random from the set of indices l such that (xn, an(l)) ∈ T(n)ǫ.
• Given xn and the selected an(LA), the encoder then selects an index LU ∈ [1 : 2^{n(I(X;U|A)+ǫ)}] such that (xn, an(LA), un(LA, LU)) ∈ T(n)ǫ.
• The encoder sends out LA and the bin index LB ∈ [1 : 2^{n(I(X;U|A,Y)+2ǫ)}] such that LU ∈ B(LB).
Decoding
The decoder first takes actions based on the transmitted An(LA) sequence. Therefore,
Y n is generated according to Y n ∼ ∏_{i=1}^{n} p(yi|xi, ai(LA)). Given an and side information
yn, the decoder then tries to decode the LU index; that is, it looks for the unique LU
index in bin B(LB) such that (yn, an(LA), un(LA, LU)) ∈ T(n)ǫ. Finally, the decoder
declares (LA, LU) as the common indices.
Analysis of probability of error
The analysis of the probability of error follows standard arguments. An error occurs if
either of the following two events occurs.
1. (an(LA), un(LA, LU), Xn, Y n) /∈ T(n)ǫ.
2. There exists more than one LU ∈ B(LB) such that (Y n, an(LA), un(LA, LU)) ∈ T(n)ǫ.
The probability of the first error goes to zero as n goes to infinity since we generated
enough sequences to cover Xn in the codebook generation stage. The fact that the
probability of error for the second error event goes to zero as n → ∞ follows from
standard Wyner-Ziv analysis.
Analysis of common randomness rate
We analyze the common randomness rate averaged over codebooks.
H(LA, LU|C) = H(LA, LU, Xn|C) − H(Xn|C, LA, LU)
≥ H(Xn|C) − H(Xn|C, LA, LU, An(LA), Un(LA, LU))
≥ nH(X) − H(Xn|Un, An). (4.1)

The second step follows from the fact that Xn is independent of the codebook, and
the third step follows since conditioning reduces entropy. We now proceed to upper bound
H(Xn|Un, An). Define E := 1 if (Xn, Un, An) /∈ T(n)ǫ and E := 0 otherwise.
H(Xn|Un, An) ≤ H(Xn, E|Un, An)
≤ H(E) + H(Xn|E, Un, An)
≤ 1 + P(E = 0)H(Xn|E = 0, Un, An) + P(E = 1)H(Xn|E = 1, Un, An)
(a)≤ 1 + n(H(X|U,A) + δ(ǫ)) + nP(E = 1) log |X|
= n(H(X|U,A) + δ′(ǫ)). (4.2)
Here (a) follows from the fact that when E = 0, (Un, An, Xn) ∈ T(n)ǫ; hence, there are at
most 2^{n(H(X|U,A)+δ(ǫ))} possible Xn sequences. The last step follows from P(E = 1) → 0
as n → ∞, which in turn follows from the encoding scheme. Combining (4.1) with
(4.2) then gives the desired lower bound on the achievable common randomness rate:

(1/n) H(LA, LU|C) ≥ H(X) − H(X|U,A) − δ′(ǫ) = I(X;U,A) − δ′(ǫ).
Converse: See Appendix C.1.
4.2 Example
By correlating the action sequence An with Xn and communicating the action se-
quence with Node 2, we incur a communication rate cost I(X ;A). That only gen-
erates I(X ;A) in the rate of CR generation. Using 1 bit of communication to get
1 bit common randomness is of course sub-optimal, but the benefit comes in the
second stage where conditioned on the An sequence, Un is sent to Node 2. The com-
munication rate required is I(X ;U |A) − I(Y ;U |A) and the rate of CR generated is
I(X ;U |A).One greedy scheme is to simply fix the action and just repeat it (so there is no
need to communicate An). We use the following example to show explicitly that in
general this kind of scheme is suboptimal.
Let X be a r.v. uniformly distributed over the set {1, 2, 3, 4}. There are two
actions A = 1 and A = 2. The probability structure conditioned on each sequence is
shown in 4.2.
Lemma 5. For the setup in Fig. 4.2,

• setting A ⊥ X, the optimal achievable (C,R) pair is given by

C(R) = 3/2 + R/p, R ∈ [0, p/2];
Figure 4.2: CR with Action example: the channels p(y|x, a) for A = 1 and A = 2, each mapping X ∈ {1, 2, 3, 4} to Y ∈ {1, 2, 3, 4, e} with transition probabilities p and 1 − p.
• correlating A with X as shown in Fig. 4.3, the following (C,R) pair is achievable:

C(α) = 2 − α, R(α) = 1 − H2(α), α ∈ [0, 1/2].
Proof. See Appendix.

Figure 4.3: Correlate A with X: A indicates whether X ∈ {1, 2} or X ∈ {3, 4} through a BSC with crossover probability α.
It can be shown that the (C,R) pair achieved by setting A ⊥ X cannot be the
optimal one for all R in general. We illustrate this by a numerical example with p = 0.6,
with the results plotted in Fig. 4.4.
Figure 4.4: CR with action example: option one sets A ⊥ X; option two correlates A with X.
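The comparison in Fig. 4.4 can be reproduced directly from the two expressions in Lemma 5; the sketch below evaluates both options at two sample rates (arbitrary choices) for p = 0.6 and shows that neither dominates the other.

```python
import numpy as np

def H2(a):
    """Binary entropy in bits (clipped away from 0 and 1)."""
    a = np.clip(np.asarray(a, dtype=float), 1e-12, 1 - 1e-12)
    return -a * np.log2(a) - (1 - a) * np.log2(1 - a)

p = 0.6

def C_opt1(R):
    """Option one (A independent of X): C = 3/2 + R/p, valid for R in [0, p/2]."""
    return 1.5 + R / p

def C_opt2(R):
    """Option two (A correlated with X): invert R = 1 - H2(alpha)
    numerically over alpha in [0, 1/2], then C = 2 - alpha."""
    grid = np.linspace(0.0, 0.5, 200001)
    alpha = grid[np.argmin(np.abs((1 - H2(grid)) - R))]
    return 2 - alpha

print(C_opt1(0.05), C_opt2(0.05))   # option two is better at small R
print(C_opt1(0.20), C_opt2(0.20))   # option one is better here
```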
4.3 Efficiency
4.3.1 Initial Efficiency
For simplicity, we assume that ρ(PX ⊗ PY|X,A=a) < 1 for all a ∈ A. Then

lim_{R↓0} C(R)/R = sup_{p(a,u|x)} I(X;A,U) / (I(X;A) + I(X;U|A) − I(Y;U|A)) (4.3)
= 1 / (1 − sup_{p(a,u|x)} I(Y;U|A)/(I(X;A) + I(X;U|A)))
= 1 / (1 − max_{a∈A} ρ²(X;Y|A = a)),

where ρ(X;Y|A = a) = ρ(PX ⊗ PY|X,A=a) and the last step is proved in Appendix C.2.
4.3.2 Saturation efficiency
Similar to the no-action case, when the communication rate reaches the threshold at which
Xn can be losslessly reconstructed at Node 2, nature's randomness Xn is exhausted
by the system, and the maximum CR H(X) (without private randomness) is achieved.
This threshold Rm can be computed as [27]:

Rm = min_{p(a|x)} I(X;A) + H(X|A, Y).
The following theorem considers the slope of CR generation as R ↑ Rm.

Theorem 6. If there exists a p(a|x) such that

• I(X;A) + H(X|A,Y) = Rm;
• for each action a with P(A = a) > 0, there exist x1, x2 ∈ X with P(X = x1|A = a) > 0 and P(X = x2|A = a) > 0 such that p(y, x1|A = a) > 0 implies p(y, x2|A = a) > 0 for all y ∈ Y;

then

lim_{R↑Rm} dC(R)/dR = 1.

Essentially, we require the condition of the no-action setting to hold for each active
action when R ↑ Rm.
4.4 Extensions
Theorem 5 extends to the case where there is a cost function Λ and a cost constraint
Γ on the action sequence, i.e., Λ(An) = (1/n) Σ_{i=1}^{n} Λ(Ai) ≤ Γ.

Corollary 2. The common randomness capacity with rate constraint R and cost
constraint Γ is

C(R,Γ) = max_{p(a,u|x): EΛ(A) ≤ Γ, R ≥ I(X;A) + I(X;U|A,Y)} I(X;A,U).
Proof. Simply note that the achievability proof and converse carry over to this setting
directly.
Theorem 5 also extends naturally to the case where there are multiple receivers
with different side information, as illustrated below.

(Figure: broadcast version of the action setting: Node 1 observes Xn and broadcasts a rate-R message; Nodes 2 and 3 take actions An and observe Y n1 and Y n2, respectively; the three nodes generate K1, K2 and K3.)
Corollary 3. The common randomness capacity with rate constraint R, two
receivers, and side information structure (Y n1, Y n2)|(Xn, An) ∼ i.i.d. p(y1, y2|x, a) is
given by

C(R) = max_{p(a,u|x): EΛ(A) ≤ Γ, R ≥ I(X;A) + I(X;U|A,Yi), i=1,2} I(X;A,U).

Because the action sequence of each node is a function of the same message that both
receive, one node knows the action sequence of the other node. Therefore, we do
not lose optimality by setting Ai = (A1i, A2i), where A1i and A2i are the individual
actions.
Proof. We may simply repeat the achievability proof for each receiver, and recognize
that the auxiliary random variable UQ in the converse proof of Appendix C.1 works for both
receivers.
Chapter 5
Compression with actions
5.1 Introduction
Consider an independent, identically distributed (i.i.d.) binary sequence Sn with S ∼ Bern(1/2). From standard source coding theory [19], we need at least one bit per
source symbol to describe the sequence for lossless compression. But suppose now
that we are allowed to make some modifications, subject to cost constraints, to the
sequence before compressing it, and we are only interested in describing the modified
sequence losslessly. The problem then becomes one of choosing the modifications so
that the rate required to describe the modified sequence is reduced, while staying
within our cost constraints. More concretely, for the binary sequence Sn, if we are
allowed to flip more than n/2 ones to zero, then the rate required to describe the
modified sequence is essentially zero. But what happens when we are allowed to flip
fewer than n/2 ones?
As a potentially more practical example, imagine we have a number of robots
working on a factory floor and the positions of all the robots need to be reported to
a remote location. Letting S represent the positions of the robots, we would expect
to send H(S) bits to the remote location. However, this ignores the fact that the
robots can also take actions to change their positions. A local command center can
first “take a picture” of the position sequence and then send out action commands
to the robots based on the picture so that they move in a cooperative way such that
CHAPTER 5. COMPRESSION WITH ACTIONS 38
the final position sequence requires fewer bits to describe. The command center may
face two issues in general: cost constraints and uncertainty. A cost constraint occurs
because each robot should save its power and not move too far away from its current
location. The uncertainty is a result of the robots not moving exactly as instructed
by the local command center.
Motivated by the preceding examples, we consider the problem illustrated in
Fig. 5.1 (Formal definitions will be given in the next section). Sn is our observed
state sequence. We model the constraint as a general cost function Λ(·, ·, ·) and the
uncertainty in the final output Y by a channel p(y|a, s).
Figure 5.1: Compression with actions. The action encoder first observes the state sequence Sn and then generates an action sequence An. The ith output Yi is the output of a channel p(y|a, s) with a = Ai and s = Si. The compressor generates a description M ∈ [1 : 2^{nR}] of Y n. The remote decoder generates Ŷ n, a reconstruction of Y n, based on M and its available side information Zn.
Our problem setup is closely related to the channel coding problem when the
state information is available at the encoder. The case where the state information
is causally available was first solved by Shannon in [20]. When the state information
is non-causally known at the encoder, the channel capacity result was derived in
[11] and [13]. Various interesting extensions can be found in [15, 17, 22–24]. The
difference in our approach described here is that we make the output of the channel
as compressible as possible. We give formal definitions for our problem are given in
the next section. Our main results when the decoder requires lossless reconstruction
are given in section 5.3, where we characterize the rate-cost tradeoff function for the
setting in Fig. 5.1. We also characterize the rate-cost function when Sn is only
causally known at the action encoder. In section 5.4, we extend the setting to the
lossy case where the decoder requires a lossy version of Y n.
5.2 Definitions
We give formal definitions for the setups under consideration in this section. We will
follow the notation of [8]. Sources (Sn, Zn) are assumed to be i.i.d.; i.e., (Sn, Zn) ∼ ∏_{i=1}^{n} pS,Z(si, zi).
5.2.1 Lossless case
Referring to Figure 5.1, a (n, 2nR) code for this setup consists of
• an action encoding function fe : Sn → An;
• a compression function fc : Yn → M ∈ [1 : 2nR];
• a decoding function fd : [1 : 2nR]× Zn → Y n.
The average cost of the system is EΛ(An, Sn, Y n) ≜ (1/n) Σ_{i=1}^{n} EΛ(Ai, Si, Yi). A rate-cost
tuple (R,B) is said to be achievable if there exists a sequence of codes such that

lim sup_{n→∞} Pr(Y n ≠ fd(fc(Y n), Zn)) = 0, (5.1)
lim sup_{n→∞} EΛ(An, Sn, Y n) ≤ B, (5.2)

where Λ(An, Sn, Y n) = Σ_{i=1}^{n} Λ(Ai, Si, Yi)/n. Given cost B, the rate-cost function,
R(B), is then the infimum of rates R such that (R,B) is achievable.
5.2.2 Lossy case
We also consider the setup where the decoder requires a lossy version of Y n. The
definitions remain largely the same, with the exception that the probability of error
constraint, inequality (5.1), is replaced by the following distortion constraint:
lim sup_{n→∞} E d(Y n, Ŷ n) = lim sup_{n→∞} (1/n) Σ_{i=1}^{n} E d(Yi, Ŷi) ≤ D. (5.3)
A rate R is said to be achievable if there exists a sequence of (n, 2nR) codes satisfying
both the cost constraint (inequality 5.2) and the distortion constraint (inequality 5.3).
Given cost B and distortion D, the rate-cost-distortion function, R(B,D), is then the
infimum of rates R such that the tuple (R,B,D) is achievable.
5.2.3 Causal observations of state sequence
In both the lossless and lossy cases, we will also consider the setup where the state
sequence is only causally known at the action encoder. The definitions remain the
same, except for the action encoding function, which is now restricted to the following
form: for each i ∈ [1 : n], fe,i : S^i → A.
5.3 Lossless case
In this section, we present our main results for the lossless case. Theorem 7 gives
the rate-cost function when the state sequence is noncausally available at the action
encoder, while Theorem 8 gives the rate-cost function when the state sequence is
causally available.
5.3.1 Lossless, noncausal compression with action
Theorem 7 (Rate-cost function for lossless, noncausal case). The rate-cost function
for the compression with action setup, when the state sequence Sn is noncausally available
at the action encoder, is given by

R(B) = min_{p(v|s), a=f(s,v): EΛ(S,A,Y) ≤ B} I(V;S|Z) + H(Y|V,Z), (5.4)

where the joint distribution is of the form p(s, v, a, y) = p(s)p(v|s)1{f(s,v)=a}p(y|a, s).
The cardinality of the auxiliary random variable V is upper bounded by |V| ≤ |S| + 2.
Remarks

• Replacing a = f(s, v) by a general distribution p(a|s, v) does not decrease the
minimum in (5.4). For any joint distribution p(s)p(v|s)p(a|s, v), we can always
find a random variable W and a function f such that W is independent of (S, V)
and A = f(V, W, S). Consider V′ = (V, W). The Markov condition
V′ − (A, S) − (Y, Z) still holds. Thus H(Y|V′, Z) + I(V′;S|Z) is achievable.
Furthermore,

I(V′;S|Z) + H(Y|V′,Z) = I(V,W;S|Z) + H(Y|V,W,Z)
≤ I(V,W;S|Z) + H(Y|V,Z)
= I(V;S|Z) + H(Y|V,Z).
• R(B) is a convex function of B.
• For each cost function Λ(s, a, y), we can replace it with a new cost function
involving only s and a by defining Λ′(s, a) = E[Λ(S,A, Y )|S = s, A = a]. Note
that Y is distributed as p(y|s, a) given S = s, A = a.
Achievability of Theorem 7 involves an interesting observation in the decoding operation,
but before proving the theorem, we first state a corollary of Theorem 7, the
case when side information is absent (Z = ∅). We will also sketch an alternative
achievability proof for the corollary, which will serve as a contrast to the achievability
scheme for Theorem 7.

Corollary 4 (Side information is absent). If Z = ∅, then the rate-cost function is given
by

R(B) = min_{p(v|s), a=f(s,v): EΛ(S,A,Y) ≤ B} I(V;S) + H(Y|V)

for some p(s, v, a, y) = p(s)p(v|s)1{f(s,v)=a}p(y|a, s).
Achievability for Corollary 4

Codebook generation: Fix p(v|s), f(s, v) and ǫ > 0.

• Generate 2^{n(I(V;S)+ǫ)} sequences vn(l) independently, l ∈ [1 : 2^{n(I(V;S)+ǫ)}], each according to ∏ pV(vi), to cover Sn.
• For each V n sequence, the Y n sequences that are jointly typical with V n are indexed by 2^{n(H(Y|V)+ǫ)} numbers.
Encoding and Decoding:

• The action encoder looks for a V n in the codebook that is jointly typical with Sn and generates Ai = f(Si, Vi), i = 1, ..., n.
• The compressor looks for a V n in the codebook that is jointly typical with the channel output Y n and sends the index of that V n sequence to the decoder. The compressor then sends the index of Y n as described in the second part of codebook generation.
• The decoder simply uses both indices from the compressor to reconstruct Y n.
Using standard typicality arguments, we can show that the encoding succeeds
with high probability and the probability of error can be made arbitrarily small.
Remark: Note that the V n codeword chosen by the compressor is not necessarily the
same as the V n codeword chosen by the action encoder. But this is not an error event,
since we still recover the same Y n even if a different V n codeword was used.
This scheme, however, does not extend to the case when side information is available
at the decoder. The term H(Y|Z, V) in Theorem 7 requires us to bin the set of
Y n sequences according to the side information available at the decoder. If we were
to extend the above achievability scheme, we would bin the set of Y n sequences into
2^{n(H(Y|Z,V)+ǫ)} bins. The compressor would find a V̂ n sequence that is jointly typical
with Y n, send its index to the decoder using a rate of I(V;S|Z) + ǫ, and then send
the index of the bin which contains Y n. The decoder would then look for the unique
Y n sequence in the bin that is jointly typical with V̂ n and Zn. Unfortunately, while
the V̂ n codeword is jointly typical with Y n with high probability, it is not necessarily
jointly typical with Zn, since V̂ n may not be equal to V n (V n is jointly typical with
Zn with high probability, as V n is jointly typical with Sn with high probability and
V − S − Z holds). One could try to overcome this problem by insisting that the compressor
find the same V n sequence as the action encoder, but this requirement imposes
additional constraints on the achievable rate.
Instead of requiring the compressor to find a jointly typical V n sequence, we use an
alternative approach to prove Theorem 7. We simply bin the set of all Y n sequences into
2^{n(I(V;S|Z)+H(Y|Z,V)+ǫ)} bins and send the bin index to the decoder. The decoder looks
for the unique Y n sequence in bin M such that (V n(l), Y n, Zn) is jointly typical for
some l ∈ [1 : 2^{n(I(V;S)+ǫ)}]. Note that there can be more than one V n(l) sequence which is
jointly typical with (Y n, Zn), but this is not an error event as long as the Y n sequence
in bin M is unique. We now give the details of this achievability scheme.
Proof of achievability for Theorem 7
Codebook generation
• Generate 2^{n(I(V;S)+δ(ǫ))} V n codewords according to ∏_{i=1}^{n} p(vi).
• Bin the entire set of possible Y n sequences uniformly at random into 2^{nR} bins B(M), where R > I(V;S) − I(V;Z) + H(Y|Z,V).
Encoding
• Given sn, the encoder looks for a vn sequence in the codebook such that (vn, sn) ∈ T(n)ǫ. If there is more than one, it randomly picks one from the set of typical sequences. If there is none, it picks a random index from [1 : 2^{n(I(V;S)+δ(ǫ))}].
• It then generates an according to ai = f(vi, si) for i ∈ [1 : n].
• The second encoder (the compressor) takes the output sequence yn and sends out the bin index M such that yn ∈ B(M).
Decoding
• The decoder looks for the unique yn sequence such that (vn(l), yn, zn) ∈ T(n)ǫ for some l ∈ [1 : 2^{n(I(V;S)+δ(ǫ))}] and yn ∈ B(M). If there is none, or more than one, it declares an error.
Analysis of probability of error
Define the following error events:

E0 := {(V n(L), Zn, Y n) /∈ T(n)ǫ},
El := {(V n(l), Zn, Ỹ n) ∈ T(n)ǫ for some Ỹ n ≠ Y n, Ỹ n ∈ B(M)}.

By symmetry of the codebook generation, it suffices to consider M = 1. The
probability of error is upper bounded by

P(E) ≤ P(E0) + Σ_{l=1}^{2^{n(I(V;S)+δ(ǫ))}} P(El).

P(E0) → 0 as n → ∞, following the standard analysis of the probability of error. It remains
to analyze the second error term. Consider P(El) and define

El(V n, Zn) := {(V n(l), Zn, Ỹ n) ∈ T(n)ǫ for some Ỹ n ≠ Y n, Ỹ n ∈ B(1)}.

We have
P(El) = P(El(V n, Zn))
= Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn, Zn = zn) P(El(vn, zn)|vn, zn)
= Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn, Zn = zn) Σ_{yn} P(Y n = yn|vn, zn) P(El(vn, zn)|vn, zn, yn)
(a)≤ Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn, Zn = zn) Σ_{yn} P(Y n = yn|vn, zn) 2^{n(H(Y|Z,V)+δ(ǫ)−R)}
(b)= Σ_{(vn,zn)∈T(n)ǫ} P(V n(l) = vn) P(Zn = zn) 2^{n(H(Y|Z,V)+δ(ǫ)−R)}
≤ 2^{n(H(V,Z)+δ(ǫ))} 2^{−n(H(V)−δ(ǫ))} 2^{−n(H(Z)−δ(ǫ))} 2^{n(H(Y|Z,V)+δ(ǫ)−R)}
= 2^{n(H(Y|V,Z)−I(V;Z)−R+4δ(ǫ))}.

Here (a) follows since the Y n sequences are binned uniformly at random, independently
of the other Y n sequences, and since there are at most 2^{n(H(Y|Z,V)+δ(ǫ))} Ỹ n
sequences which are jointly typical with a given typical (vn, zn). (b) follows from the
fact that the codebook generation is independent of (Sn, Zn); therefore, for any fixed
l, V n(l) is independent of Zn. Hence, if R ≥ I(V;S) − I(V;Z) + H(Y|Z,V) + 6δ(ǫ),

Σ_{l=1}^{2^{n(I(V;S)+δ(ǫ))}} P(El) ≤ 2^{−nδ(ǫ)} → 0

as n → ∞.
We now turn to the proof of the converse for Theorem 7.
Proof of converse for Theorem 7
Given a (n, 2^{nR}) code for which the probability of error goes to zero with n and
which satisfies the cost constraint, define Vi = (Z^{n\i}, S_{i+1}^n, Y^{i−1}). We have
nR ≥ H(M|Zn)
= H(M, Y n|Zn) − H(Y n|M, Zn)
(a)= H(M, Y n|Zn) − nǫn
= H(Y n|Zn) − nǫn
= Σ_{i=1}^{n} H(Yi|Y^{i−1}, Zn) − nǫn
= Σ_{i=1}^{n} [H(Yi|Y^{i−1}, S_{i+1}^n, Zn) + I(Yi; S_{i+1}^n|Y^{i−1}, Zn)] − nǫn
(b)= Σ_{i=1}^{n} H(Yi|Y^{i−1}, S_{i+1}^n, Zn) + Σ_{i=1}^{n} I(Y^{i−1}; Si|S_{i+1}^n, Zn) − nǫn
(c)= Σ_{i=1}^{n} H(Yi|Y^{i−1}, S_{i+1}^n, Zn) + Σ_{i=1}^{n} I(Y^{i−1}, S_{i+1}^n, Z^{n\i}; Si|Zi) − nǫn
(d)= Σ_{i=1}^{n} H(Yi|Vi, Zi) + Σ_{i=1}^{n} I(Vi; Si|Zi) − nǫn
= nH(YQ|VQ, Q, ZQ) + nI(VQ; SQ|Q, ZQ) − nǫn,
where (a) is due to Fano’s inequality. (b) follows from Csiszar sum identity. (c) holds
because (Sn, Zn) is an i.i.d source. Note that the Markov conditions, Vi−(Si, Ai)−Yi
and Vi − Si − Zi hold. Finally, we introduce Q as the time sharing random variable,
i.e., Q ∼ Unif[1, ..., n], and set V = (VQ, Q), Y = YQ and S = SQ, which completes
the proof.
5.3.2 Lossless, causal compression with action
Our next result gives the rate-cost function for the case of lossless, causal compression
with action.
Theorem 8 (Rate-cost function for lossless, causal case). The rate-cost function for
compression with action, when the state information is causally available at the action
encoder, is given by

R(B) = min_{p(v), a=f(s,v): EΛ(S,A,Y) ≤ B} H(Y|V,Z), (5.5)

where the joint distribution is of the form p(s, v, a, y) = p(s)p(v)1{f(s,v)=a}p(y|a, s).
Achievability sketch: Here V simply serves as a time-sharing random
variable. Fix a p(v) and f(s, v). We first generate a V n sequence and reveal it to
the action encoder, the compressor and the decoder. The encoder generates Ai =
f(Si, Vi). The compressor simply bins the set of Y n sequences into 2^{n(H(Y|V,Z)+ǫ)} bins
and sends the index of the bin which contains Y n. The decoder recovers Y n by finding
the unique Y n sequence in bin M such that (V n, Zn, Y n) is jointly typical.
Remark: Just as the achievability in the non-causal case is closely related to the channel coding strategy in [11], the achievability in this section uses the "Shannon strategy" of [20]. In both cases, the optimal channel coding strategy yields the most compressible output as the message rate goes to zero.
Proof of Converse: Given an (n, 2^{nR}) code that satisfies the constraints, define V_i = (S^{i−1}, Z^{n\i}). We have
nR ≥ H(M|Z^n)
   = H(M, Y^n|Z^n) − H(Y^n|M, Z^n)
 (a)= H(M, Y^n|Z^n) − nε_n
   = H(Y^n|Z^n) − nε_n
   = Σ_{i=1}^n H(Y_i|Y^{i−1}, Z_i, Z^{n\i}) − nε_n
   ≥ Σ_{i=1}^n H(Y_i|Y^{i−1}, A^{i−1}, S^{i−1}, Z_i, Z^{n\i}) − nε_n
 (b)= Σ_{i=1}^n H(Y_i|A^{i−1}, S^{i−1}, Z_i, Z^{n\i}) − nε_n
 (c)= Σ_{i=1}^n H(Y_i|V_i, Z_i) − nε_n
 (d)= n H(Y_Q|V_Q, Q, Z_Q) − nε_n
where (a) is due to Fano's inequality; (b) follows from the Markov chain Y_i − (S^{i−1}, A^{i−1}, Z^n) − Y^{i−1}; (c) follows since A^{i−1} is a function of S^{i−1}. Note that A_i is now a function of (S_i, V_i). Finally, in (d) we introduce the time-sharing random variable Q ∼ Unif[1, ..., n]. Setting V = (V_Q, Q) and Y = Y_Q completes the proof.
5.3.3 Examples
In this subsection, we consider an example with state sequence Sn ∼ i.i.d. Bern(1/2)
and Z = ∅. We have two actions available, A = 0 and A = 1. The cost constraint is
on the frequency of action A = 1, EA ≤ B. The channel output Yi = Si ⊕ Ai ⊕ SNi
where ⊕ is the modulo 2 sum and {SNi} are i.i.d. Bern(p) noise, p < 1/2. The
example is illustrated in Fig. 5.2.
[Figure: block diagram of the binary example. The action encoder observes S^n ∼ i.i.d. Bern(1/2) and produces A^n under the cost constraint E A ≤ B; the output Y^n = S^n ⊕ A^n ⊕ S_N^n, with S_N^n ∼ i.i.d. Bern(p), is fed to the compressor, which sends M ∈ {1, ..., 2^{nR}} to the decoder.]
Figure 5.2: Binary example with side information Z = ∅.
We use the following lemma to simplify the optimization problem in Eq. (5.4)
applied to the binary example.
Lemma 6. For the binary example, it is without loss of optimality to have the fol-
lowing constraints when solving the optimization problem of Eq. (5.4):
• V = {0, 1, 2}, P(V = 0) = P(V = 1) = θ/2, for some θ ∈ [0, 1].
• The function a = f(s, v) is of the form: f(s, 0) = s, f(s, 1) = 1 − s and
f(s, 2) = 0.
• P(S = 0|V = 1) = P(S = 1|V = 0) = ∆ and P(S = 0|V = 2) = 1/2.
• ∆θ ≤ B.
Note that the constraints guarantee that P(S = 0) = P(S = 1) = 1/2.
Proof. See Appendix D.1.
Using Lemma 6, we can simplify the objective function in Eq. (5.4) in the following
way:
H(Y|V) + I(V;S)
 = H(Y|V) − H(S|V) + H(S)
 = H(S ⊕ A ⊕ S_N|V) − H(S|V) + 1
 = (θ/2)[H(0 ⊕ S_N|V = 0) − H_2(∆)] + (θ/2)[H(1 ⊕ S_N|V = 1) − H_2(∆)] + (1 − θ)[H(S ⊕ S_N|V = 2) − 1] + 1
 = θ(H_2(p) − H_2(∆)) + 1

where H_2(·) is the binary entropy function, i.e., H_2(δ) = −δ log δ − (1 − δ) log(1 − δ). Hence

R(B) = min_{θ∈[2B,1], θ∆≤B} θ(H_2(p) − H_2(∆)) + 1
     = 1 + min_{∆∈[B,1/2]} (B/∆)(H_2(p) − H_2(∆))
     = 1 − B max_{∆∈[B,1/2]} (H_2(∆) − H_2(p))/∆
     = { 1 − B (H_2(b*) − H_2(p))/b*,  if 0 ≤ B < b*;
         1 − H_2(B) + H_2(p),          if b* ≤ B ≤ 1/2 }        (5.6)
where b* is the solution of the equation

    (H_2(b) − H_2(p))/b = dH_2(b)/db,   b ∈ [0, 1/2],        (5.7)
which is illustrated in Fig. 5.3.
[Figure: plot of H_2(b) versus b for b ∈ [0, 1/2], marking the point (p, H_2(p)) and the threshold b* at which (H_2(b*) − H_2(p))/b* = dH_2/db |_{b=b*}.]
Figure 5.3: The threshold b* solves (H_2(b) − H_2(p))/b = dH_2/db, b ∈ [0, 1/2].
Now let us shift our attention to the causal case of the binary example, i.e., Si is
only causally available at the action encoder.
Lemma 7. For the causal case of the binary example, it is without loss of optimality
to have the following constraints when solving the optimization problem in Eq. (5.5):
• V = {0, 1}, P(V = 0) = θ, for some θ ∈ [0, 1].
• The function a = f(s, v) is of the form: f(s, 0) = s, f(s, 1) = 0.
• θ/2 ≤ B.
Proof. See Appendix D.2.
Using Lemma 7, we can simplify the objective function in Eq. (5.5) in the following
way:
R(B) = min H(Y|V)
     = min_{θ∈[0,1], θ/2≤B} θ H(Y|V = 0) + (1 − θ) H(Y|V = 1)
     = min_{θ∈[0,1], θ/2≤B} θ H(S_N|V = 0) + (1 − θ) H(S ⊕ S_N|V = 1)
     = min_{θ∈[0,1], θ/2≤B} θ H_2(p) + (1 − θ)
     = { 2B H_2(p) + (1 − 2B),  0 ≤ B ≤ 1/2;
         H_2(p),                1/2 ≤ B. }
For the binary example with p = 0.1, we plot the rate-cost function R(B) for both
cases in the following figure.
[Figure: plot of R(B) versus the cost constraint B ∈ [0, 1/2] for p = 0.1, showing the causal and non-causal rate-cost functions; both curves decrease from 1 at B = 0 to H_2(0.1) at B = 1/2, with the non-causal curve below the causal one.]
Figure 5.4: Comparison between the non-causal and causal rate-cost functions. The parameter of the Bernoulli noise is set at 0.1.
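The two closed-form expressions can be compared directly. A self-contained sketch (p = 0.1 as in the figure); since non-causal state information can only help, the non-causal curve should never exceed the causal one:

```python
import math

def H2(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def b_star(p):
    # root of (H2(b) - H2(p))/b = log2((1-b)/b), found by bisection
    g = lambda b: H2(b) - H2(p) - b * math.log2((1 - b) / b)
    lo, hi = p + 1e-9, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return lo

def R_noncausal(B, p):
    # piecewise expression (5.6)
    bs = b_star(p)
    if B < bs:
        return 1 - B * (H2(bs) - H2(p)) / bs
    return 1 - H2(B) + H2(p)

def R_causal(B, p):
    # causal expression: 2B*H2(p) + (1 - 2B) for B <= 1/2
    return 2 * B * H2(p) + (1 - 2 * B) if B <= 0.5 else H2(p)

p = 0.1
gaps = [R_causal(B, p) - R_noncausal(B, p) for B in [i / 100 for i in range(51)]]
print(min(gaps))
```

The minimum gap is zero (attained at B = 0 and B = 1/2, where the two curves meet) and never negative.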
5.4 Lossy compression with actions
In this section, we extend our setup to the lossy case. We give an achievable rate-
cost-distortion region when Sn is available noncausally at the action encoder and
characterize the rate-cost-distortion function when Sn is available causally at the
encoder and Z = ∅.
Theorem 9. An upper bound on the rate-cost-distortion function for the case with non-causal
state information is given by

    R(B, D) ≤ min_{E Λ(S,A,Y)≤B, E d(Y,Ŷ)≤D} I(V;S|Z) + I(Y; Ŷ|V, Z)        (5.8)

where the joint distribution is of the form

    p(s, v, a, y, ŷ, z) = p(s, z) p(v|s) 1_{a=f(s,v)} p(y|a, s) p(ŷ|y, v).
Sketch of achievability: The codebook generation and the encoding at the action encoder are largely the same as in the lossless case. We generate 2^{n(I(V;S)+ε)} V^n sequences according to Π_{i=1}^n p_V(v_i), and for each v^n, generate 2^{n(I(Y;Ŷ|V)+ε)} ŷ^n sequences according to Π_{i=1}^n p(ŷ_i|v_i). The set of v^n sequences is partitioned into 2^{n(I(V;S)−I(V;Z)+2ε)} equal-sized bins B(m_0), and for each m_0, the set of ŷ^n sequences is partitioned into 2^{n(I(Y;Ŷ|V)−I(Ŷ;Z|V)+2ε)} equal-sized bins B(m_0, m_1). Given a sequence s^n, the action encoder finds the v^n sequence that is jointly typical with s^n and takes actions A_i = f(s_i, v_i) for i ∈ [1 : n]. At the compressor, we first find a v^n that is jointly typical with Y^n, and then a ŷ^n such that (v^n, Y^n, ŷ^n) ∈ T_ε^{(n)}. The compressor then sends the indices (M_0, M_1) such that the selected v^n ∈ B(M_0) and ŷ^n ∈ B(M_0, M_1). The decoder first recovers v^n by looking for the unique v^n ∈ B(m_0) such that (v^n, z^n) ∈ T_ε^{(n)}. Next, it recovers ŷ^n by looking for the unique ŷ^n ∈ B(m_0, m_1) such that (v^n, z^n, ŷ^n) ∈ T_ε^{(n)}. From the rates given, it is easy to see that all encoding and decoding steps succeed with high probability as n → ∞.
We now turn to the case when sn is causally known at the action encoder. In
this case, we are able to characterize the rate-cost-distortion function when no side
information is available, Z = ∅.
Theorem 10. The rate-cost-distortion function for the case with causal state information and no side information is given by

    R(B, D) = min_{p(v), a=f(s,v): E Λ(S,A,Y)≤B, E d(Y,Ŷ)≤D} I(Y; Ŷ|V)        (5.9)

where the joint distribution is of the form p(s, v, a, y, ŷ) = p(s) p(v) 1_{a=f(s,v)} p(y|a, s) p(ŷ|y, v).
The achievability is straightforward, with V as a time-sharing random variable known to all parties, and follows an analysis similar to that of Theorem 8.
Converse: Given an (n, 2^{nR}) code satisfying the cost and distortion constraints, we have

nR ≥ H(M)
   ≥ I(M; Ŷ^n)
   = Σ_{i=1}^n I(M; Ŷ_i|Ŷ^{i−1})
 (a)= Σ_{i=1}^n I(M; Ŷ_i|V_i)
 (b)≥ Σ_{i=1}^n I(Y_i; Ŷ_i|V_i)
 (c)= n I(Y_Q; Ŷ_Q|V_Q, Q)
where in (a) we set V_i = Ŷ^{i−1}; (b) holds because Ŷ^n is a function of M. Note that V_i is independent of S_i. In (c) we introduce the time-sharing random variable Q ∼ Unif[1, ..., n]. Thus, by setting V = (V_Q, Q) and Y = Y_Q, we have shown that R(B, D) ≥ I(Y; Ŷ|V) where V ⊥ S. This is equivalent to the expression in the theorem because of the following:

• Replacing p(ŷ|y, v) by a general distribution p(ŷ|a, y, v, s) does not decrease the minimum in (5.9), since the mutual information term I(Y; Ŷ|V) depends only on the marginal distribution p(y, ŷ, v).

• Replacing a = f(s, v) by a general distribution p(a|s, v) does not decrease the minimum in (5.9), because for any joint distribution p(s)p(v)p(a|s, v)p(y|a, s)p(ŷ|y, v), I(Y; Ŷ|V = v) is a concave function of p(y|v), which is a linear function of p(a|s, v).
Chapter 6
Conclusions
In this thesis, we first revisited Gács and Körner's definition of common information.
It is equal to the common randomness that two remote nodes, with access to X and
Y respectively, can generate without communication. The fact that this quantity is
degenerate in most cases motivated us to investigate the initial efficiency of common
randomness generation as the communication rate goes to zero. It turned out that
the initial efficiency is equal to 1/(1 − ρ²(X; Y)), where ρ is the Hirschfeld-Gebelein-Rényi
maximal correlation between X and Y. This result gave the Hirschfeld-Gebelein-Rényi
maximal correlation an operational justification as a measure of commonness between
two random variables, and it also indicated that communication is the key to
unlocking common randomness. We then turned to the saturation efficiency as the
communication exhausts nature's randomness. We provided a sufficient condition for
the saturation efficiency to be 1, which implies continuity of the slope of the common
randomness-rate function at that point. An example was given to show that the slope
is not continuous in general.
In the next part of the thesis, we introduced common randomness generation with
actions, in which a node can take actions to influence the random variables received
from nature. A single-letter expression for the common randomness-rate function was
obtained. We showed through an example that the greedy approach of fixing the
"best" action is not optimal in general when the communication rate is strictly positive.
But as the rate goes to zero, the initial efficiency in the action setting was proved
to be 1/(1 − max_{a∈A} ρ²(X, Y|A = a)), i.e., the reciprocal of one minus the square of the Hirschfeld-Gebelein-Rényi maximal correlation conditioned on the best action. The saturation efficiency with actions was analyzed similarly to the no-action setting.
In the last part of the thesis, we kept the action feature but shifted our focus to
source coding. The idea that one could modify a source, subject to a cost constraint,
before compression was formulated in an information-theoretic setting. Techniques
from both channel coding and source coding were combined to obtain a single-letter
expression for the rate-cost function. In our achievability scheme, modification of the
source sequence is essentially equivalent to setting up cloud centers for the source
sequence, and compression of the modified sequence is carried out via a classic binning
approach. Interestingly, this approach does not require correct decoding of the cloud
center.
Appendix A
Proofs of Chapter 2
A.1 Proof of the convexity of ρ(P_X ⊗ P_{Y|X}) in P_{Y|X}
The inequality holds trivially if any one of the r.v.'s is degenerate (i.e., equal to a constant with probability 1). We exclude this case in the following proof.
Fix arbitrary functions f and g such that E g(X) = E f(Y, Z) = 0, E g²(X) = 1, and E f²(Y, Z) = 1. Without loss of generality we can assume E g(X)f(Y, Z) ≥ 0 (otherwise we consider −f instead of f). Define μ(z) = E[f(Y, Z)|Z = z].
E g(X)f(Y, Z)        (A.1)
 = E g(X)(f(Y, Z) − μ(Z) + μ(Z))
 = E g(X)(f(Y, Z) − μ(Z)) + E g(X)μ(Z)
 (a)= E g(X)(f(Y, Z) − μ(Z)) + E g(X) E μ(Z)
 (b)= E g(X)(f(Y, Z) − μ(Z)),

where (a) is because X ⊥ Z and (b) is due to E g(X) = 0. Define η = sqrt( E f²(Y, Z) / E(f(Y, Z) − μ(Z))² ).
Note that

E(f(Y, Z) − μ(Z))²        (A.2)
 = E_Z E[(f(Y, Z) − μ(Z))² | Z]
 ≤ E_Z E[f²(Y, Z)|Z]
 = E f²(Y, Z).
Thus η ≥ 1. Consider a new function f′ = η[f(Y, Z) − μ(Z)]. Note that E f′(Y, Z) = η[E f(Y, Z) − E μ(Z)] = 0 and E(f′(Y, Z))² = 1. Furthermore, E g(X)f′(Y, Z) = η E g(X)f(Y, Z) ≥ E g(X)f(Y, Z). Thus it is sufficient to consider f with the property that E[f(Y, Z)|Z = z] = 0, which enables us to write the optimization problem for solving ρ(X; Y, Z) in the following equivalent form:

max E g(X)f(Y, Z)        (A.3)
subject to E g(X) = 0,
           E_{Y|Z=z} f(Y, z) = 0 ∀z,
           E g²(X) = E f²(Y, Z) = 1.
Define s_z = sqrt(E[f²(Y, Z)|Z = z]). To simplify the notation, let p_z = P_Z(z) and ρ_z = ρ(P_X ⊗ P_{Y|X,Z=z}). We have the constraint

    Σ_z p_z s_z² = 1.        (A.4)
Note that

max E g(X)f(Y, Z)        (A.5)
 = max E_Z [E[g(X)f(Y, Z)|Z]]
 = max Σ_z p_z E[g(X)f(Y, Z)|Z = z]
 (a)≤ max Σ_z p_z s_z ρ(P_X ⊗ P_{Y|X,Z=z})
 = max Σ_z p_z s_z ρ_z
 (b)≤ sqrt(Σ_z p_z ρ_z²),
where (a) is due to the fact that given Z = z, (X, Y) has joint distribution P_X ⊗ P_{Y|X,Z=z}, and (b) is based on the following argument. Consider the optimization problem with variables s_z, z ∈ Z:

max Σ_z p_z ρ_z s_z
subject to Σ_z p_z s_z² = 1, s_z ≥ 0 ∀z.

Using the method of Lagrange multipliers, we construct L(s, λ) = Σ_z p_z ρ_z s_z − λ Σ_z p_z s_z². Solving ∂L/∂s_z = p_z ρ_z − 2λ p_z s_z = 0, we obtain s_z = ρ_z/(2λ), ∀z ∈ Z. Using the constraint Σ_z p_z s_z² = 1, we have λ = sqrt(Σ_z p_z ρ_z²)/2, which yields sqrt(Σ_z p_z ρ_z²) as the maximum.
This completes the proof of Lemma 1.
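The Lagrange solution above is easy to spot-check numerically: the closed-form optimum sqrt(Σ_z p_z ρ_z²) should dominate every feasible point. A small sketch with arbitrary, made-up values of p_z and ρ_z:

```python
import math
import random

random.seed(0)

p = [0.2, 0.5, 0.3]            # hypothetical p_z
rho = [0.9, 0.4, 0.7]          # hypothetical rho_z values in [0, 1]

# closed-form optimum from the Lagrange argument
opt = math.sqrt(sum(pz * rz**2 for pz, rz in zip(p, rho)))

# compare against random feasible points: s_z >= 0 with sum_z p_z s_z^2 = 1
best_random = 0.0
for _ in range(10_000):
    s = [random.random() for _ in p]
    norm = math.sqrt(sum(pz * sz**2 for pz, sz in zip(p, s)))
    s = [sz / norm for sz in s]                    # project onto the constraint
    best_random = max(best_random, sum(pz * rz * sz for pz, rz, sz in zip(p, rho, s)))

print(opt, best_random)   # best_random approaches opt from below
```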
Appendix B
Proofs of Chapter 3
B.1 Proof of the continuity of C(R) at R = 0
Fix an arbitrary ε > 0. Then

ε ≥ I(X; U) − I(Y; U)        (B.1)
 (a)= I(X; U|Y)
 = Σ_y p(y) I(X; U|Y = y)
 = Σ_y p(y) D(p(x, u|Y = y) || p(x|Y = y) p(u|Y = y)).
Thus D(p(x, u|y) || p(x|y) p(u|y)) ≤ ε / min_{y∈Y} p(y) for all y ∈ Y. Via Pinsker's inequality, Σ_x |p(x) − q(x)| ≤ sqrt(2 ln 2 · D(p||q)), we obtain

Σ_{x,u} |p(x, u|y) − p(x|y) p(u|y)| ≤ ε′        (B.2)
⇒ Σ_{x,u} p(x|y) |p(u|x, y) − p(u|y)| ≤ ε′
⇒ Σ_{x,u} p(x|y) |p(u|x) − p(u|y)| ≤ ε′,

where ε′ = sqrt(2 ln 2 · ε / min_{y∈Y} p(y)) and the last step is due to the Markov chain U − X − Y. Thus for each (x, y) pair such that p(x, y) > 0, we have |p(u|x) − p(u|y)| ≤ δ, where δ = ε′ / min_{(x,y): p(x,y)>0} p(x, y).
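Pinsker's inequality in the form used here (L1 distance versus divergence measured in bits) can be spot-checked numerically on random distribution pairs:

```python
import math
import random

random.seed(0)

def rand_dist(k):
    w = [random.random() + 1e-6 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

violations = 0
for _ in range(1000):
    p, q = rand_dist(5), rand_dist(5)
    l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
    # Pinsker: sum_x |p(x) - q(x)| <= sqrt(2 ln 2 * D(p||q)), with D in bits
    violations += l1 > math.sqrt(2 * math.log(2) * kl_bits(p, q)) + 1e-12
print(violations)   # expect 0
```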
Let V be the maximum common r.v. of p(x, y). There exist deterministic functions g and f such that V = g(X) = f(Y). For each v, pick an arbitrary y* from the y's such that f(y) = v. Thus we create a mapping y* = y*(v).
We claim that if x and y are in the same block, then |p(u|x) − p(u|y)| ≤ δ′, where δ′ = (2|X| + 1)δ. This is due to the fact that if x and y satisfy g(x) = f(y), then there exists a sequence (x, y_1), (x_1, y_1), (x_1, y_2), ..., (x_n, y) such that the probability of each pair is strictly positive [10]. Using the triangle inequality:

|p(u|x) − p(u|y)|
 ≤ |p(u|x) − p(u|y_1)| + |p(u|y_1) − p(u|y)|
 ≤ δ + |p(u|y_1) − p(u|y)|
 ≤ δ + |p(u|x_1) − p(u|y_1)| + |p(u|x_1) − p(u|y)|
 ≤ 2δ + |p(u|x_1) − p(u|y)|
 ...
 ≤ (2n + 1)δ
 ≤ (2|X| + 1)δ
 = δ′.
Consider a new distribution p*(x, y, u) = p(x, y) p(u|y*(f(y))). Note that ||p − p*||_1 goes to zero as ε goes to 0. Therefore lim_{ε→0} |I(X; U|V) − I*(X; U|V)| = 0. Furthermore, under the distribution p*, the Markov chain X − V − U holds. Thus

lim_{ε→0} I(X; U) = lim_{ε→0} I(X; U, V)
 = I(X; V) + lim_{ε→0} I*(X; U|V)
 = I(X; V)
 = H(V).
Appendix C
Proofs of Chapter 4
C.1 Converse proof of Theorem 5
We bound the rate R as follows:
nR ≥ H(M)
   = I(X^n; M)
   = I(X^n; M, Y^n) − I(X^n; Y^n|M)
   = Σ_{i=1}^n I(X_i; M, Y^n|X^{i−1}) − I(X^n; Y^n|M)
 (a)= Σ_{i=1}^n I(X_i; M, Y^n, X^{i−1}) − I(X^n; Y^n|M)
 (b)= Σ_{i=1}^n I(X_i; M, A_i, Y^n, X^{i−1}) − I(X^n; Y^n|M)
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i) + Σ_{i=1}^n I(X_i; Y_i|A_i) − I(X^n; Y^n|M)
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i) + Σ_{i=1}^n [I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M)]
 (c)≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i)
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − Σ_{i=1}^n I(X_i; K|M, Y^n, X^{i−1}, A_i)
 (d)= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − Σ_{i=1}^n I(X_i; K|M, Y^n, X^{i−1})
   = Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − I(X^n; K|M, Y^n)
   ≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − H(K|M, Y^n)
 (e)≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − H(K|K′)
   ≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, X^{i−1}|A_i, Y_i) − H(K|K′)
 (f)= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; U_i|A_i, Y_i) − H(K|K′)
 (g)= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; U_i|A_i) − Σ_{i=1}^n I(Y_i; U_i|A_i) − H(K|K′)
   = Σ_{i=1}^n I(X_i; A_i, U_i) − Σ_{i=1}^n I(Y_i; U_i|A_i) − H(K|K′)
 (h)= n(I(X_Q; A_Q, U_Q|Q) − I(Y_Q; U_Q|A_Q, Q) − H(K|K′)/n)
   = n(I(X_Q; A_Q, U_Q, Q) − I(Y_Q; U_Q|A_Q, Q) − H(K|K′)/n)
   ≥ n(I(X_Q; A_Q, U_Q, Q) − I(Y_Q; U_Q, Q|A_Q) − H(K|K′)/n)
where (a) is because the X_i are i.i.d.; (b) and (d) are due to the fact that A^n is a function of M; and (c) comes from the following chain of inequalities:

I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M)        (C.1)
 = I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M, A^n)
 = H(Y_i|A_i) − H(Y_i|Y^{i−1}, M, A^n) − H(Y_i|X_i, A_i) + H(Y_i|X^n, Y^{i−1}, M, A^n)
 = H(Y_i|A_i) − H(Y_i|Y^{i−1}, M, A^n) − H(Y_i|X_i, A_i) + H(Y_i|X_i, A_i)
 ≥ 0,

where the third equality comes from the Markov chain Y_i − (X_i, A_i) − (X^{n\i}, Y^{i−1}, M, A^{n\i}); (e) is because K′ is a function of M and Y^n; in (f), we set U_i = (K, M, X^{i−1}). Note that U_i − (X_i, A_i) − Y_i, which justifies (g); in (h), we introduce a time-sharing random variable Q, uniformly distributed on {1, ..., n} and independent of (X^n, K, M, Y^n).
We bound the entropy of K as follows:

H(K) (a)= I(X^n; K)        (C.2)
 = Σ_{i=1}^n I(X_i; K|X^{i−1})
 (b)= Σ_{i=1}^n I(X_i; K, X^{i−1})
 ≤ Σ_{i=1}^n I(X_i; U_i)
 = n I(X_Q; U_Q|Q)
 = n I(X_Q; U_Q, Q),

where (a) is due to the fact that K is a function of X^n and (b) holds because the X_i are i.i.d. Set X = X_Q, Y = Y_Q and U = (U_Q, Q), which finishes the proof.
C.2 Proof for initial efficiency with actions
The goal is to prove that

sup_{p(a,u|x)} I(Y; U|A) / (I(X; A) + I(X; U|A)) = max_{a∈A} ρ_m²(X, Y|A = a).

Define

∆_1(P_A) = {p(a, u|x) : Σ_x p(a|x) p_X(x) = P_A(a), ∀a ∈ A},
∆_2(δ) = {p(a, u|x) : I(X; A) + I(X; U|A) ≤ δ}.
That is, ∆_1(P_A) is the set of conditional distributions p(a, u|x) such that the induced marginal distribution of A is P_A, and ∆_2(δ) is the set of conditional distributions p(a, u|x) such that I(X; A) + I(X; U|A) does not exceed δ.
sup_{p(a,u|x)} I(Y; U|A) / (I(X; A) + I(X; U|A))
 = sup_{P_A} sup_{δ≥0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A))
 = sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A)),

where the last step can be proved by the following concavity lemma:
Lemma 8. Fixing an arbitrary marginal distribution P_A, define

f(δ) = sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A).

Then f(δ) is concave in δ.

Proof. Fixing the marginal distribution P_A, consider any p(a_1, u_1|x) ∈ ∆_1(P_A) ∩ ∆_2(δ_1) and p(a_2, u_2|x) ∈ ∆_1(P_A) ∩ ∆_2(δ_2). Construct p(a, u|x) = λ p(a_1, u_1|x) + (1 − λ) p(a_2, u_2|x). Introduce a time-sharing r.v. Q which equals 1 w.p. λ and 2 w.p. 1 − λ. We have

I(X; A, U, Q) = I(X; A, U|Q)
 = λ I(X; A_1, U_1) + (1 − λ) I(X; A_2, U_2)
 ≤ λδ_1 + (1 − λ)δ_2

and

I(Y; U, Q|A) ≥ I(Y; U|A, Q)
 = λ I(Y; U_1|A_1) + (1 − λ) I(Y; U_2|A_2).

Note that (A, U′) is a valid choice, where U′ = [U, Q]. Thus

f(λδ_1 + (1 − λ)δ_2) ≥ λ f(δ_1) + (1 − λ) f(δ_2),

which completes the concavity proof of f.
Note that

I(Y; U|A) / (I(X; A) + I(X; U|A)) ≤ max_{a: P_A(a)>0} I(Y; U|A = a) / I(X; U|A = a)
 ≤ max_{a: P_A(a)>0} ρ²(P_{X|A=a} ⊗ P_{Y|X,A=a}),

where the first inequality is a consequence of [4, Lemma 16.7.1] and the last inequality comes from Lemma 4.
Therefore

sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A))
 ≤ sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} max_{a: P_A(a)>0} ρ²(P_{X|A=a} ⊗ P_{Y|X,A=a})
 (a)= sup_{P_A} max_{a∈A: P_A(a)>0} ρ²(P_X ⊗ P_{Y|X,A=a})
 = max_{a∈A} ρ²(P_X ⊗ P_{Y|X,A=a}),

where (a) can be proved by observing that, for a fixed marginal distribution P_A, δ ↓ 0 implies ||P_X − P_{X|A=a}||_{ℓ1} ↓ 0 for every a ∈ A with P_A(a) > 0, and ρ(P′_X ⊗ P_{Y|X,A=a}) as a function of P′_X is uniformly continuous around P′_X = P_X. This upper bound is actually achievable: we can simply fix the action a that achieves max_{a∈A} ρ²(P_X ⊗ P_{Y|X,A=a}) and use Lemma 4 to complete the proof.
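For finite alphabets, the maximal correlation appearing above can be computed numerically: by Witsenhausen's characterization [28], ρ_m(X; Y) is the second-largest singular value of the matrix Q(x, y) = p(x, y)/sqrt(p(x) p(y)). A self-contained sketch (pure-Python power iteration; the doubly symmetric binary source with crossover q serves as a test case, for which ρ_m = 1 − 2q):

```python
import math
import random

random.seed(0)

def maximal_correlation(joint):
    """HGR maximal correlation of a finite joint pmf, via Witsenhausen's
    characterization: the 2nd singular value of Q(x,y) = p(x,y)/sqrt(p(x)p(y))."""
    nx, ny = len(joint), len(joint[0])
    px = [sum(row) for row in joint]
    py = [sum(joint[x][y] for x in range(nx)) for y in range(ny)]
    # Deflate the known top singular pair (sqrt(px), sqrt(py)), sigma = 1;
    # the largest remaining singular value is rho_m.
    Q = [[joint[x][y] / math.sqrt(px[x] * py[y]) - math.sqrt(px[x] * py[y])
          for y in range(ny)] for x in range(nx)]
    v = [random.random() for _ in range(ny)]
    for _ in range(500):                      # power iteration on Q^T Q
        u = [sum(Q[x][y] * v[y] for y in range(ny)) for x in range(nx)]
        w = [sum(Q[x][y] * u[x] for x in range(nx)) for y in range(ny)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0:
            return 0.0
        v = [t / norm for t in w]
    u = [sum(Q[x][y] * v[y] for y in range(ny)) for x in range(nx)]
    return math.sqrt(sum(t * t for t in u))

# doubly symmetric binary source: X ~ Bern(1/2), Y = X xor N, N ~ Bern(q)
q = 0.1
joint = [[(1 - q) / 2, q / 2], [q / 2, (1 - q) / 2]]
rho = maximal_correlation(joint)
print(rho)   # expect 1 - 2q = 0.8
```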
C.3 Proof of Lemma 5
Set A ⊥ X

By symmetry, it is without loss of optimality to set A = 1. The maximum common r.v. V between X and Y has the following form:

V = 1 if X = 1;  V = 2 if X = 2;  V = 3 if X ∈ {3, 4}.

For any U such that U − X − Y:
I(U; X) − I(U; Y)
 (a)= I(U, V; X) − I(U, V; Y)
 = I(V; X) − I(V; Y) + I(U; X|V) − I(U; Y|V)
 (b)= I(U; X|V) − I(U; Y|V)
 (c)= (1/2)[I(U; X|V = 3) − I(U; Y|V = 3)]
 (d)= (1/2)[I(U; X|V = 3) − (1 − p) I(U; X|V = 3)]
 = (p/2) I(U; X|V = 3),

where (a) is due to Lemma 3; (b) is because V is a deterministic function of X and a deterministic function of Y; (c) is due to the fact that conditioned on V = 1 or V = 2, X = Y; and (d) is because, conditioned on V = 3, Y is an erased version of X.
On the other hand,

I(U; X) = I(U, V; X)
 = I(V; X) + I(U; X|V)
 = H(V) + I(U; X|V)
 = H(V) + (1/2) I(U; X|V = 3)
 = 3/2 + (1/2) I(U; X|V = 3).

Thus the achievable (C, R) pairs when A ⊥ X are of the form C = 3/2 + R/p. Note that 0 ≤ I(U; X|V = 3) ≤ 1, and thus 0 ≤ R ≤ p/2.
Correlate A with X through Fig. 4.3

We construct a r.v. V in the following way to facilitate the proof: V has support set {1, 2, 3} and is a deterministic function of (X, A):

• If A = 1: V = 1 if X = 1; V = 2 if X = 2; V = 3 if X ∈ {3, 4}.
• If A = 2: V = 1 if X = 3; V = 2 if X = 4; V = 3 if X ∈ {1, 2}.

Note that conditioned on A, V is the maximum common r.v. between X and Y. We simply set U = V (this is not optimal in general, but good enough to beat the A ⊥ X choice for some R). The communication rate is

I(X; A) + I(V; X|A) − I(V; Y|A) = I(X; A) + H(V|A) − H(V|A)
 = I(X; A)
 = 1 − H_2(α).

On the other hand, the common randomness generated is

I(X; A, V) = I(X; A) + I(X; V|A)
 = 1 − H_2(α) + H(V|A)
 = 1 − H_2(α) + H(α, (1 − α)/2, (1 − α)/2)
 = 2 − α.
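The last simplification uses the identity H(α, (1−α)/2, (1−α)/2) = H_2(α) + (1 − α), which gives 1 − H_2(α) + H(α, (1−α)/2, (1−α)/2) = 2 − α. A quick numerical check of the identity:

```python
import math

def H(*probs):
    """Shannon entropy in bits of a pmf given as positional probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def H2(a):
    return H(a, 1 - a)

for alpha in [0.1, 0.25, 0.5, 0.9]:
    cr = 1 - H2(alpha) + H(alpha, (1 - alpha) / 2, (1 - alpha) / 2)
    assert abs(cr - (2 - alpha)) < 1e-12
print("identity verified")
```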
Appendix D
Proofs of Chapter 5
D.1 Proof of Lemma 6
Fixing v, the function a = f(s, v) has only four possible forms: a = s, a = 1 − s, a = 0 and a = 1. Thus, we can divide V into four groups:

V_0 = {v : f(s, v) = s},
V_1 = {v : f(s, v) = 1 − s},
V_2 = {v : f(s, v) = 0},
V_3 = {v : f(s, v) = 1}.        (D.1)
First, it is without loss of optimality to set V3 = ∅. That is because for each v ∈ V3,
we can change the function to f(s, v) = 0. The rate I(V ;S) + H(Y |V ) does not
change and the cost EA only decreases.
Rewrite the objective function in the following way:

I(V; S) + H(Y|V) = H(Y|V) − H(S|V) + H(S)
 = H(S ⊕ A ⊕ S_N|V) − H(S|V) + H(S)
 = Σ_{v∈V_0} (H_2(p) − H(S|V = v)) p(v)
 + Σ_{v∈V_1} (H_2(p) − H(S|V = v)) p(v)
 + Σ_{v∈V_2} (H(S ⊕ S_N|V = v) − H(S|V = v)) p(v),        (D.2)

where the last step is obtained by plugging in the actual form of a = f(s, v) for each group of v.
Second, it is sufficient to have |V_0| = 1 and |V_1| = 1. To prove this, let v_1, v_2 ∈ V_0. Note that H(S|V = v) is a concave function of p(s|V = v). Thus if we replace v_1, v_2 by a single v_3 with p(v_3) = p(v_1) + p(v_2) and

p(s|V = v_3) = p(v_1)/(p(v_1) + p(v_2)) · p(s|V = v_1) + p(v_2)/(p(v_1) + p(v_2)) · p(s|V = v_2),

we preserve the distribution of S and the cost E A, but we reduce the first term, Σ_{v∈V_0}(H_2(p) − H(S|V = v)) p(v), in Eq. (D.2). Therefore, we can set V_0 = {0} and V_1 = {1}.

Third, note that for each v ∈ V_2,

H(Y|V = v) − H(S|V = v) = H(S ⊕ A ⊕ S_N|V = v) − H(S|V = v)
 = H(S ⊕ S_N|V = v) − H(S|V = v)
 ≥ 0.
Last, if P(S = 0|V = 0) ≠ P(S = 1|V = 1), consider a new auxiliary random variable V′ with the following distribution:

• V′ ∈ {0, 1, 2}, P(V′ = 0) = P(V′ = 1) = (P(V = 0) + P(V = 1))/2.
• The function a = f(s, v′) is of the form f(s, 0) = s, f(s, 1) = 1 − s and f(s, 2) = 0.
• P(S = 0|V′ = 2) = 1/2 and

P(S = 1|V′ = 0) = P(S = 0|V′ = 1) = [P(S = 1|V = 0) P(V = 0) + P(S = 0|V = 1) P(V = 1)] / [P(V = 0) + P(V = 1)].

Comparing (S, V′) with (S, V), we can check that the cost E A and the distribution of S are preserved, while the objective function is reduced, which completes the proof.
D.2 Proof of Lemma 7
Similar to the proof of Lemma 6, we divide V into V_0, V_1, V_2, V_3. Using the same argument, we can set V_3 = ∅. Rewrite the objective function H(Y|V) in the following way:

H(Y|V) = H(S ⊕ A ⊕ S_N|V)        (D.3)
 = Σ_{v∈V_0} H_2(p) p(v) + Σ_{v∈V_1} H_2(p) p(v) + Σ_{v∈V_2} H(S ⊕ S_N|V = v) p(v)
 = H_2(p) Σ_{v∈V_0 ∪ V_1} p(v) + Σ_{v∈V_2} p(v),

where the last equality holds because V is independent of S, so H(S ⊕ S_N|V = v) = 1. This implies that it is sufficient to consider the case |V_0| = 1, V_1 = ∅ and |V_2| = 1, which completes the proof.
Bibliography
[1] R. Ahlswede and I. Csiszár, "Common Randomness in Information Theory and
Cryptography – Part I: Secret Sharing", IEEE Trans. Inf. Theory, vol. 39, no. 4,
pp. 1121–1132, July 1993.
[2] R. Ahlswede and I. Csiszár, "Common Randomness in Information Theory and
Cryptography – Part II: CR Capacity", IEEE Trans. Inf. Theory, vol. 44, no. 1,
pp. 225–240, January 1998.
[3] R. F. Ahlswede, and J. Korner, “Source coding with side information and a
converse for degraded broadcast channels,” IEEE Trans. Inf. Theory, vol. 21,
no. 6, pp. 629-637, 1975.
[4] T. Cover and J. Thomas, “Elements of Information Theory”, John Wiley&Sons,
2nd Edition, 2006.
[5] I Csiszar and P. Narayan, “Common Randomness and Secret Key Generation
with a Helper”, IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 344–366, March,
2000.
[6] P. Cuff, T. Cover, and H. Permuter, “Coordination capacity,” IEEE Trans. Inf.
Theory, vol. 56, no. 9, pp. 4181–4206, September 2010.
[7] A. Dembo, A. Kagan, and L. A. Shepp, “Remarks on the maximum correlation
coefficient”, Bernoulli, no. 2, pp. 343–350, April 2001.
[8] A. El Gamal, and Y. H. Kim, “Lectures on Network Information Theory,” 2010,
available online at ArXiv: http://arxiv.org/abs/1001.3404.
[9] E. Erkip and T. Cover, “The Effciency of Investment Information”, IEEE Trans.
Inf. Theory, vol. 44, no. 3, pp. 1026–1040, May 1998.
[10] P. Gacs and J. Korner, “Common information is far less than mutual informa-
tion”, Problems of Control and Information Theory, vol. 2, no. 2, pp. 119-162,
1972
[11] S. I. Gelfand and M. S. Pinsker, "Coding for Channel with Random Parameters,"
Probl. Contr. and Inform. Theory, vol. 9, no. 1, pp. 19–31, 1980.
[12] H. Gebelein, “Das statistische problem der Korrelation als variationsund Eigen-
wertproblem und sein Zusammenhang mit der Ausgleichungsrechnung,” Z. fur
angewandte Math. und Mech., vol. 21, pp. 364-379, 1941.
[13] C. Heegard and A. El Gamal, "On the Capacity of Computer Memory with De-
fects," IEEE Trans. Inform. Theory, vol. 29, no. 5, pp. 731–739, September 1983.
[14] H. O. Hirschfeld, “A connection between correlation and contingency,” Proc.
Cambridge Philosophical Soc., vol. 31, pp. 520-524, 1935
[15] Y. H. Kim, A. Sutivong, and T. M. Cover, "State amplification," IEEE Trans.
Inform. Theory, vol. 54, no. 5, pp. 1850–1859, May 2008.
[16] H. O. Lancaster, “ Some properties of the bivariate normal distribution consid-
ered in the form of a contingency table,” Biometrika, 44, pp. 289–292, 1957
[17] Pulkit Grover, Aaron Wagner, and Anant Sahai, “Information Embedding meets
Distributed Control”, IEEE Information Theory Workshop, January 2010 in
Cairo, Egypt.
[18] A. Renyi, “On measures of dependence”, Acta Mathematica Hungarica, vol. 10,
no.3-4, pp.441–451, 1959.
[19] C. Shannon, “A Mathematical Theory of Communication,” Bell System Techni-
cal Journal, Vol. 27, pp. 379-423, 623-656, 1948.
[20] C. Shannon, “Channels with side information at the transmitter,” IBM J. Res.
Develop., Vol. 2, pp. 289-293, 1958.
[21] D. Slepian and J. Wolf, "Noiseless coding of correlated information sources", IEEE
Trans. Inf. Theory, vol. 19, no. 4, pp. 471–480, July 1973.
[22] S. Sigurjonsson, and Y. H. Kim, “On multiple user channels with causal state in-
formation at the transmitters,” in Proceedings of IEEE International Symposium
on Information Theory, Adelaide, Australia, Sep. 2005
[23] A. Sutivong and T. Cover, "Rate vs. Distortion Trade-off for Channels with
State Information", in Proceedings of the 2002 IEEE International Symposium on
Information Theory, Lausanne, Switzerland, June 2002.
[24] O. Sumszyk, and Y. Steinberg, “Information embedding with reversible stego-
text”, in Proceedings of the 2009 IEEE Symposium on Information Theory, Seoul,
Korea, Jun. 2009
[25] N. Tishby, F.C. Pereira, and W. Bialek, “The Information Bottleneck method,”
The 37th annual Allerton Conference on Communication, Control, and Comput-
ing, Sept. 1999, pp. 368-377
[26] S. Verdu, “On channel capacity per unit cost,” IEEE Trans. Inf. Theory, vol.
36, no. 5, pp. 1019–1030, September 1990.
[27] T. Weissman and H. Permuter, “Source Coding with a Side Information ‘Vending
Machine’ ”, IEEE Trans. Inf. Theory, submitted 2009.
[28] H. S. Witsenhausen, “ On sequences of pairs of dependent random variables.”,
SIAM J. APPL. Math. vol. 28, no. 1, January 1975.
[29] A. Wyner, and J. Ziv, “A theorem on the entropy of certain binary sequences
and applications-I,” IEEE Trans. Inf. Theory, vol. 19, no. 6, pp. 769-772, 1973.
[30] A. Wyner and J. Ziv, “The rate distortion function for source coding with side
information at the receiver”, IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 1–10,
1976