Basics of information theory and information complexity
June 1, 2013
Mark Braverman
Princeton University
A tutorial
Part I: Information theory
• Information theory, in its modern form, was introduced in the 1940s to study the problem of transmitting data over physical channels.
[Diagram: Alice and Bob connected by a communication channel.]
Quantifying “information”
• Information is measured in bits.
• The basic notion is Shannon’s entropy.
• The entropy of a random variable is the (typical) number of bits needed to remove the uncertainty of the variable.
• For a discrete variable: $H(X) := \sum_x \Pr[X = x] \log \frac{1}{\Pr[X = x]}$.
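A minimal numeric illustration of this definition (a Python sketch; the helper name `entropy` is ours, not from the slides):

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a distribution given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

print(entropy({"H": 0.5, "T": 0.5}))   # 1.0 bit for a fair coin
print(entropy({"H": 0.9, "T": 0.1}))   # ~0.47 bits for a biased coin
```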
Shannon’s entropy
• Important examples and properties:
– If $X = x$ is a constant, then $H(X) = 0$.
– If $X$ is uniform on a finite set $S$ of possible values, then $H(X) = \log |S|$.
– If $X$ is supported on at most $n$ values, then $H(X) \le \log n$.
– If $Y$ is a random variable determined by $X$, then $H(Y) \le H(X)$.
Conditional entropy
• For two (potentially correlated) variables 𝑋, 𝑌, the conditional entropy of 𝑋 given 𝑌 is the amount of uncertainty left in 𝑋 given 𝑌:
$H(X|Y) := E_{y \sim Y}\,[H(X \mid Y = y)]$.
• One can show $H(XY) = H(Y) + H(X|Y)$.
• This important fact is known as the chain rule.
• If $X \perp Y$, then $H(XY) = H(X) + H(Y|X) = H(X) + H(Y)$.
Example
• $X = (B_1, B_2, B_3)$
• $Y = (B_1 \oplus B_2,\ B_2 \oplus B_4,\ B_3 \oplus B_4,\ B_5)$
• where $B_1, B_2, B_3, B_4, B_5 \in_U \{0,1\}$ are independent uniform bits.
• Then
– $H(X) = 3$; $H(Y) = 4$; $H(XY) = 5$;
– $H(X|Y) = 1 = H(XY) - H(Y)$;
– $H(Y|X) = 2 = H(XY) - H(X)$.
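These numbers can be verified by brute force (a Python sketch; `entropy` and `dist` are our helper names):

```python
import math
from collections import Counter
from itertools import product

def entropy(dist_):
    """Shannon entropy (bits) of a distribution {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in dist_.values() if p > 0)

def dist(samples):
    """Exact distribution obtained by enumerating equally likely outcomes."""
    counts = Counter(samples)
    return {k: v / len(samples) for k, v in counts.items()}

outcomes = []
for b1, b2, b3, b4, b5 in product([0, 1], repeat=5):
    x = (b1, b2, b3)
    y = (b1 ^ b2, b2 ^ b4, b3 ^ b4, b5)
    outcomes.append((x, y))

HX = entropy(dist([x for x, _ in outcomes]))
HY = entropy(dist([y for _, y in outcomes]))
HXY = entropy(dist(outcomes))
print(HX, HY, HXY)            # 3.0 4.0 5.0
print(HXY - HY, HXY - HX)     # H(X|Y) = 1.0, H(Y|X) = 2.0
print(HX + HY - HXY)          # I(X;Y) = 2.0
```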
Mutual information
• $X = (B_1, B_2, B_3)$
• $Y = (B_1 \oplus B_2,\ B_2 \oplus B_4,\ B_3 \oplus B_4,\ B_5)$
[Venn diagram: the regions $H(X)$ and $H(Y)$ overlap; the overlap is the mutual information $I(X;Y)$, and the parts outside the overlap are $H(X|Y)$ and $H(Y|X)$.]
Mutual information
• The mutual information is defined as $I(X; Y) := H(X) - H(X|Y) = H(Y) - H(Y|X)$.
• "By how much does knowing $X$ reduce the entropy of $Y$?"
• Always non-negative: $I(X; Y) \ge 0$.
• Conditional mutual information: $I(X; Y | Z) := H(X|Z) - H(X|YZ)$.
• Chain rule for mutual information: $I(XY; Z) = I(X; Z) + I(Y; Z | X)$.
• Simple intuitive interpretation.
Example – a biased coin
• A coin with an $\varepsilon$ bias towards either Heads or Tails is tossed several times.
• Let $B \in \{H, T\}$ denote the direction of the bias, and suppose that a priori both options are equally likely: $H(B) = 1$.
• How many tosses are needed to find $B$?
• Let $T_1, \dots, T_k$ be a sequence of tosses.
• Start with $k = 2$.
What do we learn about $B$?
• $I(B; T_1 T_2) = I(B; T_1) + I(B; T_2 | T_1)$
$= I(B; T_1) + I(B T_1; T_2) - I(T_1; T_2)$
$\le I(B; T_1) + I(B T_1; T_2)$
$= I(B; T_1) + I(B; T_2) + I(T_1; T_2 | B)$
$= I(B; T_1) + I(B; T_2) = 2 \cdot I(B; T_1)$
(using $I(T_1; T_2 | B) = 0$, since given the bias the tosses are independent, and $I(B; T_2) = I(B; T_1)$ by symmetry).
• Similarly, $I(B; T_1 \dots T_k) \le k \cdot I(B; T_1)$.
• To determine $B$ with constant accuracy, we need $0 < c < I(B; T_1 \dots T_k) \le k \cdot I(B; T_1)$.
• Hence $k = \Omega(1 / I(B; T_1))$.
Kullback–Leibler (KL)-Divergence
• A measure of the distance between two distributions on the same space.
• Plays a key role in information theory.
$D(P \| Q) := \sum_x P[x] \log \frac{P[x]}{Q[x]}$.
• $D(P \| Q) \ge 0$, with equality if and only if $P = Q$.
• Caution: $D(P \| Q) \ne D(Q \| P)$ in general!
Properties of KL-divergence
• Connection to mutual information: $I(X; Y) = E_{y \sim Y}\, D(X_{|Y=y} \| X)$.
• If $X \perp Y$, then $X_{|Y=y} = X$, and both sides are 0.
• Pinsker's inequality:
$\|P - Q\|_1 = O\!\left(\sqrt{D(P \| Q)}\right)$.
• Tight! For biased coins (here $B_p$ denotes a coin with heads-probability $p$):
$D(B_{1/2+\varepsilon} \| B_{1/2}) = \Theta(\varepsilon^2)$.
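A quick numeric check of both facts for biased coins (a Python sketch; the helper names `kl_bits` and `l1` are ours):

```python
import math

def kl_bits(p, q):
    """D(B_p || B_q) in bits, for coins with heads-probabilities p and q."""
    return sum(a * math.log2(a / b) for a, b in ((p, q), (1 - p, 1 - q)) if a > 0)

def l1(p, q):
    """L1 distance between the distributions of the two coins."""
    return abs(p - q) + abs((1 - p) - (1 - q))

for eps in (0.1, 0.01, 0.001):
    d = kl_bits(0.5 + eps, 0.5)
    # d / eps^2 approaches a constant (2 / ln 2), i.e. D = Theta(eps^2);
    # the ratio l1 / sqrt(d) approaches sqrt(2 ln 2) ~ 1.18, so the L1 distance
    # is O(sqrt(KL)), consistent with (and tight for) Pinsker's inequality.
    print(eps, d, d / eps**2, l1(0.5 + eps, 0.5) / math.sqrt(d))
```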
Back to the coin example
• $I(B; T_1) = E_{b \sim B}\, D(T_{1|B=b} \| T_1) = D(B_{1/2 \pm \varepsilon} \| B_{1/2}) = \Theta(\varepsilon^2)$.
• $k = \Omega\!\left(\frac{1}{I(B; T_1)}\right) = \Omega\!\left(\frac{1}{\varepsilon^2}\right)$.
• “Follow the information learned from the coin tosses”
• Can be done using combinatorics, but the information-theoretic language is more natural for expressing what’s going on.
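A short numeric check of this bound (a Python sketch; `info_bias_vs_toss` is our helper name). It computes $I(B; T_1)$ exactly for the uniform prior on the two biases and shows that it scales like $\varepsilon^2$, so $\Omega(1/\varepsilon^2)$ tosses are needed:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def info_bias_vs_toss(eps):
    """I(B; T1) when B is uniform over the two possible biases and T1 is one toss."""
    h_t1 = 1.0                                       # marginally, T1 is a fair coin
    h_t1_given_b = entropy([0.5 + eps, 0.5 - eps])   # same for either value of B
    return h_t1 - h_t1_given_b                       # I(B;T1) = H(T1) - H(T1|B)

for eps in (0.1, 0.03, 0.01):
    i = info_bias_vs_toss(eps)
    print(eps, i, i / eps**2, 1 / i)   # i/eps^2 ~ 2/ln 2; ~1/eps^2 tosses needed
```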
Back to communication
• The reason information theory is so important for communication is that information-theoretic quantities readily operationalize.
• Can attach operational meaning to Shannon’s entropy: 𝐻 𝑋 ≈ “the cost of transmitting 𝑋”.
• Let 𝐶 𝑋 be the (expected) cost of transmitting a sample of 𝑋.
$H(X) = C(X)$?
• Not quite.
• Let $T \in_U \{1, 2, 3\}$ be a uniformly random trit.
• $C(T) = 5/3 \approx 1.67$, using the code 1 → 0, 2 → 10, 3 → 11.
• $H(T) = \log 3 \approx 1.58$.
• It is always the case that $C(X) \ge H(X)$.
But 𝐻 𝑋 and 𝐶(𝑋) are close
• Huffman coding: $C(X) \le H(X) + 1$.
• This is a compression result: an uninformative message is turned into a short one.
• Therefore: $H(X) \le C(X) \le H(X) + 1$.
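A compact Huffman construction for the trit example (a Python sketch; `huffman_lengths` is our helper name, and it returns the expected codeword length rather than the code itself):

```python
import heapq, math

def huffman_lengths(probs):
    """Expected codeword length of a Huffman code for the given probabilities."""
    # Heap entries: (probability, unique id, list of (symbol, depth so far)).
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, l1 = heapq.heappop(heap)
        p2, _, l2 = heapq.heappop(heap)
        merged = [(s, d + 1) for s, d in l1 + l2]   # merging adds one bit of depth
        heapq.heappush(heap, (p1 + p2, uid, merged))
        uid += 1
    _, _, leaves = heap[0]
    return sum(probs[s] * d for s, d in leaves)

trit = [1/3, 1/3, 1/3]
print(huffman_lengths(trit), math.log2(3))   # 5/3 ~ 1.667 vs H(T) ~ 1.585
```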
Shannon's noiseless coding
• The cost of communicating many copies of $X$ scales as $H(X)$.
• Shannon's source coding theorem:
– Let $C(X^n)$ be the cost of transmitting $n$ independent copies of $X$. Then the amortized transmission cost
$\lim_{n\to\infty} C(X^n)/n = H(X)$.
• This equation gives 𝐻(𝑋) operational meaning.
$H(X)$ operationalized
[Diagram: $X_1, \dots, X_n, \dots$ are transmitted over the communication channel at a cost of $H(X)$ per copy.]
𝐻(𝑋) is nicer than 𝐶(𝑋)
• 𝐻 𝑋 is additive for independent variables.
• Let $T_1, T_2 \in_U \{1, 2, 3\}$ be independent trits.
• $H(T_1 T_2) = \log 9 = 2 \log 3$.
• $C(T_1 T_2) = 29/9 < C(T_1) + C(T_2) = 2 \times 5/3 = 30/9$.
• Works well with concepts such as channel capacity.
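Reusing the `huffman_lengths` sketch from the earlier Huffman example (our helper, assumed to be in scope) shows the per-symbol cost dropping toward $H(T) = \log 3$ once we code pairs:

```python
# Two independent uniform trits: 9 equally likely pairs.
pair_of_trits = [1/9] * 9
cost_per_pair = huffman_lengths(pair_of_trits)
print(cost_per_pair)        # 29/9 ~ 3.22 < 2 * 5/3 ~ 3.33
print(cost_per_pair / 2)    # ~1.61 per trit, approaching H(T) ~ 1.585
```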
“Proof” of Shannon’s noiseless coding
• $n \cdot H(X) = H(X^n) \le C(X^n) \le H(X^n) + 1$
(the equality is additivity of entropy; the upper bound is compression via Huffman).
• Therefore $\lim_{n\to\infty} C(X^n)/n = H(X)$.
Operationalizing other quantities
• Conditional entropy $H(X|Y)$ (cf. the Slepian–Wolf theorem):
[Diagram: Alice holds $X_1, \dots, X_n, \dots$, Bob holds $Y_1, \dots, Y_n, \dots$; transmitting the $X$'s to Bob over the channel costs $H(X|Y)$ per copy.]
Operationalizing other quantities
• Mutual information $I(X;Y)$:
[Diagram: Alice holds $X_1, \dots, X_n, \dots$, Bob holds $Y_1, \dots, Y_n, \dots$; sampling the $Y$'s over the channel costs $I(X;Y)$ per copy.]
Information theory and entropy
• Allows us to formalize intuitive notions.
• Operationalized in the context of one-way transmission and related problems.
• Has nice properties (additivity, chain rule…)
• Next, we discuss extensions to more interesting communication scenarios.
Communication complexity
• Focus on the two party randomized setting.
[Diagram: Alice holds $X$, Bob holds $Y$, and they share randomness $R$; A and B implement a functionality $F(X, Y)$, e.g. $F(X, Y) =$ "$X = Y$?".]
Communication complexity
[Diagram: Alice holds $X$, Bob holds $Y$, and they share randomness $R$.]
Goal: implement a functionality $F(X, Y)$. A protocol $\pi(X, Y)$ computing $F(X, Y)$ exchanges messages
$m_1(X, R)$, $m_2(Y, m_1, R)$, $m_3(X, m_1, m_2, R)$, …
Communication cost = number of bits exchanged.
Communication complexity
• Numerous applications/potential applications (some will be discussed later today).
• Obtaining lower bounds is considerably more difficult than in the transmission setting (but still much easier than in other models of computation!).
Communication complexity
• (Distributional) communication complexity with input distribution $\mu$ and error $\varepsilon$: $CC(F, \mu, \varepsilon)$. The error must be $\le \varepsilon$ w.r.t. $\mu$.
• (Randomized/worst-case) communication complexity: $CC(F, \varepsilon)$. The error must be $\le \varepsilon$ on all inputs.
• Yao's minimax:
$CC(F, \varepsilon) = \max_\mu CC(F, \mu, \varepsilon)$.
Examples
• $X, Y \in \{0,1\}^n$.
• Equality: $EQ(X, Y) := 1_{X=Y}$.
• $CC(EQ, \varepsilon) \approx \log \frac{1}{\varepsilon}$.
• $CC(EQ, 0) \approx n$.
Equality
• $F$ is "$X = Y$?".
• $\mu$ is a distribution where w.p. ½ $X = Y$ and w.p. ½ $(X, Y)$ are random.
[Protocol: Alice sends MD5(X) [128 bits]; Bob replies "X = Y?" [1 bit].]
• Shows that $CC(EQ, \mu, 2^{-129}) \le 129$.
• Error?
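A sketch of this protocol in Python (the function name is ours; the error comes from hash collisions):

```python
import hashlib, os

def equality_protocol(x: bytes, y: bytes) -> bool:
    """One-round equality protocol: Alice sends a 128-bit hash, Bob answers with 1 bit."""
    alice_msg = hashlib.md5(x).digest()          # 128 bits from Alice
    return hashlib.md5(y).digest() == alice_msg  # Bob's 1-bit answer; errs only on a collision

x = os.urandom(32)
print(equality_protocol(x, x))               # True
print(equality_protocol(x, os.urandom(32)))  # False, except with probability ~2^-128
```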
Examples
• $X, Y \in \{0,1\}^n$.
• Inner product: $IP(X, Y) := \sum_i X_i \cdot Y_i \pmod 2$.
• $CC(IP, 0) = n - o(n)$.
In fact, using information complexity:
• $CC(IP, \varepsilon) = n - o_\varepsilon(n)$.
Information complexity
• Information complexity $IC(F, \varepsilon)$ is to
communication complexity $CC(F, \varepsilon)$
as
• Shannon's entropy $H(X)$ is to
transmission cost $C(X)$.
Information complexity
• The smallest amount of information Alice and Bob need to exchange to solve 𝐹.
• How is information measured?
• Communication cost of a protocol?
– Number of bits exchanged.
• Information cost of a protocol?
– Amount of information revealed.
Basic definition 1: The information cost of a protocol
• Prior distribution: $(X, Y) \sim \mu$.
[Diagram: Alice (input $X$) and Bob (input $Y$) run a protocol $\pi$, producing a transcript $\Pi$.]
$IC(\pi, \mu) = I(\Pi; Y|X) + I(\Pi; X|Y)$
= (what Alice learns about $Y$) + (what Bob learns about $X$).
Example
• $F$ is "$X = Y$?".
• $\mu$ is a distribution where w.p. ½ $X = Y$ and w.p. ½ $(X, Y)$ are random.
[Protocol: Alice sends MD5(X) [128 bits]; Bob replies "X = Y?" [1 bit].]
$IC(\pi, \mu) = I(\Pi; Y|X) + I(\Pi; X|Y) \approx 1 + 64.5 = 65.5$ bits
= (what Alice learns about $Y$) + (what Bob learns about $X$).
Prior 𝜇 matters a lot for information cost!
• If $\mu = 1_{(x,y)}$ is a singleton distribution, then
$IC(\pi, \mu) = 0$.
Example
• $F$ is "$X = Y$?".
• $\mu$ is a distribution where $(X, Y)$ are just uniformly random.
[Protocol: Alice sends MD5(X) [128 bits]; Bob replies "X = Y?" [1 bit].]
$IC(\pi, \mu) = I(\Pi; Y|X) + I(\Pi; X|Y) \approx 0 + 128 = 128$ bits
= (what Alice learns about $Y$) + (what Bob learns about $X$).
Basic definition 2: Information complexity
• Communication complexity:
$CC(F, \mu, \varepsilon) := \min_{\pi \text{ computes } F \text{ with error } \le \varepsilon} |\pi|$,
where $|\pi|$ denotes the communication cost of $\pi$.
• Analogously:
$IC(F, \mu, \varepsilon) := \inf_{\pi \text{ computes } F \text{ with error } \le \varepsilon} IC(\pi, \mu)$.
(Here an infimum is needed!)
Prior-free information complexity
• Using a minimax argument one can get rid of the prior.
• For communication, we had:
$CC(F, \varepsilon) = \max_\mu CC(F, \mu, \varepsilon)$.
• For information:
$IC(F, \varepsilon) := \inf_{\pi \text{ computes } F \text{ with error } \le \varepsilon} \max_\mu IC(\pi, \mu)$.
Ex: The information complexity of Equality
• What is 𝐼𝐶(𝐸𝑄, 0)?
• Consider the following protocol.
[Protocol: Alice holds $X \in \{0,1\}^n$, Bob holds $Y \in \{0,1\}^n$. Using public randomness they pick a non-singular matrix $A \in \mathbb{Z}_2^{n \times n}$. Alice sends $A_1 \cdot X$, Bob replies $A_1 \cdot Y$, Alice sends $A_2 \cdot X$, Bob replies $A_2 \cdot Y$, and so on. Continue for $n$ steps, or until a disagreement is discovered.]
Analysis (sketch)
• If X≠Y, the protocol will terminate in O(1) rounds on average, and thus reveal O(1) information.
• If X=Y… the players only learn the fact that X=Y (≤1 bit of information).
• Thus the protocol has O(1) information complexity for any prior 𝜇.
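A small simulation of this behaviour (a Python sketch; it works over $\mathbb{F}_2$ and draws the random rows independently rather than enforcing a non-singular matrix, which only matters for the zero-error guarantee after $n$ rounds):

```python
import random

def equality_rounds(x, y, rng):
    """Run the random-linear-combination equality protocol; return #rounds used."""
    n = len(x)
    for r in range(1, n + 1):
        row = [rng.randint(0, 1) for _ in range(n)]       # public random row A_r
        ax = sum(a * b for a, b in zip(row, x)) % 2        # Alice sends A_r . X
        ay = sum(a * b for a, b in zip(row, y)) % 2        # Bob sends  A_r . Y
        if ax != ay:
            return r                                       # disagreement: X != Y
    return n                                               # presumed equal

rng = random.Random(0)
n = 64
x = [rng.randint(0, 1) for _ in range(n)]
y = x[:]
y[0] ^= 1                                                  # X and Y differ in one bit
runs = [equality_rounds(x, y, rng) for _ in range(10000)]
print(sum(runs) / len(runs))   # ~2 rounds on average when X != Y
```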
Operationalizing IC: Information equals amortized communication
• Recall [Shannon]: $\lim_{n\to\infty} C(X^n)/n = H(X)$.
• Turns out: $\lim_{n\to\infty} CC(F^n, \mu^n, \varepsilon)/n = IC(F, \mu, \varepsilon)$,
for $\varepsilon > 0$. [Error $\varepsilon$ allowed on each copy.]
• For $\varepsilon = 0$: $\lim_{n\to\infty} CC(F^n, \mu^n, 0^+)/n = IC(F, \mu, 0)$.
• [$\lim_{n\to\infty} CC(F^n, \mu^n, 0)/n$ is an interesting open problem.]
Information = amortized communication
• $\lim_{n\to\infty} CC(F^n, \mu^n, \varepsilon)/n = IC(F, \mu, \varepsilon)$.
• Two directions: "≤" and "≥".
• $n \cdot H(X) = H(X^n) \le C(X^n) \le H(X^n) + 1$
(additivity of entropy; compression via Huffman).
The “≤” direction
• $\lim_{n\to\infty} CC(F^n, \mu^n, \varepsilon)/n \le IC(F, \mu, \varepsilon)$.
• Start with a protocol 𝜋 solving 𝐹, whose 𝐼𝐶 𝜋, 𝜇 is close to 𝐼𝐶 𝐹, 𝜇, 𝜀 .
• Show how to compress many copies of 𝜋 into a protocol whose communication cost is close to its information cost.
• More on compression later.
The “≥” direction
• $\lim_{n\to\infty} CC(F^n, \mu^n, \varepsilon)/n \ge IC(F, \mu, \varepsilon)$.
• Use the fact that $\frac{CC(F^n, \mu^n, \varepsilon)}{n} \ge \frac{IC(F^n, \mu^n, \varepsilon)}{n}$.
• Additivity of information complexity:
$\frac{IC(F^n, \mu^n, \varepsilon)}{n} = IC(F, \mu, \varepsilon)$.
Proof: Additivity of information complexity
• Let 𝑇1(𝑋1, 𝑌1) and 𝑇2(𝑋2, 𝑌2) be two two-party tasks.
• E.g. “Solve 𝐹(𝑋, 𝑌) with error ≤ 𝜀 w.r.t. 𝜇”
• Then
$IC(T_1 \times T_2, \mu_1 \times \mu_2) = IC(T_1, \mu_1) + IC(T_2, \mu_2)$.
• "≤" is easy.
• "≥" is the interesting direction:
$IC(T_1, \mu_1) + IC(T_2, \mu_2) \le IC(T_1 \times T_2, \mu_1 \times \mu_2)$
• Start from a protocol 𝜋 for 𝑇1 × 𝑇2 with prior 𝜇1 × 𝜇2, whose information cost is 𝐼.
• Show how to construct two protocols 𝜋1 for 𝑇1 with prior 𝜇1 and 𝜋2 for 𝑇2 with prior 𝜇2, with information costs 𝐼1 and 𝐼2, respectively, such that 𝐼1 + 𝐼2 = 𝐼.
Splitting a protocol $\pi((X_1, X_2), (Y_1, Y_2))$:
$\pi_1(X_1, Y_1)$:
• Publicly sample $X_2 \sim \mu_2$.
• Bob privately samples $Y_2 \sim \mu_2|_{X_2}$.
• Run $\pi((X_1, X_2), (Y_1, Y_2))$.
$\pi_2(X_2, Y_2)$:
• Publicly sample $Y_1 \sim \mu_1$.
• Alice privately samples $X_1 \sim \mu_1|_{Y_1}$.
• Run $\pi((X_1, X_2), (Y_1, Y_2))$.
Analysis – $\pi_1$
(Recall: in $\pi_1(X_1, Y_1)$, publicly sample $X_2 \sim \mu_2$, Bob privately samples $Y_2 \sim \mu_2|_{X_2}$, then run $\pi((X_1, X_2), (Y_1, Y_2))$.)
• Alice learns about $Y_1$: $I(\Pi; Y_1 | X_1 X_2)$.
• Bob learns about $X_1$: $I(\Pi; X_1 | Y_1 Y_2 X_2)$.
• $I_1 = I(\Pi; Y_1 | X_1 X_2) + I(\Pi; X_1 | Y_1 Y_2 X_2)$.
Analysis – $\pi_2$
(Recall: in $\pi_2(X_2, Y_2)$, publicly sample $Y_1 \sim \mu_1$, Alice privately samples $X_1 \sim \mu_1|_{Y_1}$, then run $\pi((X_1, X_2), (Y_1, Y_2))$.)
• Alice learns about $Y_2$: $I(\Pi; Y_2 | X_1 X_2 Y_1)$.
• Bob learns about $X_2$: $I(\Pi; X_2 | Y_1 Y_2)$.
• $I_2 = I(\Pi; Y_2 | X_1 X_2 Y_1) + I(\Pi; X_2 | Y_1 Y_2)$.
Adding 𝐼1 and 𝐼2
$I_1 + I_2 = I(\Pi; Y_1 | X_1 X_2) + I(\Pi; X_1 | Y_1 Y_2 X_2) + I(\Pi; Y_2 | X_1 X_2 Y_1) + I(\Pi; X_2 | Y_1 Y_2)$
$= [I(\Pi; Y_1 | X_1 X_2) + I(\Pi; Y_2 | X_1 X_2 Y_1)] + [I(\Pi; X_2 | Y_1 Y_2) + I(\Pi; X_1 | Y_1 Y_2 X_2)]$
$= I(\Pi; Y_1 Y_2 | X_1 X_2) + I(\Pi; X_2 X_1 | Y_1 Y_2) = I$
(the last step uses the chain rule for mutual information).
Summary
• Information complexity is additive.
• Operationalized via “Information = amortized communication”.
• $\lim_{n\to\infty} CC(F^n, \mu^n, \varepsilon)/n = IC(F, \mu, \varepsilon)$.
• Seems to be the “right” analogue of entropy for interactive computation.
Entropy vs. Information Complexity
                    Entropy                        IC
Additive?           Yes                            Yes
Operationalized     lim_{n→∞} C(X^n)/n             lim_{n→∞} CC(F^n, μ^n, ε)/n
Compression?        Huffman: C(X) ≤ H(X) + 1       ???!
Can interactive communication be compressed?
• Is it true that $CC(F, \mu, \varepsilon) \le IC(F, \mu, \varepsilon) + O(1)$?
• Less ambitiously:
$CC(F, \mu, O(\varepsilon)) = O(IC(F, \mu, \varepsilon))$?
• (Almost) equivalently: Given a protocol 𝜋 with 𝐼𝐶 𝜋, 𝜇 = 𝐼, can Alice and Bob simulate 𝜋 using 𝑂 𝐼 communication?
• Not known in general…
Direct sum theorems
• Let $F$ be any functionality.
• Let $C(F)$ be the cost of implementing $F$.
• Let $F^n$ be the functionality of implementing $n$ independent copies of $F$.
• The direct sum problem:
"Does $C(F^n) \approx n \cdot C(F)$?"
• In most cases it is obvious that $C(F^n) \le n \cdot C(F)$.
Direct sum – randomized communication complexity
• Is it true that $CC(F^n, \mu^n, \varepsilon) = \Omega(n \cdot CC(F, \mu, \varepsilon))$?
• Is it true that $CC(F^n, \varepsilon) = \Omega(n \cdot CC(F, \varepsilon))$?
Direct product – randomized communication complexity
• Direct sum:
$CC(F^n, \mu^n, \varepsilon) = \Omega(n \cdot CC(F, \mu, \varepsilon))$?
• Direct product:
$CC(F^n, \mu^n, (1 - \varepsilon)^n) = \Omega(n \cdot CC(F, \mu, \varepsilon))$?
Direct sum for randomized CC and interactive compression
Direct sum:
• $CC(F^n, \mu^n, \varepsilon) = \Omega(n \cdot CC(F, \mu, \varepsilon))$?
In the limit:
• $n \cdot IC(F, \mu, \varepsilon) = \Omega(n \cdot CC(F, \mu, \varepsilon))$?
Interactive compression:
• $CC(F, \mu, \varepsilon) = O(IC(F, \mu, \varepsilon))$?
Same question!
The big picture
[Diagram: the four quantities $CC(F, \mu, \varepsilon)$, $CC(F^n, \mu^n, \varepsilon)/n$, $IC(F, \mu, \varepsilon)$, and $IC(F^n, \mu^n, \varepsilon)/n$, connected by the labels "additivity (= direct sum) for information", "information = amortized communication", "direct sum for communication?", and "interactive compression?".]
Current results for compression
A protocol $\pi$ that has $C$ bits of communication, conveys $I$ bits of information over prior $\mu$, and works in $r$ rounds can be simulated:
• using $\tilde{O}(I + r)$ bits of communication;
• using $\tilde{O}(\sqrt{I \cdot C})$ bits of communication;
• using $2^{O(I)}$ bits of communication;
• if $\mu = \mu_X \times \mu_Y$, then using $O(I \cdot \mathrm{polylog}(C))$ bits of communication.
Their direct sum counterparts
• $CC(F^n, \mu^n, \varepsilon) = \tilde{\Omega}(n^{1/2} \cdot CC(F, \mu, \varepsilon))$.
• $CC(F^n, \varepsilon) = \tilde{\Omega}(n^{1/2} \cdot CC(F, \varepsilon))$.
For product distributions $\mu = \mu_X \times \mu_Y$,
• $CC(F^n, \mu^n, \varepsilon) = \tilde{\Omega}(n \cdot CC(F, \mu, \varepsilon))$.
When the number of rounds is bounded by 𝑟 ≪ 𝑛, a direct sum theorem holds.
Direct product
• The best one can hope for is a statement of the type:
$CC(F^n, \mu^n, 1 - 2^{-O(n)}) = \Omega(n \cdot IC(F, \mu, 1/3))$.
• Can prove:
$CC(F^n, \mu^n, 1 - 2^{-O(n)}) = \tilde{\Omega}(n^{1/2} \cdot CC(F, \mu, 1/3))$.
Proof 2: Compressing a one-round protocol
• Say Alice speaks: $IC(\pi, \mu) = I(M; X | Y)$.
• Recall KL-divergence: $I(M; X | Y) = E_{XY}\, D(M_{XY} \| M_Y) = E_{XY}\, D(M_X \| M_Y)$
(the message $M$ depends only on $X$, so $M_{XY} = M_X$).
• Bottom line:
– Alice has $M_X$; Bob has $M_Y$;
– Goal: sample from $M_X$ using $\sim D(M_X \| M_Y)$ communication.
The dart board
[Figure: the "dart board" — the universe $U$ on the horizontal axis and $[0,1]$ on the vertical axis, filled with shared random darts $(u_i, q_i)$.]
• Interpret the public randomness as random points in $U \times [0,1]$, where $U$ is the universe of all possible messages.
• The first message under the histogram of $M$ is distributed $\sim M$.
Proof idea
• Sample using $O(\log 1/\varepsilon + D(M_X \| M_Y))$ communication, with statistical error $\varepsilon$.
[Figure sequence: the public randomness provides $\sim |U|$ darts. Alice marks the first dart under the histogram of $M_X$ (say $u_4$); the first dart under $M_Y$ may be different (say $u_2$). Alice sends hashes $h_1(u_4), h_2(u_4)$ of her sample; Bob looks for a dart under his histogram consistent with the hashes. If none matches, Bob doubles his histogram ($M_Y$, $2 M_Y$, $4 M_Y$, …) while Alice sends further hashes $h_3(u_4), h_4(u_4), \dots, h_{\log 1/\varepsilon}(u_4)$, until a unique consistent candidate remains.]
Analysis
• If $M_X(u_4) \approx 2^k M_Y(u_4)$, then the protocol will reach round $k$ of doubling.
• There will be $\approx 2^k$ candidates.
• About $k + \log 1/\varepsilon$ hashes are needed to narrow them down to one.
• The contribution of $u_4$ to the cost:
– $M_X(u_4) \left( \log \frac{M_X(u_4)}{M_Y(u_4)} + \log \frac{1}{\varepsilon} \right)$.
Done! Recall $D(M_X \| M_Y) := \sum_u M_X(u) \log \frac{M_X(u)}{M_Y(u)}$.
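The statistics behind this analysis can be simulated (a Python sketch; it models only the dart board and the doubling rounds, not the actual hash exchange, and the helper names `kl` and `one_run` are ours). Each run finds Alice's sample — the first dart under $M_X$ — and the number of doublings needed before Bob's scaled histogram $2^k M_Y$ covers that dart; on average this count is on the order of $D(M_X \| M_Y)$:

```python
import math, random
from collections import Counter

def kl(MX, MY):
    """D(MX || MY) in bits."""
    return sum(p * math.log2(p / MY[u]) for u, p in MX.items() if p > 0)

def one_run(MX, MY, rng):
    """Return (Alice's sample, #doublings until Bob's scaled histogram covers it)."""
    support = list(MX)
    while True:
        u, q = rng.choice(support), rng.random()   # one shared dart in U x [0,1]
        if q <= MX[u]:                             # first dart under Alice's histogram
            k = math.ceil(math.log2(q / MY[u])) if q > MY[u] else 0
            return u, k

MX = {"a": 0.7, "b": 0.2, "c": 0.1}
MY = {"a": 0.1, "b": 0.2, "c": 0.7}
rng = random.Random(1)
runs = [one_run(MX, MY, rng) for _ in range(20000)]
print(Counter(u for u, _ in runs))                      # samples are distributed ~ MX
print(sum(k for _, k in runs) / len(runs), kl(MX, MY))  # avg #doublings vs D(MX || MY)
```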
External information cost
• $(X, Y) \sim \mu$.
[Diagram: Alice (input $X$) and Bob (input $Y$) run a protocol $\pi$ with transcript $\Pi$; an external observer Charlie watches the transcript.]
$IC^{ext}(\pi, \mu) = I(\Pi; XY)$
= what Charlie learns about $(X, Y)$.
Example
• $F$ is "$X = Y$?".
• $\mu$ is a distribution where w.p. ½ $X = Y$ and w.p. ½ $(X, Y)$ are random.
[Protocol: Alice sends MD5(X); Bob replies "X = Y?".]
$IC^{ext}(\pi, \mu) = I(\Pi; XY) = 129$ bits
= what Charlie learns about $(X, Y)$.
External information cost
• It is always the case that $IC^{ext}(\pi, \mu) \ge IC(\pi, \mu)$.
• If $\mu = \mu_X \times \mu_Y$ is a product distribution, then
$IC^{ext}(\pi, \mu) = IC(\pi, \mu)$.
External information complexity
• $IC^{ext}(F, \mu, \varepsilon) := \inf_{\pi \text{ computes } F \text{ with error } \le \varepsilon} IC^{ext}(\pi, \mu)$.
• Can it be operationalized?
Operational meaning of 𝐼𝐶𝑒𝑥𝑡?
• Conjecture: Zero-error communication scales like external information:
$\lim_{n\to\infty} \frac{CC(F^n, \mu^n, 0)}{n} = IC^{ext}(F, \mu, 0)$?
• Recall:
$\lim_{n\to\infty} \frac{CC(F^n, \mu^n, 0^+)}{n} = IC(F, \mu, 0)$.
Example – transmission with a strong prior
• $X, Y \in \{0, 1\}$.
• $\mu$ is such that $X \in_U \{0,1\}$, and $X = Y$ with very high probability (say $1 - 1/\sqrt{n}$).
• $F(X, Y) = X$ is just the "transmit $X$" function.
• Clearly, $\pi$ should just have Alice send $X$ to Bob.
• $IC(F, \mu, 0) = IC(\pi, \mu) = H(1/\sqrt{n}) = o(1)$.
• $IC^{ext}(F, \mu, 0) = IC^{ext}(\pi, \mu) = 1$.
Example – transmission with a strong prior
• $IC(F, \mu, 0) = IC(\pi, \mu) = H(1/\sqrt{n}) = o(1)$.
• $IC^{ext}(F, \mu, 0) = IC^{ext}(\pi, \mu) = 1$.
• $CC(F^n, \mu^n, 0^+) = o(n)$.
• $CC(F^n, \mu^n, 0) = \Omega(n)$.
Other examples, e.g. the two-bit AND function, fit into this picture.
Additional directions
• Information complexity
• Interactive coding
• Information theory in TCS
Interactive coding theory
• So far we have focused the discussion on noiseless coding.
• What if the channel has noise?
• [What kind of noise?]
• In the non-interactive case, each channel has a capacity 𝐶.
Channel capacity
• The amortized number of channel uses needed to send $X$ over a noisy channel of capacity $C$ is $H(X)/C$.
• Decouples the task from the channel!
Example: Binary Symmetric Channel
• Each bit gets independently flipped with probability 𝜀 < 1/2.
• One-way capacity: $1 - H(\varepsilon)$.
[Diagram: the binary symmetric channel — each input bit 0/1 passes through unchanged with probability $1 - \varepsilon$ and is flipped with probability $\varepsilon$.]
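A two-line numeric check of this formula (a Python sketch; `bsc_capacity` is our name):

```python
import math

def bsc_capacity(eps):
    """Capacity 1 - H(eps) of the binary symmetric channel, in bits per use."""
    h = -(eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps)) if 0 < eps < 1 else 0.0
    return 1 - h

print(bsc_capacity(0.11))   # ~0.5: at 11% noise, about half a bit gets through per use
```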
Interactive channel capacity
• Not clear one can decouple channel from task in such a clean way.
• Capacity much harder to calculate/reason about.
• Example: Binary symmetric channel.
• One-way capacity: $1 - H(\varepsilon)$.
• Interactive capacity (for simple pointer jumping, [Kol-Raz'13]):
$1 - \Theta\!\left(\sqrt{H(\varepsilon)}\right)$.
[Diagram: the binary symmetric channel, as above.]
Information theory in communication complexity and beyond
• A natural extension would be to multi-party communication complexity.
• Some success in the number-in-hand case.
• What about the number-on-forehead?
• Explicit bounds for $\ge \log n$ players would imply explicit $ACC^0$ circuit lower bounds.
Naïve multi-party information cost
$IC(\pi, \mu) = I(\Pi; X | YZ) + I(\Pi; Y | XZ) + I(\Pi; Z | XY)$
[Diagram (number-on-forehead): player A sees $Y, Z$; player B sees $X, Z$; player C sees $X, Y$.]
Naïve multi-party information cost
$IC(\pi, \mu) = I(\Pi; X | YZ) + I(\Pi; Y | XZ) + I(\Pi; Z | XY)$
• Doesn't seem to work.
• Secure multi-party computation [Ben-Or, Goldwasser, Wigderson] means that anything can be computed at near-zero information cost.
• However, these constructions require the players to share private channels/randomness.
Communication and beyond…
• The rest of today:
– Data structures;
– Streaming;
– Distributed computing;
– Privacy.
• Exact communication complexity bounds.
• Extended formulations lower bounds.
• Parallel repetition?
• …
Thank You!
Open problem: Computability of IC
• Given the truth table of $F(X, Y)$, the distribution $\mu$, and $\varepsilon$, compute $IC(F, \mu, \varepsilon)$.
• Via $IC(F, \mu, \varepsilon) = \lim_{n\to\infty} CC(F^n, \mu^n, \varepsilon)/n$ one can compute a sequence of upper bounds.
• But the rate of convergence as a function of 𝑛 is unknown.
Open problem: Computability of IC
• One can compute the $r$-round information complexity $IC_r(F, \mu, \varepsilon)$ of $F$.
• But the rate of convergence as a function of 𝑟 is unknown.
• Conjecture:
$IC_r(F, \mu, \varepsilon) - IC(F, \mu, \varepsilon) = O_{F,\mu,\varepsilon}\!\left(\frac{1}{r^2}\right)$.
• This is the relationship for the two-bit AND.