
EE514a Information Theory I Fall Quarter 2019


EE514a – Information Theory I, Fall Quarter 2019

Prof. Jeff Bilmes

University of Washington, Seattle, Department of Electrical & Computer Engineering

Fall Quarter, 2019
https://class.ece.uw.edu/514/bilmes/ee514_fall_2019/

Lecture 10 - Oct 30th, 2019


Class Road Map - IT-I

L1 (9/25): Overview, Communications, Information, Entropy

L2 (9/30): Entropy, Mutual Information, KL-Divergence

L3 (10/2): More KL, Jensen, more Venn, Log Sum, Data Proc. Inequality

L4 (10/7): Data Proc. Ineq., thermodynamics, Stats, Fano

L5 (10/9): M. of Conv, AEP

L6 (10/14): AEP, Source Coding, Types

LX (10/16): Makeup

L7 (10/21): Types, Univ. Src Coding, Stoc. Procs, Entropy Rates

L8 (10/23): Entropy rates, HMMs, Coding

L9 (10/28): Kraft ineq., Shannon Codes, Kraft ineq. II, Huffman

L10 (10/30): Huffman, Shannon/Fano/Elias

L11 (11/4):

LXX (11/6): In class midterm exam

L12 (11/11): Veterans Day (Makeup lecture)

L13 (11/13):

L14 (11/18):

L15 (11/20):

L16 (11/25):

L17 (11/27):

L18 (12/2):

L19 (12/4):

LXX (12/10): Final exam

Finals Week: December 9th–13th.


Cumulative Outstanding Reading

Read chapters 1 and 2 in our book (Cover & Thomas, “Information Theory”) (including Fano’s inequality).

Read chapters 3 and 4 in our book (Cover & Thomas, “Information Theory”).

Read sections 11.1 through 11.3 in our book (Cover & Thomas, “Information Theory”).

Read chapter 4 in our book (Cover & Thomas, “Information Theory”).


Homework

Homework 1 on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), was due Tuesday, Oct 8th, 11:55pm.

Homework 2 on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), due Friday 10/18/2019, 11:45pm.

Homework 3 on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), due Tuesday 10/29/2019, 11:45pm.


Kraft inequality

Theorem 10.2.1 (Kraft inequality)

For any instantaneous code (prefix code) over an alphabet of size $D$, the codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ must satisfy
$$\sum_i D^{-\ell_i} \le 1 \qquad (10.1)$$
Conversely, given a set of codeword lengths satisfying the above inequality, there exists an instantaneous code with these word lengths.

Note: the converse says there exists a code with these lengths, not that all codes with these lengths will satisfy the inequality.

Key point: for $\ell_i$ satisfying Kraft, no further restriction is imposed by also wanting a prefix code, so we might as well use a prefix code (assuming it is easy to find given the lengths).

Connects code existence to a mathematical property on lengths!

Given Kraft lengths, we can construct an instantaneous code (as we will see). Given lengths, we can compute $E[\ell]$ and compare with $H$.
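The converse construction can be made concrete. Below is a minimal sketch (not the lecture's own code; the function names are illustrative) that checks the Kraft sum and builds a canonical prefix code from any set of lengths satisfying it, assigning codewords in order of increasing length.

```python
# Minimal sketch (not from the lecture): check the Kraft sum and build a
# canonical prefix code from lengths that satisfy the Kraft inequality.

def kraft_sum(lengths, D=2):
    """Return sum_i D^{-l_i}; an instantaneous code with these lengths exists iff this is <= 1."""
    return sum(D ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths, D=2):
    """Assign codewords in order of increasing length (canonical code construction)."""
    assert kraft_sum(lengths, D) <= 1 + 1e-12, "Kraft inequality violated"
    digits = "0123456789"[:D]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        value *= D ** (lengths[i] - prev_len)      # pad with zeros up to the new length
        v, word = value, []
        for _ in range(lengths[i]):                # write `value` in base D, width lengths[i]
            word.append(digits[v % D]); v //= D
        codes[i] = "".join(reversed(word))
        value += 1                                 # next codeword at this length
        prev_len = lengths[i]
    return codes

print(kraft_sum([2, 2, 2, 3, 3]))                  # 1.0, so a prefix code exists
print(prefix_code_from_lengths([2, 2, 2, 3, 3]))   # ['00', '01', '10', '110', '111']
```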


Towards Optimal Codes

Summarizing: Prefix code ⇔ Kraft inequality.

Thus, we need only find lengths that satisfy Kraft to find a prefix code.

Goal: find a prefix code with minimum expected length
$$L(C) = \sum_i p_i \ell_i \qquad (10.5)$$

This is a constrained optimization problem:
$$\underset{\{\ell_{1:m}\} \in \mathbb{Z}_{++}^m}{\text{minimize}} \;\; \sum_i p_i \ell_i \qquad (10.6)$$
$$\text{subject to} \;\; \sum_i D^{-\ell_i} \le 1$$

This integer program is an NP-complete optimization, not likely to be efficiently solvable (unless P=NP).


Towards Optimal Codes

Relax the integer constraints on $\ell_i$ for now, and consider the Lagrangian
$$J = \sum_i p_i \ell_i + \lambda \Big( \sum_i D^{-\ell_i} - 1 \Big) \qquad (10.5)$$

Taking derivatives and setting to 0,
$$\frac{\partial J}{\partial \ell_i} = p_i - \lambda D^{-\ell_i} \ln D = 0 \qquad (10.6)$$
$$\Rightarrow\; D^{-\ell_i} = \frac{p_i}{\lambda \ln D} \qquad (10.7)$$
$$\frac{\partial J}{\partial \lambda} = \sum_i D^{-\ell_i} - 1 = 0 \;\Rightarrow\; \lambda = 1/\ln D \qquad (10.8)$$
$$\Rightarrow\; D^{-\ell_i} = p_i, \quad\text{yielding}\quad \ell_i^* = -\log_D p_i \qquad (10.9)$$
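As a quick numeric sanity check of the relaxed solution (a sketch, not part of the lecture): for a dyadic distribution the relaxed lengths $\ell_i^* = -\log_2 p_i$ are already integers, the Kraft sum is exactly 1, and the expected length equals the entropy.

```python
import math

# Quick check (not from the slides): for the dyadic distribution [1/2, 1/4, 1/8, 1/8]
# the relaxed optimum l*_i = -log2 p_i is integral, Kraft holds with equality,
# and E[l*] equals the entropy H2(X).
p = [0.5, 0.25, 0.125, 0.125]
l_star = [-math.log2(pi) for pi in p]            # [1.0, 2.0, 3.0, 3.0]
kraft = sum(2 ** (-l) for l in l_star)           # 1.0
expected_len = sum(pi * li for pi, li in zip(p, l_star))
entropy = -sum(pi * math.log2(pi) for pi in p)
print(l_star, kraft, expected_len, entropy)      # E[l*] == H == 1.75
```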


Optimal Code Lengths

Theorem 10.2.2

Entropy is the minimum expected length. That is, the expected length $L$ of any instantaneous $D$-ary code (which thus satisfies the Kraft inequality) for a r.v. $X$ is such that
$$L \ge H_D(X) \qquad (10.6)$$
with equality iff $D^{-\ell_i} = p_i$.


Optimal Code Lengths

. . . Proof of Theorem 10.2.2.

So we have that $L \ge H_D(X)$.

Equality, $L = H$, is achieved iff $p_i = D^{-\ell_i}$ for all $i$ $\Leftrightarrow$ $-\log_D p_i$ is an integer . . .

. . . in which case $c = \sum_i D^{-\ell_i} = 1$.

Definition 10.2.2 (D-adic)

A probability distribution is called $D$-adic w.r.t. $D$ if each of the probabilities is $= D^{-n}$ for some integer $n$.

Ex: when $D = 2$, the distribution $[\tfrac12, \tfrac14, \tfrac18, \tfrac18] = [2^{-1}, 2^{-2}, 2^{-3}, 2^{-3}]$ is 2-adic.

Thus, we have equality above iff the distribution is appropriately $D$-adic.


Shannon Codes

$L - H = D(p\|r) + \log_D 1/c$, with $c = \sum_i D^{-\ell_i}$.

Thus, to produce a code, we find the closest (in the KL sense) $D$-adic distribution w.r.t. $D$ to $p$ and then construct the code as in the proof of the Kraft inequality converse.

In general, however, unless P=NP, it is hard to find the KL-closest $D$-adic distribution (integer programming problem).

Shannon codes: consider $\ell_i = \lceil \log_D 1/p_i \rceil$ as the code lengths. Then
$$\sum_i D^{-\ell_i} = \sum_i D^{-\lceil \log_D 1/p_i \rceil} \le \sum_i D^{-\log_D 1/p_i} = \sum_i p_i = 1$$

This means the Kraft inequality holds for these lengths, so there is a prefix code (if the lengths were too short there might be a problem, but we're rounding up).

Also, we have a bound on lengths in terms of real numbers:
$$\log_D \frac{1}{p_i} \le \ell_i < \log_D \frac{1}{p_i} + 1 \qquad (10.12)$$
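A short sketch (assumed code, not from the slides) that computes the Shannon lengths $\lceil \log_2 1/p_i \rceil$ for the seven-symbol distribution used later in this lecture, and checks both the Kraft sum and the bound $H \le E\ell < H + 1$:

```python
import math

# Sketch: Shannon code lengths l_i = ceil(log2 1/p_i), checked against the
# Kraft inequality and the bound H(X) <= E[l] < H(X) + 1.
def shannon_lengths(p, D=2):
    return [math.ceil(math.log(1.0 / pi, D)) for pi in p]

p = [0.01, 0.24, 0.05, 0.20, 0.47, 0.01, 0.02]     # the a..g example used later
lengths = shannon_lengths(p)
kraft = sum(2 ** (-l) for l in lengths)
H = -sum(pi * math.log2(pi) for pi in p)
EL = sum(pi * li for pi, li in zip(p, lengths))
print(lengths, kraft)            # Kraft sum <= 1, so a prefix code exists
print(H, EL, H + 1)              # H <= E[l] < H + 1
```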


How bad is one bit?

How bad is this overhead?

Depends on $H$. Efficiency of a code:
$$0 \le \text{Efficiency} \triangleq \frac{H_D(X)}{E\ell(X)} \le 1 \qquad (10.14)$$

If $E\ell(X) = H_D(X) + 1$, then efficiency $\to 1$ as $H(X) \to \infty$.

Efficiency $\to 0$ as $H(X) \to 0$, so the entropy would need to be very large for this to be good.

For small alphabets (or low-entropy distributions, such as close-to-deterministic distributions), it is impossible to have good efficiency. E.g., if $X$ is binary (values in $\{0, 1\}$), then $\max H(X) = 1$, so the best possible efficiency is 50%.


Improving efficiency

Such symbol codes are inherently disadvantaged, unless their distributions are $D$-adic.

We can reduce overhead (improve efficiency) by coding more than one symbol at a time (a block code, or a vector code; the symbol is the vector).

Let $L_n$ be the expected per-symbol length when encoding $n$ symbols $x_{1:n}$:
$$L_n = \frac{1}{n} \sum_{x_{1:n}} p(x_{1:n})\, \ell(x_{1:n}) = \frac{1}{n} E\ell(x_{1:n}) \qquad (10.14)$$

Let's use Shannon coding lengths: weighting
$$\log 1/p_i \le \ell_i < \log 1/p_i + 1$$
by $p_i$ and summing over $i$ (10.15) gives
$$H(X_1, \ldots, X_n) \le E\ell(X_{1:n}) < H(X_1, \ldots, X_n) + 1 \qquad (10.16)$$
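A small sketch illustrating the block-coding bound (the three-symbol distribution here is an illustration, not from the slides): Shannon-coding blocks of $n$ i.i.d. symbols gives a per-symbol expected length $L_n$ with $H(X) \le L_n < H(X) + 1/n$.

```python
import math
from itertools import product

# Sketch: per-symbol expected length L_n when Shannon-coding blocks of n
# i.i.d. symbols. The overhead above H(X) is at most 1/n bit per symbol.
p = {"a": 0.6, "b": 0.3, "c": 0.1}                 # illustrative distribution, not from the slides
H = -sum(v * math.log2(v) for v in p.values())

for n in (1, 2, 4, 8):
    block_probs = [math.prod(p[s] for s in blk) for blk in product(p, repeat=n)]
    L_n = sum(q * math.ceil(math.log2(1.0 / q)) for q in block_probs) / n
    print(n, round(L_n, 4), "vs H =", round(H, 4))  # H <= L_n < H + 1/n
```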


Coding with the wrong distribution

Theorem 10.2.4

Expected length under $p(x)$ of a code with $\ell(x) = \lceil \log 1/q(x) \rceil$ satisfies
$$H(p) + D(p\|q) \le E_p \ell(X) \le H(p) + D(p\|q) + 1 \qquad (10.22)$$

The l.h.s. is the best we can do when coding with the wrong distribution $q$ while the true distribution is $p$.
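To see the role of $D(p\|q)$ concretely, here is a toy sketch (the distributions are illustrative, not from the lecture): design Shannon lengths for $q$ and measure the expected length under the true $p$.

```python
import math

# Sketch: design Shannon lengths for the wrong distribution q, then measure
# the expected length under the true distribution p.
p = [0.5, 0.25, 0.125, 0.125]        # illustrative, not from the lecture
q = [0.25, 0.25, 0.25, 0.25]

lengths_q = [math.ceil(math.log2(1.0 / qi)) for qi in q]
E_p_len = sum(pi * li for pi, li in zip(p, lengths_q))
H_p = -sum(pi * math.log2(pi) for pi in p)
D_pq = sum(pi * math.log2(pi / qi) for pi in p)
# H(p) + D(p||q) <= E_p[l] <= H(p) + D(p||q) + 1
print(E_p_len, H_p + D_pq, H_p + D_pq + 1)   # 2.0, 2.0, 3.0 here
```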


Kraft revisited

We proved the Kraft inequality is true for instantaneous codes (and vice versa). Could it be true for all uniquely decodable codes? Could the larger class of codes have shorter expected codeword lengths? Since it is larger, we might (naïvely) expect that we could do better.

Theorem 10.2.4

Codeword lengths of any uniquely decodable code (not nec. instantaneous) must satisfy the Kraft inequality $\sum_i D^{-\ell_i} \le 1$. Conversely, given a set of codeword lengths that satisfy Kraft, it is possible to construct a uniquely decodable code.

Proof.

The converse we already saw before (given lengths, we can construct a prefix code, which is thus uniquely decodable). Thus we only need to prove the first part. . . .


Huffman coding

A procedure for finding shortest expected length prefix code.

You've probably encountered it in computer science classes (a classic algorithm).

Here we analyze it armed with the tools of information theory.

Quest: given a $p(x)$, find a code (bit strings and set of lengths) that is as short as possible, and also an instantaneous code (prefix free).

We could do this greedily: start at the top and split the potential codewords into even probabilities (i.e., asking the question with highest entropy).

This is similar to the game of 20 questions. We have a set of objects, w.l.o.g. the set $S = \{1, 2, 3, 4, \ldots, m\}$, that occur with frequency proportional to non-negative $(w_1, w_2, \ldots, w_m)$.

We wish to determine an object from this class asking as few questions as possible.

Supposing $X \in S$, each question can take the form "Is $X \in A$?" for some $A \subseteq S$.


20 Questions

Question tree. $S = \{x_1, x_2, x_3, x_4, x_5\}$.

(Figure: a yes/no question tree over $S$; each internal node asks a question of the form "Is $X \in A$?", with Y/N branches leading to the leaves $x_1, \ldots, x_5$, whose probabilities are 0.3, 0.2, 0.2, 0.15, 0.15.)

How do we construct such a tree? Charles Sanders Peirce, 1901, said: "Thus twenty skillful hypotheses will ascertain what two hundred thousand stupid ones might fail to do. The secret of the business lies in the caution which breaks a hypothesis up into its smallest logical components, and only risks one of them at a time."


The Greedy Method for Finding a Code

Suggests a greedy method. “Do next whatever currently looks best.”

Consider the following table:

    symbol   a     b     c     d     e     f     g
    p        0.01  0.24  0.05  0.20  0.47  0.01  0.02

The question that looks best would infer the most about the distribution, i.e., the one with the largest entropy.

$H(X|Y_1) = H(X,Y_1) - H(Y_1) = H(X) - H(Y_1)$, so choosing a question $Y_1$ with large entropy leads to the least "residual" uncertainty $H(X|Y_1)$ about $X$.

Equivalently, we choose the question $Y_1$ with the greatest mutual information about $X$, since in this case $I(Y_1;X) = H(X) - H(X|Y_1) = H(Y_1)$.

Again, questions take the form "Is $X \in A$?" for some $A \subseteq S$, so choosing a yes/no (binary) question means choosing the set $A$.


The Greedy Method

We'll use greedy, and choose the question (set) with the greatest entropy.

If we consider the partition $\{a, b, c, d, e, f, g\} = \{a, b, c, d\} \cup \{e, f, g\}$, the question "Is $X \in \{e, f, g\}$?" would have maximum entropy since $p(X \in \{a, b, c, d\}) = p(X \in \{e, f, g\}) = 0.5$.

This question corresponds to the random variable $Y_1 = 1_{\{X \in \{e,f,g\}\}}$, so $H(Y_1) = 1$ and this would be considered a good question (as good as it gets for a binary r.v.).

Since $H(X|Y_2, Y_1) = H(X,Y_2|Y_1) - H(Y_2|Y_1) = H(X|Y_1) - H(Y_2|Y_1) = H(X) - H(Y_2|Y_1) - H(Y_1)$, we greedily find the next question $Y_2$ that has maximum conditional entropy $H(Y_2|Y_1)$ to minimize the remaining uncertainty about $X$.

Hence, the next question depends on the outcome of the first, and we have either $Y_1 = 0$ ($\equiv X \in \{a, b, c, d\}$) or $Y_1 = 1$ ($\equiv X \in \{e, f, g\}$).
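Below is a rough sketch of this top-down greedy strategy (not the lecture's code; ties between equally balanced splits are broken arbitrarily, so the resulting tree need not match the particular tree drawn on the following slides). It exhaustively searches for the most balanced, i.e. maximum-entropy, yes/no split at each node, which is feasible for a 7-symbol alphabet.

```python
import math
from itertools import combinations

# Sketch of top-down greedy splitting: at each node pick the subset A whose
# probability is closest to half the node's mass (the maximum-entropy question),
# then recurse on both sides. Exhaustive subset search is fine for tiny alphabets.
def greedy_code(symbols, p, prefix=""):
    if len(symbols) == 1:
        return {symbols[0]: prefix or "0"}
    total = sum(p[s] for s in symbols)
    best, best_gap = None, float("inf")
    for r in range(1, len(symbols)):
        for A in combinations(symbols, r):
            gap = abs(sum(p[s] for s in A) - total / 2)
            if gap < best_gap:
                best, best_gap = set(A), gap
    left = [s for s in symbols if s in best]
    right = [s for s in symbols if s not in best]
    code = {}
    code.update(greedy_code(left, p, prefix + "0"))
    code.update(greedy_code(right, p, prefix + "1"))
    return code

p = {"a": 0.01, "b": 0.24, "c": 0.05, "d": 0.20, "e": 0.47, "f": 0.01, "g": 0.02}
code = greedy_code(list(p), p)
EL = sum(p[s] * len(code[s]) for s in p)
H = -sum(pi * math.log2(pi) for pi in p.values())
print(code)
print(EL, H)   # E[l] >= H = 1.9323; the particular greedy tree in the slides gives 2.53
```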


The Greedy Tree

If $Y_1 = 0$ then we can split to maximize entropy as follows: partition $\{a, b, c, d\} = \{a, b\} \cup \{c, d\}$ since $p(\{a, b\}) = p(\{c, d\}) = 1/4$.

This question corresponds to the random variable $Y_2 = 1_{\{X \in \{c,d\}\}}$, so $H(Y_2|Y_1 = 0) = 1$ and this would also be considered a good question (as good as it gets).

If $Y_1 = 1$, then we need to partition the set $\{e, f, g\}$. We can do this in one of three ways:

    case             I                II               III
    split            ({e}, {f, g})    ({e, f}, {g})    ({e, g}, {f})
    prob             (0.47, 0.03)     (0.48, 0.02)     (0.49, 0.01)
    H(Y2|Y1 = 1)     0.3274           0.2423           0.1414

Thus, we would choose case I for $Y_2$ since that is the maximum entropy question. Thus, we get $H(Y_2|Y_1 = 1) = 0.3274$.

Recall: $H(X|Y_2, Y_1) = H(X,Y_2|Y_1) - H(Y_2|Y_1) = H(X|Y_1) - H(Y_2|Y_1) = H(X) - H(Y_2|Y_1) - H(Y_1)$.


The Greedy Tree

Once we get to sets of size 2, we only have one possible question. The greedy strategy always chooses what currently looks best, ignoring the future. Later questions must live with what is available.

Summarizing all questions/splits, and their conditional entropies:

    set                    split                    probabilities   conditional entropy
    {a, b, c, d, e, f, g}  {a, b, c, d}, {e, f, g}  (0.5, 0.5)      H(Y1) = 1
    {a, b, c, d}           {a, b}, {c, d}           (0.25, 0.25)    H(Y2|Y1 = 0) = 1
    {e, f, g}              {e}, {f, g}              (0.47, 0.03)    H(Y2|Y1 = 1) = 0.3274
    {a, b}                 {a}, {b}                 (0.01, 0.24)    H(Y3|Y2 = 0, Y1 = 0) = 0.2423
    {c, d}                 {c}, {d}                 (0.05, 0.20)    H(Y3|Y2 = 1, Y1 = 0) = 0.7219
    {e}                    {e}                      (0.47)          H(Y3|Y2 = 0, Y1 = 1) = 0.0
    {f, g}                 {f}, {g}                 (0.01, 0.02)    H(Y3|Y2 = 1, Y1 = 1) = 0.9183

Also note, $H(X) = H(Y_1, Y_2, Y_3) = 1.9323$, and recall
$$H(Y_1,Y_2,Y_3) = H(Y_1) + H(Y_2|Y_1) + H(Y_3|Y_1,Y_2) \qquad (10.1)$$
$$= H(Y_1) + \sum_{i\in\{0,1\}} H(Y_2|Y_1=i)\, p(Y_1=i) \qquad (10.2)$$
$$\quad + \sum_{i,j\in\{0,1\}} H(Y_3|Y_1=i, Y_2=j)\, p(Y_1=i, Y_2=j)$$


The Greedy Tree

    symbol   a     b     c     d     e     f     g
    p        0.01  0.24  0.05  0.20  0.47  0.01  0.02

This leads to the following (top-down greedily constructed) tree:

(Figure: the greedy question tree; a, b, c, d, f, g receive 3-bit codewords and e receives a 2-bit codeword, e.g. a = 000, b = 001, c = 010, d = 011, e = 10, f = 110, g = 111.)

The expected length of this code is $E\ell = 2.5300$.

Entropy: $H = 1.9323$.

Code efficiency: $H/E\ell = 1.9323/2.5300 = 0.7638$.

Can we do better?


The Greedy Tree vs. Huffman Tree

Left is greedy, right is Huffman.

(Figure: the greedy tree from the previous slide next to the Huffman tree; the Huffman codeword lengths are e: 1, b: 2, d: 3, c: 4, g: 5, a: 6, f: 6, e.g. e = 1, b = 01, d = 001, c = 0001, g = 00001, a = 000000, f = 000001.)

The Huffman lengths have $E\ell_{\text{huffman}} = 1.9700$.

Efficiency of the Huffman code: $H/E\ell_{\text{huffman}} = 1.9323/1.9700 = 0.9809$.

Key problem: the greedy procedure is not optimal in this case.


Greedy

Why is starting from the top and splitting like this non-optimal? Where can it go wrong?

Ex: There may be many ways to get a ≈ 50% split (to achieve high entropy); once done, the split is irrevocable and there is no way to know if the consequences of that split might hurt down the line.


Huffman

The Huffman code tree procedure:

1. Take the two least probable symbols in the alphabet.
2. These two will be given the longest codewords, will have equal length, and will differ in the last digit.
3. Combine these two symbols into a joint symbol having probability equal to the sum, add the joint symbol and then remove the two symbols, and repeat.

Note that it is bottom up (agglomerative clustering) rather than top down (greedy splitting); a code sketch follows below.
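Here is a compact sketch of this bottom-up procedure using a heap (my own illustration, not the lecture's code). Codeword bits are prepended as merges happen, so the last merge contributes the first bit. Tie-breaking may produce different codewords than the slides, but the lengths, and hence the expected length, are what matter.

```python
import heapq
import math

# Sketch of the bottom-up Huffman procedure described above: repeatedly merge
# the two least probable nodes; each merge prepends one more bit to every
# symbol in the merged pair.
def huffman_code(p):
    """p: dict symbol -> probability. Returns dict symbol -> binary codeword."""
    # Heap entries: (probability, tie-break counter, subtree as {symbol: suffix})
    heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:
        return {sym: "0" for sym in p}
    while len(heap) > 1:
        p0, _, t0 = heapq.heappop(heap)   # least probable
        p1, _, t1 = heapq.heappop(heap)   # second least probable
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

# Usage on the example from the next slide: probabilities (1/4, 1/4, 1/5, 3/20, 3/20).
p = {1: 0.25, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.15}
code = huffman_code(p)
EL = sum(p[s] * len(code[s]) for s in p)
H = -sum(v * math.log2(v) for v in p.values())
print(code)                 # codeword lengths (2, 2, 2, 3, 3) here
print(EL, H)                # 2.3 vs H = 2.2855, matching the slides
```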


Huffman

Ex: $X = \{1, 2, 3, 4, 5\}$ with probabilities $\{1/4, 1/4, 1/5, 3/20, 3/20\}$.

So 4 and 5 should have the longest code length.

We build the tree from left to right, merging the two least probable nodes at each step: $0.15 + 0.15 \to 0.30$; $0.20 + 0.25 \to 0.45$; $0.25 + 0.30 \to 0.55$; $0.55 + 0.45 \to 1.0$.

    X   prob   log 1/p(x)   length   codeword
    1   0.25   2.0          2        00
    2   0.25   2.0          2        10
    3   0.20   2.3          2        11
    4   0.15   2.7          3        010
    5   0.15   2.7          3        011

So we have $E\ell = 2.3$ bits and $H = 2.2855$ bits; as you can see, this code does pretty well (close to entropy).

Some code lengths are shorter/longer than $I(x) = \log 1/p(x)$.

Construction is similar for $D > 2$; in such a case we might add dummy symbols to the alphabet $X$ to allow a $D$-ary tree.


More Huffman vs. Shannon

Shannon code lengths $\ell_i = \lceil \log 1/p_i \rceil$, we saw, are not optimal. A more realistic example: a binary alphabet with probabilities $p(a) = 0.9999$ and $p(b) = 1 - 0.9999$ leads to lengths $\ell_a = 1$ and $\ell_b = 14$ bits, with $E\ell = 1.0013 > 1$.

Optimal code lengths are not always $\le \lceil \log 1/p_i \rceil$. Consider $X$ with probabilities $(1/3, 1/3, 1/4, 1/12)$, with $H = 1.8554$.

Huffman lengths are either $L_{h1} = (2, 2, 2, 2)$ or $L_{h2} = (1, 2, 3, 3)$ (with $EL_{h1} = EL_{h2} = 2$).

But $\lceil \log 1/p_3 \rceil = \lceil -\log(1/4) \rceil = 2 < 3$. Shannon lengths are $L_s = (2, 2, 2, 4)$ with $EL_s = 2.1667 > 2$.

In general, a particular codeword of the optimal code might be longer than Shannon's length, but of course this is not true on average.
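A quick check of these numbers (a sketch, not from the slides):

```python
import math

# Shannon lengths for (1/3, 1/3, 1/4, 1/12) versus the two Huffman length
# assignments quoted on the slide.
p = [1/3, 1/3, 1/4, 1/12]
H = -sum(pi * math.log2(pi) for pi in p)                     # 1.8554
shannon = [math.ceil(math.log2(1 / pi)) for pi in p]         # [2, 2, 2, 4]
E_shannon = sum(pi * li for pi, li in zip(p, shannon))       # 2.1667
for L in [(2, 2, 2, 2), (1, 2, 3, 3)]:                       # Huffman lengths from the slide
    print(L, sum(pi * li for pi, li in zip(p, L)))           # both give 2.0
print(H, shannon, E_shannon)
```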


Optimality of Huffman

Huffman is optimal, i.e., $\sum_i p_i \ell_i$ is minimal over integer lengths.

To show this:

1. First show a lemma that some optimal codes have certain properties (not all, but that there exists an optimal code with these properties).
2. Given a code $C_m$ for $m$ symbols that has said properties, produce a new, simpler code that satisfies the lemma and is simpler to optimize.
3. Ultimately get down to the simple case of two symbols, which is obvious to optimize.


Optimality of Huffman

Lemma 10.3.1

For all distributions, ∃ an optimal instantaneous code (i.e., one of minimal expected length) simultaneously satisfying:

1. If p_j > p_k then ℓ_j ≤ ℓ_k (i.e., the more probable symbol does not have a longer codeword).

2. The two longest codewords have the same length.

3. The two longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof.

Suppose C_m is an optimal code (so L(C_m) is minimal) and choose j, k such that p_j > p_k. We need to show ∃ such a code with ℓ_j ≤ ℓ_k.

Consider C′_m with codewords j and k swapped, meaning

ℓ′_j = ℓ_k and ℓ′_k = ℓ_j    (10.3)

which can only make the code longer, so L(C′_m) ≥ L(C_m).

With this swap, since L(C_m) is minimal, we have

0 ≤ L(C′_m) − L(C_m) = Σ_i p_i ℓ′_i − Σ_i p_i ℓ_i                        (10.4)
                     = p_j ℓ′_j + p_k ℓ′_k − p_j ℓ_j − p_k ℓ_k            (10.5)
                     = p_j ℓ_k + p_k ℓ_j − p_j ℓ_j − p_k ℓ_k              (10.6)
                     = p_j (ℓ_k − ℓ_j) − p_k (ℓ_k − ℓ_j)                  (10.7)
                     = (p_j − p_k)(ℓ_k − ℓ_j),   where (p_j − p_k) > 0    (10.8)

Thus, ℓ_k − ℓ_j ≥ 0, i.e., ℓ_k ≥ ℓ_j when p_j > p_k, and the code satisfies property 1.

In fact, this property is true for all optimal codes (stronger than the “there exists” statement of the lemma).
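A tiny numeric illustration of the exchange argument, with made-up probabilities and lengths (a minimal Python sketch):

    # If a more probable symbol has a *longer* codeword, swapping the two codewords
    # can only shorten the expected length, so no optimal code can look like this.
    p   = [0.5, 0.3, 0.2]       # p_j > p_k for j < k
    ell = [3, 2, 1]             # violates property 1: the most probable symbol is longest

    def expected_length(p, ell):
        return sum(pi * li for pi, li in zip(p, ell))

    print(expected_length(p, ell))          # 2.3
    print(expected_length(p, [1, 2, 3]))    # 1.7 after swapping lengths 3 and 1
    # The drop, 0.6, equals (p_j - p_k)(l_j - l_k) = (0.5 - 0.2)*(3 - 1), matching (10.8).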

Property 2 (longest codewords have the same length).

If the two longest codewords are not the same length, then delete the last bit of the longer one. ⇒ we retain the prefix property, since the longest codeword is unique at that length and has no prefix that is a codeword.

[Figure: the resulting code tree in the two cases — if siblings after deletion; if not siblings after deletion.]

⇒ we have reduced the expected length. ⇒ an optimal code must have its two longest codewords of the same length.

Property 3 (the two longest codewords differ only in the last bit and correspond to the two least likely source symbols).

Due to property 1 (p_k < p_j ⇒ ℓ_k ≥ ℓ_j), if p_k is the smallest probability, then it must have a codeword length no less than that of any other j with p_j > p_k. Similarly, if p_k is the second least probable, then its codeword is no shorter than that of any more probable symbol.

Thus, the two longest codewords have the same length (property 2) and correspond to the two least likely source symbols.

If the two longest codewords are not siblings, we can swap them with other codewords of the same (maximal) length so that they become siblings. I.e., if p_1 ≥ p_2 ≥ · · · ≥ p_m then do the transformation:

[Figure: rearrange the deepest leaves so that the codewords for p_{m−1} and p_m become siblings, differing only in the last bit.]

This does not change the length L = Σ_i p_i ℓ_i.

Thus, if p_1 ≥ p_2 ≥ · · · ≥ p_m, there exists an optimal code with ℓ_1 ≤ ℓ_2 ≤ · · · ≤ ℓ_{m−1} = ℓ_m and where C(x_{m−1}) and C(x_m) differ only in the last bit.

So, next we demonstrate that Huffman is optimal by starting with a code and doing a Huffman operation to produce a new code, where the optimization of the original code depends on a (simpler) optimization of a shorter code.

We continue doing this until the optimal code is apparent.

Assume some (not necessarily optimal) code C_m on m symbols that satisfies the above properties. C_m has codewords {ω_i}_{i=1}^m.

Huffman turns code C_m into code C_{m−1} (with codewords {ω′_i}_{i=1}^{m−1}).

Indices m, m−1 have the least probability and the longest codewords.

    C_m        length     symb. prob
    ω_1        ℓ_1        p_1
    ω_2        ℓ_2        p_2
    ...        ...        ...
    ω_{m−2}    ℓ_{m−2}    p_{m−2}
    ω_{m−1}    ℓ_{m−1}    p_{m−1}
    ω_m        ℓ_m        p_m

Huffman builds the code backwards, taking the two smallest probabilities p_{m−1}, p_m, giving a bit (0 or 1) to each of their codewords, merging them, and passing the result back to another round of Huffman (a minimal recursive sketch of this follows).
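A minimal recursive sketch of this merge-and-backtrack procedure (plain Python, binary code; the function name and data layout are my own, not from the lecture):

    # Merge the two least likely symbols into a super-symbol, solve the smaller
    # problem, then expand the merged codeword by appending a 0 and a 1.
    def huffman(probs):
        """probs: list of (symbol, probability) pairs, at least two; returns {symbol: codeword}."""
        if len(probs) == 2:
            return {probs[0][0]: "0", probs[1][0]: "1"}
        probs = sorted(probs, key=lambda sp: sp[1], reverse=True)
        (sym1, p1), (sym2, p2) = probs[-2], probs[-1]     # the two least likely symbols
        merged = (sym1, sym2)                             # super-symbol handed to C_{m-1}
        code = huffman(probs[:-2] + [(merged, p1 + p2)])  # code C_{m-1}
        w = code.pop(merged)                              # w'_{m-1}
        code[sym1] = w + "0"                              # w_{m-1} = w'_{m-1} 0
        code[sym2] = w + "1"                              # w_m     = w'_{m-1} 1
        return code

    print(huffman([("a", 1/3), ("b", 1/3), ("c", 1/4), ("d", 1/12)]))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}: lengths (1, 2, 3, 3), E L = 2
    # (different tie-breaking could instead give lengths (2, 2, 2, 2), also with E L = 2)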

Huffman implicitly goes from the current code C_m to C_{m−1} as follows:

    prob. (C_{m−1})    C_{m−1} codeword    C_{m−1} length    codeword relationship    length relationship       prob. (C_m)
    p_1                ω′_1                ℓ′_1              ω_1 = ω′_1               ℓ_1 = ℓ′_1                p_1
    p_2                ω′_2                ℓ′_2              ω_2 = ω′_2               ℓ_2 = ℓ′_2                p_2
    ...                ...                 ...               ...                      ...                       ...
    p_{m−2}            ω′_{m−2}            ℓ′_{m−2}          ω_{m−2} = ω′_{m−2}       ℓ_{m−2} = ℓ′_{m−2}        p_{m−2}
    p_{m−1} + p_m      ω′_{m−1}            ℓ′_{m−1}          ω_{m−1} = ω′_{m−1}0      ℓ_{m−1} = ℓ′_{m−1} + 1    p_{m−1}
                                                             ω_m = ω′_{m−1}1          ℓ_m = ℓ′_{m−1} + 1        p_m

Again, ω_i are the C_m codewords and ω′_i are the C_{m−1} codewords (with lengths ℓ_i and ℓ′_i respectively).

Lengths are defined recursively at the time of the Huffman step. All Huffman knows is the relationship between the current lengths and codewords (at step m) and the next lengths and codewords (at step m−1). Huffman is lazy in this way.

We get the following:

L(C_m) = Σ_i p_i ℓ_i                                                          (10.9)
       = Σ_{i=1}^{m−2} p_i ℓ′_i + p_{m−1}(ℓ′_{m−1} + 1) + p_m(ℓ′_{m−1} + 1)   (10.10)
       = Σ_{i=1}^{m−2} p_i ℓ′_i + (p_{m−1} + p_m) ℓ′_{m−1} + p_{m−1} + p_m    (10.11)
       = Σ_{i=1}^{m−1} p′_i ℓ′_i + p_{m−1} + p_m                              (10.12)
       = L(C_{m−1}) + p_{m−1} + p_m,   where the last term doesn't involve the lengths   (10.13)

This reduces the number of length variables we need to optimize over (a numeric check of the recursion follows).
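A small numeric check of (10.13) (a minimal Python sketch; unrolling the recursion shows the Huffman expected length is just the running sum of the merge weights p_{m−1} + p_m):

    import heapq

    def huffman_expected_length(probs):
        """Expected codeword length of a binary Huffman code for the given pmf,
        accumulated as one (p_{m-1} + p_m) term per merge, as in (10.13)."""
        heap = list(probs)
        heapq.heapify(heap)
        total = 0.0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)   # two smallest probabilities
            total += a + b                                     # the p_{m-1} + p_m term
            heapq.heappush(heap, a + b)                        # the merged super-symbol
        return total

    print(huffman_expected_length([1/3, 1/3, 1/4, 1/12]))      # 2.0 (up to float rounding), matching E L_h above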

So the Huffman procedure implies that:

min_{ℓ_{1:m}} L(C_m) = const. + min_{ℓ_{1:m−1}} L(C_{m−1}) = . . .    (10.14)
                     = const. + min_{ℓ_{1:2}} L(C_2)                  (10.15)

where each min step is a Huffman merge, and each preserves the stated properties.

This reduces down to the simple case of a two-symbol code, which is obvious to optimize (use one bit for each source symbol); we then backtrack to construct the full code.

Optimality is preserved at each backtrack step: we kept the properties of the code and reduced the problem to one having only one (obvious) solution.

Theorem 10.3.2

The Huffman coding procedure produces a code of minimal expected length among all instantaneous codes with integer codeword lengths.
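A brute-force sanity check of the theorem on the earlier four-symbol example (my own sketch; by the Kraft inequality and its converse, the integer length vectors searched below are exactly those achievable by some instantaneous code):

    from itertools import product

    p = [1/3, 1/3, 1/4, 1/12]
    MAX_LEN = 6                                    # cap on codeword lengths for the search

    best = min(
        (sum(pi * li for pi, li in zip(p, lengths)), lengths)
        for lengths in product(range(1, MAX_LEN + 1), repeat=len(p))
        if sum(2.0 ** -l for l in lengths) <= 1.0  # Kraft inequality
    )
    print(best)   # (2.0, (1, 2, 3, 3)): no integer-length instantaneous code beats the Huffman value of 2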

Huffman Codes

Huffman coding is a symbol code: we code one symbol at a time.

Is Huffman optimal? But what does optimal mean?

In general, for a symbol code, each symbol in the source alphabet must use an integer number of codeword bits.

This is fine for D-adic distributions, but it can cost up to one extra bit per symbol on average.

Bad example: p(0) = 1 − p(1) = 0.999, so − log p(0) ≈ 0; we should be using close to zero bits per symbol to code this, but Huffman uses 1 (see the short check after this list).

Thus, we need a long block to get any benefit.

In practice, this means we need to store and be able to compute p(x_{1:n}). No problem, right?
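A quick check of the bad example (a minimal Python sketch):

    from math import log2

    p0 = 0.999
    H = -p0 * log2(p0) - (1 - p0) * log2(1 - p0)
    print(H)   # about 0.0114 bits/symbol of entropy...
    # ...yet any symbol code, Huffman included, must spend at least 1 bit per symbol here.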

Can we easily compute p(x_{1:n})?

If |A| is the alphabet size, we need a table of size |A|^n to store these probabilities.

Moreover, it is hard to estimate p(x_{1:n}) accurately. Given any amount of “training data” (to borrow a phrase from machine learning), it is hard to estimate this distribution: many of the possible strings will not occur in any finite sample (sparsity).

Example: how hard is it to find a short, grammatically valid English phrase never before written, using a web search engine? “dogs ate banks on the river” is not found as of Mon, Oct 28, 2013. On Oct 30th, 2019 it is found, but only on a site that sells slides from other university classes. “dogs drank banks on the river” is not found.

Smoothing models are required. This is similar to the language model problem in natural language processing.
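For a sense of scale (a back-of-the-envelope sketch; the alphabet size 27 is just an illustrative choice):

    A = 27                      # e.g., lowercase letters plus space
    for n in (5, 10, 20):
        print(n, A ** n)        # table sizes: about 1.4e7, 2.1e14, 4.2e28 entries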

Huffman has the property that

H(X) ≤ L(Huffman) ≤ H(X) + 1    (10.16)

Bigger block sizes help, but we get

H(X_{1:n}) ≤ L(Block Huffman) ≤ H(X_{1:n}) + 1    (10.17)

for the block, i.e., the extra bit is now at most one per block, or 1/n per symbol.

If H(X_{1:n}) is small (e.g., English text), then this extra bit can still be significant.

If the block gets too long, we have the estimation problem again (it is hard to compute p(x_{1:n})), and blocking also introduces latency (we need to encode and then wait for the end of a block before we can send any bits). A numerical illustration follows.
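A numerical illustration (my own sketch) of how blocking drives the per-symbol rate of Huffman toward the entropy for the p(0) = 0.999 source; the expected length is again computed as the sum of Huffman merge weights:

    import heapq
    from math import log2
    from itertools import product

    def huffman_expected_length(probs):
        """Expected length of a binary Huffman code, as the sum of merge weights."""
        heap = list(probs)
        heapq.heapify(heap)
        total = 0.0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            total += a + b
            heapq.heappush(heap, a + b)
        return total

    p0 = 0.999
    H = -p0 * log2(p0) - (1 - p0) * log2(1 - p0)            # about 0.0114 bits/symbol
    for n in (1, 2, 4, 8):
        block_probs = [p0 ** seq.count(0) * (1 - p0) ** seq.count(1)
                       for seq in product((0, 1), repeat=n)]
        rate = huffman_expected_length(block_probs) / n      # bits per source symbol
        print(n, round(rate, 4), round(H + 1 / n, 4))        # the rate always stays within the H + 1/n bound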

Shannon/Fano/Elias Coding

There are other good symbol coding schemes as well.

In Shannon/Fano/Elias coding, we use the cumulative distribution to compute the bits of the codewords.

Understanding this will be useful for understanding arithmetic coding.

Again, in this case, we have full access to p(x).

X = {1, 2, . . . , m} with p(x) > 0 for all x, so all probabilities are strictly positive (if not, remove the zero-probability symbols).

Define F(x) = ∑_{a≤x} p(a), the cumulative distribution function.

[Figure: staircase plot of F(x) for X = {1, 2, 3, 4}, with jumps of height p(1), p(2), p(3), p(4).]


Define

F̄(x) ≜ ∑_{a<x} p(a) + (1/2) p(x)   (10.18)
     = F(x) − (1/2) p(x)            (10.19)

[Figure: the same staircase F(x) for X = {1, 2, 3, 4}, with F̄(x) marked halfway up each jump.]

F̄(x) is the point midway between F(x−1) and F(x), so since p(x) > 0,

F(x−1) < F̄(x) < F(x)   (10.20)
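A minimal sketch (function and variable names are mine, assuming the pmf is given as a Python list p[0..m−1] for symbols 1..m) of computing F(x) and F̄(x):

    from itertools import accumulate

    def cdf_and_midpoints(p):
        """Return (F, Fbar): F[i] = F(i+1) = sum_{a <= i+1} p(a), and
        Fbar[i] = F(i+1) - p(i+1)/2, the midpoint of the (i+1)-th jump."""
        F = list(accumulate(p))
        Fbar = [F[i] - p[i] / 2 for i in range(len(p))]
        return F, Fbar

    # The dyadic example used later in this lecture:
    p = [0.25, 0.5, 0.125, 0.125]
    F, Fbar = cdf_and_midpoints(p)
    print(F)     # [0.25, 0.75, 0.875, 1.0]
    print(Fbar)  # [0.125, 0.5, 0.8125, 0.9375]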


Since p(x) > 0, a ≠ b ⇒ F(a) ≠ F(b) ⇔ F̄(a) ≠ F̄(b).

So we can use F̄(a) as a non-singular code for a (its binary expansion after the binary point, as we saw earlier in the proof of Kraft for countably infinite lengths). The code will be uniquely decodable (why?). But this code is long; some codewords could be infinitely long.

Hence, truncate F̄(x) to ℓ(x) bits, notated ⌊F̄(x)⌋_ℓ(x). E.g., if ℓ = 4 and F̄(x) = 0.01100100100…, then ⌊F̄(x)⌋_ℓ(x) = 0.0110.

How long must ℓ(x) be to retain unique decodability? Note that truncation to ℓ(x) bits loses less than the last retained bit position:

F̄(x) − ⌊F̄(x)⌋_ℓ(x) < 1/2^{ℓ(x)}   (10.21)

Example: when ℓ = 4,

  F̄(x)         = 0.xxxx xxxx…
− ⌊F̄(x)⌋_4     = 0.xxxx 0000
  difference    = 0.0000 xxxx… < 0.0001 0000 = 1/2^{ℓ(x)}
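A small sketch of the truncation ⌊F̄(x)⌋_ℓ (helper names are mine): keep the first ℓ binary digits after the point and check inequality (10.21):

    import math

    def truncate(value, ell):
        """floor(value * 2^ell) / 2^ell: the first ell binary digits of value in [0, 1)."""
        return math.floor(value * 2 ** ell) / 2 ** ell

    def binary_string(value, ell):
        """The first ell binary digits of value in [0, 1), e.g. '0.1001'."""
        return "0." + format(math.floor(value * 2 ** ell), f"0{ell}b")

    Fbar = 0.6   # an illustrative value of F̄(x); its binary expansion repeats forever
    ell = 4
    t = truncate(Fbar, ell)
    print(binary_string(Fbar, 12), "truncates to", binary_string(t, ell))
    assert 0 <= Fbar - t < 2 ** (-ell)   # inequality (10.21)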


If ℓ(x) = ⌈log 1/p(x)⌉ + 1, then

1/2^{ℓ(x)} = (1/2)·2^{−⌈log 1/p(x)⌉} ≤ (1/2)·2^{−log 1/p(x)} = p(x)/2   (10.22)
           = F̄(x) − F(x−1)                                              (10.23)

giving

F̄(x) − ⌊F̄(x)⌋_ℓ(x) < 1/2^{ℓ(x)} ≤ F̄(x) − F(x−1)   (10.25)
⇒ ⌊F̄(x)⌋_ℓ(x) > F(x−1)                              (10.26)
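A quick numerical check of the key step (10.22) on an assumed example pmf (the non-dyadic one used later): with ℓ(x) = ⌈log 1/p(x)⌉ + 1 we indeed get 2^{−ℓ(x)} ≤ p(x)/2 for every symbol.

    import math

    def sfe_length(px):
        """ell(x) = ceil(log2(1/p(x))) + 1."""
        return math.ceil(math.log2(1 / px)) + 1

    p = [0.25, 0.25, 0.2, 0.15, 0.15]   # assumed example pmf
    for px in p:
        ell = sfe_length(px)
        assert 2 ** (-ell) <= px / 2, (px, ell)   # inequality (10.22)
    print("2^{-ell(x)} <= p(x)/2 holds for every symbol")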


This gives

F(x−1) < ⌊F̄(x)⌋_ℓ(x) ≤ F̄(x) < F(x)   (10.27)

Thus ⌊F̄(x)⌋_ℓ(x) is sufficient to describe x unambiguously if we use ℓ(x) = ⌈log 1/p(x)⌉ + 1 bits.

Is this code prefix-free? Consider the codeword z1 z2 … zℓ to correspond to the half-open interval

[0.z1z2…zℓ , 0.z1z2…zℓ + 1/2^ℓ) = [0.z1z2…zℓ , 0.z1z2…zℓ + 0.00…01)   (10.28)–(10.30)

where 0.z1z2…zℓ = ⌊F̄(x)⌋_ℓ(x); this interval has length 1/2^ℓ (it consists of all binary numbers that start with 0.z1z2…zℓ).


Viewing F(x−1) < ⌊F̄(x)⌋_ℓ(x) ≤ F̄(x) < F(x) together with an interval of length 1/2^ℓ:

[Figure: the segment from F(x−1) to F(x) with F̄(x) marked; the possible values of the truncation ⌊F̄(x)⌋_ℓ(x) lie in (F(x−1), F̄(x)], and the codeword's half-open interval of length 1/2^ℓ starts at the truncation.]

That is, ⌊F̄(x)⌋_ℓ(x) ∈ (F(x−1), F̄(x)]: the truncation lives in this half-open interval.

But 2^{−ℓ(x)} ≤ p(x)/2 and F(x−1) < ⌊F̄(x)⌋_ℓ(x) ≤ F̄(x), so the codeword intervals of distinct symbols are disjoint even if ⌊F̄(x)⌋_ℓ(x) = F̄(x).

Thus, we have a prefix-free code (i.e., if ⌊F̄(x)⌋_ℓ(x) were a prefix of another codeword, that codeword would lie in ⌊F̄(x)⌋_ℓ(x)'s interval, but no other codeword does).
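A short sketch (helper names are mine) that checks this picture numerically: each codeword interval [⌊F̄(x)⌋_ℓ(x), ⌊F̄(x)⌋_ℓ(x) + 2^{−ℓ(x)}) should sit inside (F(x−1), F(x)], so the intervals of distinct symbols cannot overlap.

    import math
    from itertools import accumulate

    def codeword_intervals(p):
        """For each symbol, return (lower, upper) with
        lower = floor-truncation of F̄(x) to ell(x) bits and upper = lower + 2^{-ell(x)}."""
        F = [0.0] + list(accumulate(p))   # F[i] = F(i), with F[0] = 0
        out = []
        for i, px in enumerate(p, start=1):
            fbar = F[i] - px / 2
            ell = math.ceil(math.log2(1 / px)) + 1
            lower = math.floor(fbar * 2 ** ell) / 2 ** ell
            out.append((lower, lower + 2 ** (-ell)))
        return F, out

    p = [0.25, 0.25, 0.2, 0.15, 0.15]     # assumed example pmf
    F, intervals = codeword_intervals(p)
    for i, (lo, hi) in enumerate(intervals, start=1):
        assert F[i - 1] < lo and hi <= F[i]   # nested in (F(x-1), F(x)], hence disjoint
    print("all codeword intervals are disjoint")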


So ℓ(x) = ⌈log 1/p(x)⌉ + 1 suffices, and we have expected length

L = ∑_x p(x) ℓ(x) = ∑_x p(x)(⌈log 1/p(x)⌉ + 1) ≤ H(X) + 2   (10.31)

Ex: dyadic distribution

x    p(x)     F(x)     F̄(x)     F̄(x) in binary    ℓ(x)    codeword
1    0.25     0.25     0.125    0.001             3       001
2    0.5      0.75     0.5      0.10              2       10
3    0.125    0.875    0.8125   0.1101            4       1101
4    0.125    1.0      0.9375   0.1111            4       1111

Eℓ = 2.75 bits, while H = 1.75 bits.

On the other hand, Huffman achieves the entropy here: the Huffman tree (((3,4),1),2) gives Eℓ_huffman = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits.
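Putting the pieces together, a compact sketch of the whole Shannon/Fano/Elias encoder (function names are mine); running it on the pmf above reproduces the table, and it works the same way on the non-dyadic example that follows.

    import math
    from itertools import accumulate

    def sfe_code(p):
        """Shannon/Fano/Elias code for pmf p[0..m-1] (symbols 1..m).
        Returns a list of (Fbar, ell, codeword) triples."""
        F = [0.0] + list(accumulate(p))
        out = []
        for i, px in enumerate(p, start=1):
            fbar = F[i] - px / 2                        # F̄(x)
            ell = math.ceil(math.log2(1 / px)) + 1      # ell(x) = ceil(log 1/p(x)) + 1
            bits = math.floor(fbar * 2 ** ell)          # first ell binary digits of F̄(x)
            out.append((fbar, ell, format(bits, f"0{ell}b")))
        return out

    for fbar, ell, cw in sfe_code([0.25, 0.5, 0.125, 0.125]):
        print(fbar, ell, cw)
    # 0.125 3 001
    # 0.5 2 10
    # 0.8125 4 1101
    # 0.9375 4 1111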


Ex: non-dyadic. Now the binary expansion of F̄(x) can repeat forever (e.g., 0.01010101… repeats the block 01 forever).

x    p(x)    F(x)    F̄(x)     F̄(x) in binary    ℓ(x)    codeword
1    0.25    0.25    0.125    0.001             3       001
2    0.25    0.5     0.375    0.011             3       011
3    0.2     0.7     0.6      0.10011…          4       1001
4    0.15    0.85    0.775    0.1100011…        4       1100
5    0.15    1       0.925    0.1110110…        4       1110

Again, not optimal: H ≈ 2.285, Eℓ = 3.5, while Eℓ_huffman = 2.3, with Huffman tree ((1,(4,5)),(3,2)).
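For comparison with the Huffman numbers quoted above, a small sketch (standard library only, function name mine) that computes Huffman codeword lengths with a heap and checks the expected lengths 1.75 and 2.3:

    import heapq

    def huffman_lengths(p):
        """Binary Huffman codeword lengths for pmf p: repeatedly merge the two least
        probable subtrees; every symbol in a merged subtree gets one bit deeper."""
        heap = [(px, i, [i]) for i, px in enumerate(p)]   # (prob, tie-breaker, symbols)
        heapq.heapify(heap)
        lengths = [0] * len(p)
        counter = len(p)
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            for s in s1 + s2:
                lengths[s] += 1
            heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
            counter += 1
        return lengths

    for pmf in ([0.25, 0.5, 0.125, 0.125], [0.25, 0.25, 0.2, 0.15, 0.15]):
        L = huffman_lengths(pmf)
        print(L, round(sum(px * l for px, l in zip(pmf, L)), 3))
    # [2, 1, 3, 3] 1.75
    # [2, 2, 2, 3, 3] 2.3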


Competitive optimality of Shannon code

On a particular codeword, sometimes the Shannon length is better than the Huffman length and sometimes not (of course, on average Huffman is better).

Q: How likely is it that any other uniquely decodable code is shorter than the Shannon code on a particular codeword? (We analyze the Shannon code only because it is relatively easy to analyze, unlike Huffman lengths, which are defined algorithmically and are thus harder to bound.)

Theorem 10.4.1

Let ℓ(x) be the codeword lengths of the Shannon code and ℓ′(x) the codeword lengths of any other uniquely decodable code. Then

Pr(ℓ(X) ≥ ℓ′(X) + c) ≤ 1/2^{c−1}   (10.32)


Proof of Theorem 10.4.1.

Pr(ℓ(X) ≥ ℓ′(X) + c)
  = Pr(⌈log 1/p(X)⌉ ≥ ℓ′(X) + c)                                  (10.33)
  ≤ Pr(log 1/p(X) ≥ ℓ′(X) + c − 1)                                (10.34)
  = Pr(p(X) ≤ 2^{−ℓ′(X)−c+1})                                     (10.35)
  = ∑_{x: p(x) ≤ 2^{−ℓ′(x)−c+1}} p(x)                             (10.36)
  ≤ ∑_{x: p(x) ≤ 2^{−ℓ′(x)−c+1}} 2^{−ℓ′(x)−c+1}                   (10.37)
  ≤ ∑_x 2^{−ℓ′(x)} 2^{−(c−1)}                                     (10.38)
  ≤ 2^{−(c−1)},   since ∑_x 2^{−ℓ′(x)} ≤ 1 by Kraft               (10.39)

Thus, no other code does better than the Shannon code by c or more bits except with probability at most 2^{−(c−1)}; i.e., no code does much better than Shannon most of the time.
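A quick numerical illustration of the bound (the pmf is the assumed example used earlier; the competing uniquely decodable code is taken to be Huffman, whose lengths for this pmf are [2, 2, 2, 3, 3] as computed in the earlier sketch):

    import math

    def shannon_lengths(p):
        """Shannon code lengths: ell(x) = ceil(log2(1/p(x)))."""
        return [math.ceil(math.log2(1 / px)) for px in p]

    p = [0.25, 0.25, 0.2, 0.15, 0.15]
    ell = shannon_lengths(p)         # [2, 2, 3, 3, 3]
    ell_prime = [2, 2, 2, 3, 3]      # Huffman lengths for this pmf (earlier sketch)
    for c in [1, 2, 3]:
        prob = sum(px for px, l, lp in zip(p, ell, ell_prime) if l >= lp + c)
        bound = 2.0 ** (-(c - 1))
        print(f"c = {c}: Pr[ell(X) >= ell'(X) + c] = {prob:.2f} <= {bound:.2f}")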


But we'd also like to have that Shannon code lengths are shorter more (probabilistically) often, i.e., that

ℓ(x) < ℓ′(x) more often than ℓ(x) > ℓ′(x)   (10.40)

where ℓ(x) are the Shannon lengths and ℓ′(x) are the lengths of any other (prefix) code.

Q: can this be true for all distributions? A: No, since Huffman is better.

Shannon coding is optimal for dyadic distributions, since in that case log 1/p(x) is an integer. In fact, we have

Theorem 10.4.2

For dyadic p(x), ℓ(x) = log 1/p(x), and ℓ′(x) the lengths of any other prefix code,

Pr(ℓ(X) < ℓ′(X)) ≥ Pr(ℓ(X) > ℓ′(X))   (10.41)

with equality iff ℓ′(x) = ℓ(x) for all x.


Proof of Theorem 10.4.2.

Let sign(t) = 1 for t > 0, 0 for t = 0, and −1 for t < 0. Then sign(t) ≤ 2^t − 1 for t = 0, ±1, ±2, ±3, … (for t ≥ 1, 2^t − 1 ≥ 1; for t = 0 both sides are 0; for t ≤ −1, 2^t − 1 > −1). This gives:

Pr(ℓ′(X) < ℓ(X)) − Pr(ℓ′(X) > ℓ(X))                                 (10.42)
  = ∑_{x: ℓ′(x) < ℓ(x)} p(x) − ∑_{x: ℓ′(x) > ℓ(x)} p(x)             (10.43)
  = ∑_x p(x) sign(ℓ(x) − ℓ′(x))                                     (10.44)
  = E[sign(ℓ(X) − ℓ′(X))]                                            (10.45)
  ≤ ∑_x p(x) (2^{ℓ(x)−ℓ′(x)} − 1)                                   (10.46)


But since p(x) is dyadic, p(x) = 2^{−ℓ(x)}, the last bound (10.46) becomes

  ∑_x 2^{−ℓ(x)} (2^{ℓ(x)−ℓ′(x)} − 1)                               (10.47)
  = ∑_x 2^{−ℓ′(x)} − ∑_x 2^{−ℓ(x)}                                  (10.48)
  = ∑_x 2^{−ℓ′(x)} − 1                                              (10.49)
  ≤ 1 − 1,   since ℓ′(x) satisfies Kraft                            (10.50)
  = 0                                                               (10.51)

Thus, Pr(ℓ(X) < ℓ′(X)) ≥ Pr(ℓ(X) > ℓ′(X)), as desired.


Next time

Shannon games and stream codes
