
EE514a Information Theory I Fall Quarter 2019


EE514a – Information Theory I, Fall Quarter 2019

Prof. Jeff Bilmes

University of Washington, Seattle, Department of Electrical & Computer Engineering

Fall Quarter, 2019
https://class.ece.uw.edu/514/bilmes/ee514_fall_2019/

Lecture 10 - Oct 30th, 2019


Class Road Map - IT-I

L1 (9/25): Overview, Communications, Information, Entropy

L2 (9/30): Entropy, Mutual Information, KL-Divergence

L3 (10/2): More KL, Jensen, more Venn, Log Sum, Data Proc. Inequality

L4 (10/7): Data Proc. Ineq., thermodynamics, Stats, Fano

L5 (10/9): M. of Conv, AEP

L6 (10/14): AEP, Source Coding, Types

LX (10/16): Makeup

L7 (10/21): Types, Univ. Src Coding, Stoc. Procs, Entropy Rates

L8 (10/23): Entropy rates, HMMs, Coding

L9 (10/28): Kraft ineq., Shannon Codes, Kraft ineq. II, Huffman

L10 (10/30): Huffman, Shannon/Fano/Elias

L11 (11/4):

LXX (11/6): In class midterm exam

L12 (11/11): Veterans Day (Makeup lecture)

L13 (11/13):

L14 (11/18):

L15 (11/20):

L16 (11/25):

L17 (11/27):

L18 (12/2):

L19 (12/4):

LXX (12/10): Final exam

Finals Week: December 9th–13th.


Cumulative Outstanding Reading

Read chapters 1 and 2 in our book (Cover & Thomas, “Information Theory”) (including Fano’s inequality).

Read chapters 3 and 4 in our book (Cover & Thomas, “Information Theory”).

Read sections 11.1 through 11.3 in our book (Cover & Thomas, “Information Theory”).

Read chapter 4 in our book (Cover & Thomas, “Information Theory”).


Homework

Homework 1 on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), was due Tuesday, Oct 8th, 11:55pm.

Homework 2 on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), due Friday 10/18/2019, 11:45pm.

Homework 3 on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), due Tuesday 10/29/2019, 11:45pm.


Kraft inequality

Theorem 10.2.1 (Kraft inequality)

For any instantaneous code (prefix code) over an alphabet of size $D$, the codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ must satisfy
$$\sum_i D^{-\ell_i} \le 1 \qquad (10.1)$$
Conversely, given a set of codeword lengths satisfying the above inequality, there exists an instantaneous code with these word lengths.

Note: the converse says there exists a code with these lengths, not that all codes with these lengths will satisfy the inequality.

Key point: for $\ell_i$ satisfying Kraft, no further restriction is imposed by also wanting a prefix code, so we might as well use a prefix code (assuming it is easy to find given the lengths).

Connects code existence to a mathematical property on lengths!

Given Kraft lengths, we can construct an instantaneous code (as we will see). Given lengths, we can compute $E[\ell]$ and compare with $H$.
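The converse construction can be made concrete. Below is a minimal sketch (not the lecture's own code; the function names are illustrative) that checks the Kraft sum and builds a canonical prefix code from any set of lengths satisfying it, assigning codewords in order of increasing length.

```python
# Minimal sketch (not from the lecture): check the Kraft sum and build a
# canonical prefix code from lengths that satisfy the Kraft inequality.

def kraft_sum(lengths, D=2):
    """Return sum_i D^{-l_i}; an instantaneous code with these lengths exists iff this is <= 1."""
    return sum(D ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths, D=2):
    """Assign codewords in order of increasing length (canonical code construction)."""
    assert kraft_sum(lengths, D) <= 1 + 1e-12, "Kraft inequality violated"
    digits = "0123456789"[:D]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        value *= D ** (lengths[i] - prev_len)      # pad with zeros up to the new length
        v, word = value, []
        for _ in range(lengths[i]):                # write `value` in base D, width lengths[i]
            word.append(digits[v % D]); v //= D
        codes[i] = "".join(reversed(word))
        value += 1                                 # next codeword at this length
        prev_len = lengths[i]
    return codes

print(kraft_sum([2, 2, 2, 3, 3]))                  # 1.0, so a prefix code exists
print(prefix_code_from_lengths([2, 2, 2, 3, 3]))   # ['00', '01', '10', '110', '111']
```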


Towards Optimal Codes

Summarizing: Prefix code ⇔ Kraft inequality.

Thus, we need only find lengths that satisfy Kraft to find a prefix code.

Goal: find a prefix code with minimum expected length
$$L(C) = \sum_i p_i \ell_i \qquad (10.5)$$

This is a constrained optimization problem:
$$\underset{\{\ell_{1:m}\} \in \mathbb{Z}_{++}^m}{\text{minimize}} \;\; \sum_i p_i \ell_i \qquad (10.6)$$
$$\text{subject to} \;\; \sum_i D^{-\ell_i} \le 1$$

This integer program is an NP-complete optimization, not likely to be efficiently solvable (unless P=NP).


Towards Optimal Codes

Relax the integer constraints on $\ell_i$ for now, and consider the Lagrangian
$$J = \sum_i p_i \ell_i + \lambda \Big( \sum_i D^{-\ell_i} - 1 \Big) \qquad (10.5)$$

Taking derivatives and setting to 0,
$$\frac{\partial J}{\partial \ell_i} = p_i - \lambda D^{-\ell_i} \ln D = 0 \qquad (10.6)$$
$$\Rightarrow\; D^{-\ell_i} = \frac{p_i}{\lambda \ln D} \qquad (10.7)$$
$$\frac{\partial J}{\partial \lambda} = \sum_i D^{-\ell_i} - 1 = 0 \;\Rightarrow\; \lambda = 1/\ln D \qquad (10.8)$$
$$\Rightarrow\; D^{-\ell_i} = p_i, \quad\text{yielding}\quad \ell_i^* = -\log_D p_i \qquad (10.9)$$
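As a quick numeric sanity check of the relaxed solution (a sketch, not part of the lecture): for a dyadic distribution the relaxed lengths $\ell_i^* = -\log_2 p_i$ are already integers, the Kraft sum is exactly 1, and the expected length equals the entropy.

```python
import math

# Quick check (not from the slides): for the dyadic distribution [1/2, 1/4, 1/8, 1/8]
# the relaxed optimum l*_i = -log2 p_i is integral, Kraft holds with equality,
# and E[l*] equals the entropy H2(X).
p = [0.5, 0.25, 0.125, 0.125]
l_star = [-math.log2(pi) for pi in p]            # [1.0, 2.0, 3.0, 3.0]
kraft = sum(2 ** (-l) for l in l_star)           # 1.0
expected_len = sum(pi * li for pi, li in zip(p, l_star))
entropy = -sum(pi * math.log2(pi) for pi in p)
print(l_star, kraft, expected_len, entropy)      # E[l*] == H == 1.75
```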


Optimal Code Lengths

Theorem 10.2.2

Entropy is the minimum expected length. That is, the expected length $L$ of any instantaneous $D$-ary code (which thus satisfies the Kraft inequality) for a r.v. $X$ is such that
$$L \ge H_D(X) \qquad (10.6)$$
with equality iff $D^{-\ell_i} = p_i$.


Optimal Code Lengths

. . . Proof of Theorem 10.2.2.

So we have that $L \ge H_D(X)$.

Equality, $L = H$, is achieved iff $p_i = D^{-\ell_i}$ for all $i$ $\Leftrightarrow$ $-\log_D p_i$ is an integer . . .

. . . in which case $c = \sum_i D^{-\ell_i} = 1$.

Definition 10.2.2 (D-adic)

A probability distribution is called $D$-adic w.r.t. $D$ if each of the probabilities is $= D^{-n}$ for some integer $n$.

Ex: when $D = 2$, the distribution $[\tfrac12, \tfrac14, \tfrac18, \tfrac18] = [2^{-1}, 2^{-2}, 2^{-3}, 2^{-3}]$ is 2-adic.

Thus, we have equality above iff the distribution is appropriately $D$-adic.


Shannon Codes

$L - H = D(p\|r) + \log_D 1/c$, with $c = \sum_i D^{-\ell_i}$.

Thus, to produce a code, we find the closest (in the KL sense) $D$-adic distribution w.r.t. $D$ to $p$ and then construct the code as in the proof of the Kraft inequality converse.

In general, however, unless P=NP, it is hard to find the KL-closest $D$-adic distribution (integer programming problem).

Shannon codes: consider $\ell_i = \lceil \log_D 1/p_i \rceil$ as the code lengths. Then
$$\sum_i D^{-\ell_i} = \sum_i D^{-\lceil \log_D 1/p_i \rceil} \le \sum_i D^{-\log_D 1/p_i} = \sum_i p_i = 1$$

This means the Kraft inequality holds for these lengths, so there is a prefix code (if the lengths were too short there might be a problem, but we're rounding up).

Also, we have a bound on lengths in terms of real numbers:
$$\log_D \frac{1}{p_i} \le \ell_i < \log_D \frac{1}{p_i} + 1 \qquad (10.12)$$
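A short sketch (assumed code, not from the slides) that computes the Shannon lengths $\lceil \log_2 1/p_i \rceil$ for the seven-symbol distribution used later in this lecture, and checks both the Kraft sum and the bound $H \le E\ell < H + 1$:

```python
import math

# Sketch: Shannon code lengths l_i = ceil(log2 1/p_i), checked against the
# Kraft inequality and the bound H(X) <= E[l] < H(X) + 1.
def shannon_lengths(p, D=2):
    return [math.ceil(math.log(1.0 / pi, D)) for pi in p]

p = [0.01, 0.24, 0.05, 0.20, 0.47, 0.01, 0.02]     # the a..g example used later
lengths = shannon_lengths(p)
kraft = sum(2 ** (-l) for l in lengths)
H = -sum(pi * math.log2(pi) for pi in p)
EL = sum(pi * li for pi, li in zip(p, lengths))
print(lengths, kraft)            # Kraft sum <= 1, so a prefix code exists
print(H, EL, H + 1)              # H <= E[l] < H + 1
```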


How bad is one bit?

How bad is this overhead?

Depends on $H$. Efficiency of a code:
$$0 \le \text{Efficiency} \triangleq \frac{H_D(X)}{E\ell(X)} \le 1 \qquad (10.14)$$

If $E\ell(X) = H_D(X) + 1$, then efficiency $\to 1$ as $H(X) \to \infty$.

Efficiency $\to 0$ as $H(X) \to 0$, so the entropy would need to be very large for this to be good.

For small alphabets (or low-entropy distributions, such as close-to-deterministic distributions), it is impossible to have good efficiency. E.g., if $X$ is binary (values in $\{0, 1\}$), then $\max H(X) = 1$, so the best possible efficiency is 50%.


Improving efficiency

Such symbol codes are inherently disadvantaged, unless their distributions are $D$-adic.

We can reduce overhead (improve efficiency) by coding more than one symbol at a time (a block code, or a vector code; the symbol is the vector).

Let $L_n$ be the expected per-symbol length when encoding $n$ symbols $x_{1:n}$:
$$L_n = \frac{1}{n} \sum_{x_{1:n}} p(x_{1:n})\, \ell(x_{1:n}) = \frac{1}{n} E\ell(x_{1:n}) \qquad (10.14)$$

Let's use Shannon coding lengths: weighting
$$\log 1/p_i \le \ell_i < \log 1/p_i + 1$$
by $p_i$ and summing over $i$ (10.15) gives
$$H(X_1, \ldots, X_n) \le E\ell(X_{1:n}) < H(X_1, \ldots, X_n) + 1 \qquad (10.16)$$
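A small sketch illustrating the block-coding bound (the three-symbol distribution here is an illustration, not from the slides): Shannon-coding blocks of $n$ i.i.d. symbols gives a per-symbol expected length $L_n$ with $H(X) \le L_n < H(X) + 1/n$.

```python
import math
from itertools import product

# Sketch: per-symbol expected length L_n when Shannon-coding blocks of n
# i.i.d. symbols. The overhead above H(X) is at most 1/n bit per symbol.
p = {"a": 0.6, "b": 0.3, "c": 0.1}                 # illustrative distribution, not from the slides
H = -sum(v * math.log2(v) for v in p.values())

for n in (1, 2, 4, 8):
    block_probs = [math.prod(p[s] for s in blk) for blk in product(p, repeat=n)]
    L_n = sum(q * math.ceil(math.log2(1.0 / q)) for q in block_probs) / n
    print(n, round(L_n, 4), "vs H =", round(H, 4))  # H <= L_n < H + 1/n
```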


Coding with the wrong distribution

Theorem 10.2.4

Expected length under $p(x)$ of a code with $\ell(x) = \lceil \log 1/q(x) \rceil$ satisfies
$$H(p) + D(p\|q) \le E_p \ell(X) \le H(p) + D(p\|q) + 1 \qquad (10.22)$$

The l.h.s. is the best we can do when coding with the wrong distribution $q$ while the true distribution is $p$.
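To see the role of $D(p\|q)$ concretely, here is a toy sketch (the distributions are illustrative, not from the lecture): design Shannon lengths for $q$ and measure the expected length under the true $p$.

```python
import math

# Sketch: design Shannon lengths for the wrong distribution q, then measure
# the expected length under the true distribution p.
p = [0.5, 0.25, 0.125, 0.125]        # illustrative, not from the lecture
q = [0.25, 0.25, 0.25, 0.25]

lengths_q = [math.ceil(math.log2(1.0 / qi)) for qi in q]
E_p_len = sum(pi * li for pi, li in zip(p, lengths_q))
H_p = -sum(pi * math.log2(pi) for pi in p)
D_pq = sum(pi * math.log2(pi / qi) for pi in p)
# H(p) + D(p||q) <= E_p[l] <= H(p) + D(p||q) + 1
print(E_p_len, H_p + D_pq, H_p + D_pq + 1)   # 2.0, 2.0, 3.0 here
```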


Kraft revisited

We proved the Kraft inequality is true for instantaneous codes (and vice versa). Could it be true for all uniquely decodable codes? Could the larger class of codes have shorter expected codeword lengths? Since it is larger, we might (naïvely) expect that we could do better.

Theorem 10.2.4

Codeword lengths of any uniquely decodable code (not nec. instantaneous) must satisfy the Kraft inequality $\sum_i D^{-\ell_i} \le 1$. Conversely, given a set of codeword lengths that satisfy Kraft, it is possible to construct a uniquely decodable code.

Proof.

The converse we already saw before (given lengths, we can construct a prefix code, which is thus uniquely decodable). Thus we only need to prove the first part. . . .


Huffman coding

A procedure for finding shortest expected length prefix code.

You've probably encountered it in computer science classes (a classic algorithm).

Here we analyze it armed with the tools of information theory.

Quest: given a $p(x)$, find a code (bit strings and set of lengths) that is as short as possible, and also an instantaneous code (prefix free).

We could do this greedily: start at the top and split the potential codewords into even probabilities (i.e., asking the question with highest entropy).

This is similar to the game of 20 questions. We have a set of objects, w.l.o.g. the set $S = \{1, 2, 3, 4, \ldots, m\}$, that occur with frequency proportional to non-negative $(w_1, w_2, \ldots, w_m)$.

We wish to determine an object from this class asking as few questions as possible.

Supposing $X \in S$, each question can take the form "Is $X \in A$?" for some $A \subseteq S$.


20 Questions

Question tree. $S = \{x_1, x_2, x_3, x_4, x_5\}$.

(Figure: a yes/no question tree over $S$; each internal node asks a question of the form "Is $X \in A$?", with Y/N branches leading to the leaves $x_1, \ldots, x_5$, whose probabilities are 0.3, 0.2, 0.2, 0.15, 0.15.)

How do we construct such a tree? Charles Sanders Peirce, 1901, said: "Thus twenty skillful hypotheses will ascertain what two hundred thousand stupid ones might fail to do. The secret of the business lies in the caution which breaks a hypothesis up into its smallest logical components, and only risks one of them at a time."


The Greedy Method for Finding a Code

Suggests a greedy method. “Do next whatever currently looks best.”

Consider the following table:

    symbol   a     b     c     d     e     f     g
    p        0.01  0.24  0.05  0.20  0.47  0.01  0.02

The question that looks best would infer the most about the distribution, i.e., the one with the largest entropy.

$H(X|Y_1) = H(X,Y_1) - H(Y_1) = H(X) - H(Y_1)$, so choosing a question $Y_1$ with large entropy leads to the least "residual" uncertainty $H(X|Y_1)$ about $X$.

Equivalently, we choose the question $Y_1$ with the greatest mutual information about $X$, since in this case $I(Y_1;X) = H(X) - H(X|Y_1) = H(Y_1)$.

Again, questions take the form "Is $X \in A$?" for some $A \subseteq S$, so choosing a yes/no (binary) question means choosing the set $A$.


The Greedy Method

We'll use greedy, and choose the question (set) with the greatest entropy.

If we consider the partition $\{a, b, c, d, e, f, g\} = \{a, b, c, d\} \cup \{e, f, g\}$, the question "Is $X \in \{e, f, g\}$?" would have maximum entropy since $p(X \in \{a, b, c, d\}) = p(X \in \{e, f, g\}) = 0.5$.

This question corresponds to the random variable $Y_1 = 1_{\{X \in \{e,f,g\}\}}$, so $H(Y_1) = 1$ and this would be considered a good question (as good as it gets for a binary r.v.).

Since $H(X|Y_2, Y_1) = H(X,Y_2|Y_1) - H(Y_2|Y_1) = H(X|Y_1) - H(Y_2|Y_1) = H(X) - H(Y_2|Y_1) - H(Y_1)$, we greedily find the next question $Y_2$ that has maximum conditional entropy $H(Y_2|Y_1)$ to minimize the remaining uncertainty about $X$.

Hence, the next question depends on the outcome of the first, and we have either $Y_1 = 0$ ($\equiv X \in \{a, b, c, d\}$) or $Y_1 = 1$ ($\equiv X \in \{e, f, g\}$).
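Below is a rough sketch of this top-down greedy strategy (not the lecture's code; ties between equally balanced splits are broken arbitrarily, so the resulting tree need not match the particular tree drawn on the following slides). It exhaustively searches for the most balanced, i.e. maximum-entropy, yes/no split at each node, which is feasible for a 7-symbol alphabet.

```python
import math
from itertools import combinations

# Sketch of top-down greedy splitting: at each node pick the subset A whose
# probability is closest to half the node's mass (the maximum-entropy question),
# then recurse on both sides. Exhaustive subset search is fine for tiny alphabets.
def greedy_code(symbols, p, prefix=""):
    if len(symbols) == 1:
        return {symbols[0]: prefix or "0"}
    total = sum(p[s] for s in symbols)
    best, best_gap = None, float("inf")
    for r in range(1, len(symbols)):
        for A in combinations(symbols, r):
            gap = abs(sum(p[s] for s in A) - total / 2)
            if gap < best_gap:
                best, best_gap = set(A), gap
    left = [s for s in symbols if s in best]
    right = [s for s in symbols if s not in best]
    code = {}
    code.update(greedy_code(left, p, prefix + "0"))
    code.update(greedy_code(right, p, prefix + "1"))
    return code

p = {"a": 0.01, "b": 0.24, "c": 0.05, "d": 0.20, "e": 0.47, "f": 0.01, "g": 0.02}
code = greedy_code(list(p), p)
EL = sum(p[s] * len(code[s]) for s in p)
H = -sum(pi * math.log2(pi) for pi in p.values())
print(code)
print(EL, H)   # E[l] >= H = 1.9323; the particular greedy tree in the slides gives 2.53
```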


The Greedy Tree

If $Y_1 = 0$ then we can split to maximize entropy as follows: partition $\{a, b, c, d\} = \{a, b\} \cup \{c, d\}$ since $p(\{a, b\}) = p(\{c, d\}) = 1/4$.

This question corresponds to the random variable $Y_2 = 1_{\{X \in \{c,d\}\}}$, so $H(Y_2|Y_1 = 0) = 1$ and this would also be considered a good question (as good as it gets).

If $Y_1 = 1$, then we need to partition the set $\{e, f, g\}$. We can do this in one of three ways:

    case             I                II               III
    split            ({e}, {f, g})    ({e, f}, {g})    ({e, g}, {f})
    prob             (0.47, 0.03)     (0.48, 0.02)     (0.49, 0.01)
    H(Y2|Y1 = 1)     0.3274           0.2423           0.1414

Thus, we would choose case I for $Y_2$ since that is the maximum entropy question. Thus, we get $H(Y_2|Y_1 = 1) = 0.3274$.

Recall: $H(X|Y_2, Y_1) = H(X,Y_2|Y_1) - H(Y_2|Y_1) = H(X|Y_1) - H(Y_2|Y_1) = H(X) - H(Y_2|Y_1) - H(Y_1)$.


The Greedy Tree

Once we get to sets of size 2, we only have one possible question. The greedy strategy always chooses what currently looks best, ignoring the future. Later questions must live with what is available.

Summarizing all questions/splits, and their conditional entropies:

    set                    split                    probabilities   conditional entropy
    {a, b, c, d, e, f, g}  {a, b, c, d}, {e, f, g}  (0.5, 0.5)      H(Y1) = 1
    {a, b, c, d}           {a, b}, {c, d}           (0.25, 0.25)    H(Y2|Y1 = 0) = 1
    {e, f, g}              {e}, {f, g}              (0.47, 0.03)    H(Y2|Y1 = 1) = 0.3274
    {a, b}                 {a}, {b}                 (0.01, 0.24)    H(Y3|Y2 = 0, Y1 = 0) = 0.2423
    {c, d}                 {c}, {d}                 (0.05, 0.20)    H(Y3|Y2 = 1, Y1 = 0) = 0.7219
    {e}                    {e}                      (0.47)          H(Y3|Y2 = 0, Y1 = 1) = 0.0
    {f, g}                 {f}, {g}                 (0.01, 0.02)    H(Y3|Y2 = 1, Y1 = 1) = 0.9183

Also note, $H(X) = H(Y_1, Y_2, Y_3) = 1.9323$, and recall
$$H(Y_1,Y_2,Y_3) = H(Y_1) + H(Y_2|Y_1) + H(Y_3|Y_1,Y_2) \qquad (10.1)$$
$$= H(Y_1) + \sum_{i\in\{0,1\}} H(Y_2|Y_1=i)\, p(Y_1=i) \qquad (10.2)$$
$$\quad + \sum_{i,j\in\{0,1\}} H(Y_3|Y_1=i, Y_2=j)\, p(Y_1=i, Y_2=j)$$


The Greedy Tree

    symbol   a     b     c     d     e     f     g
    p        0.01  0.24  0.05  0.20  0.47  0.01  0.02

This leads to the following (top-down greedily constructed) tree:

(Figure: the greedy question tree; a, b, c, d, f, g receive 3-bit codewords and e receives a 2-bit codeword, e.g. a = 000, b = 001, c = 010, d = 011, e = 10, f = 110, g = 111.)

The expected length of this code is $E\ell = 2.5300$.

Entropy: $H = 1.9323$.

Code efficiency: $H/E\ell = 1.9323/2.5300 = 0.7638$.

Can we do better?


The Greedy Tree vs. Huffman Tree

Left is greedy, right is Huffman.

(Figure: the greedy tree from the previous slide next to the Huffman tree; the Huffman codeword lengths are e: 1, b: 2, d: 3, c: 4, g: 5, a: 6, f: 6, e.g. e = 1, b = 01, d = 001, c = 0001, g = 00001, a = 000000, f = 000001.)

The Huffman lengths have $E\ell_{\text{huffman}} = 1.9700$.

Efficiency of the Huffman code: $H/E\ell_{\text{huffman}} = 1.9323/1.9700 = 0.9809$.

Key problem: the greedy procedure is not optimal in this case.


Greedy

Why is starting from the top and splitting like this non-optimal? Where can it go wrong?

Ex: There may be many ways to get a ≈ 50% split (to achieve high entropy); once done, the split is irrevocable and there is no way to know if the consequences of that split might hurt down the line.


Huffman

The Huffman code tree procedure:

1. Take the two least probable symbols in the alphabet.
2. These two will be given the longest codewords, will have equal length, and will differ in the last digit.
3. Combine these two symbols into a joint symbol having probability equal to the sum, add the joint symbol and then remove the two symbols, and repeat.

Note that it is bottom up (agglomerative clustering) rather than top down (greedy splitting); a code sketch follows below.
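Here is a compact sketch of this bottom-up procedure using a heap (my own illustration, not the lecture's code). Codeword bits are prepended as merges happen, so the last merge contributes the first bit. Tie-breaking may produce different codewords than the slides, but the lengths, and hence the expected length, are what matter.

```python
import heapq
import math

# Sketch of the bottom-up Huffman procedure described above: repeatedly merge
# the two least probable nodes; each merge prepends one more bit to every
# symbol in the merged pair.
def huffman_code(p):
    """p: dict symbol -> probability. Returns dict symbol -> binary codeword."""
    # Heap entries: (probability, tie-break counter, subtree as {symbol: suffix})
    heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:
        return {sym: "0" for sym in p}
    while len(heap) > 1:
        p0, _, t0 = heapq.heappop(heap)   # least probable
        p1, _, t1 = heapq.heappop(heap)   # second least probable
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

# Usage on the example from the next slide: probabilities (1/4, 1/4, 1/5, 3/20, 3/20).
p = {1: 0.25, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.15}
code = huffman_code(p)
EL = sum(p[s] * len(code[s]) for s in p)
H = -sum(v * math.log2(v) for v in p.values())
print(code)                 # codeword lengths (2, 2, 2, 3, 3) here
print(EL, H)                # 2.3 vs H = 2.2855, matching the slides
```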


Huffman

Ex: $X = \{1, 2, 3, 4, 5\}$ with probabilities $\{1/4, 1/4, 1/5, 3/20, 3/20\}$.

So 4 and 5 should have the longest code length.

We build the tree from left to right, merging the two least probable nodes at each step: $0.15 + 0.15 \to 0.30$; $0.20 + 0.25 \to 0.45$; $0.25 + 0.30 \to 0.55$; $0.55 + 0.45 \to 1.0$.

    X   prob   log 1/p(x)   length   codeword
    1   0.25   2.0          2        00
    2   0.25   2.0          2        10
    3   0.20   2.3          2        11
    4   0.15   2.7          3        010
    5   0.15   2.7          3        011

So we have $E\ell = 2.3$ bits and $H = 2.2855$ bits; as you can see, this code does pretty well (close to entropy).

Some code lengths are shorter/longer than $I(x) = \log 1/p(x)$.

Construction is similar for $D > 2$; in such a case we might add dummy symbols to the alphabet $X$ to allow a $D$-ary tree.


More Huffman vs. Shannon

Shannon code lengths $\ell_i = \lceil \log 1/p_i \rceil$, we saw, are not optimal. A more realistic example: a binary alphabet with probabilities $p(a) = 0.9999$ and $p(b) = 1 - 0.9999$ leads to lengths $\ell_a = 1$ and $\ell_b = 14$ bits, with $E\ell = 1.0013 > 1$.

Optimal code lengths are not always $\le \lceil \log 1/p_i \rceil$. Consider $X$ with probabilities $(1/3, 1/3, 1/4, 1/12)$, with $H = 1.8554$.

Huffman lengths are either $L_{h1} = (2, 2, 2, 2)$ or $L_{h2} = (1, 2, 3, 3)$ (with $EL_{h1} = EL_{h2} = 2$).

But $\lceil \log 1/p_3 \rceil = \lceil -\log(1/4) \rceil = 2 < 3$. Shannon lengths are $L_s = (2, 2, 2, 4)$ with $EL_s = 2.1667 > 2$.

In general, a particular codeword of the optimal code might be longer than Shannon's length, but of course this is not true on average.
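A quick check of these numbers (a sketch, not from the slides):

```python
import math

# Shannon lengths for (1/3, 1/3, 1/4, 1/12) versus the two Huffman length
# assignments quoted on the slide.
p = [1/3, 1/3, 1/4, 1/12]
H = -sum(pi * math.log2(pi) for pi in p)                     # 1.8554
shannon = [math.ceil(math.log2(1 / pi)) for pi in p]         # [2, 2, 2, 4]
E_shannon = sum(pi * li for pi, li in zip(p, shannon))       # 2.1667
for L in [(2, 2, 2, 2), (1, 2, 3, 3)]:                       # Huffman lengths from the slide
    print(L, sum(pi * li for pi, li in zip(p, L)))           # both give 2.0
print(H, shannon, E_shannon)
```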


Optimality of Huffman

Huffman is optimal, i.e., $\sum_i p_i \ell_i$ is minimal over integer lengths.

To show this:

1. First show a lemma that some optimal codes have certain properties (not all, but that there exists an optimal code with these properties).
2. Given a code $C_m$ for $m$ symbols that has said properties, produce a new, simpler code that satisfies the lemma and is simpler to optimize.
3. Ultimately get down to the simple case of two symbols, which is obvious to optimize.


Optimality of Huffman

Lemma 10.3.1

For all distributions, ∃ an optimal instantaneous code (i.e., one of minimal expected length) simultaneously satisfying:

1. If p_j > p_k then ℓ_j ≤ ℓ_k (i.e., the more probable symbol does not have a longer codeword).

2. The two longest codewords have the same length.

3. The two longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof.

Suppose C_m is an optimal code (so L(C_m) is minimal) and choose j, k such that p_j > p_k. We need to show ∃ such a code with ℓ_j ≤ ℓ_k.

Consider C′_m with codewords j and k swapped, meaning

ℓ′_j = ℓ_k and ℓ′_k = ℓ_j    (10.3)

which can only make the code longer, so L(C′_m) ≥ L(C_m).

With this swap, since L(C_m) is minimal, we have

0 ≤ L(C′_m) − L(C_m) = Σ_i p_i ℓ′_i − Σ_i p_i ℓ_i                        (10.4)
                     = p_j ℓ′_j + p_k ℓ′_k − p_j ℓ_j − p_k ℓ_k            (10.5)
                     = p_j ℓ_k + p_k ℓ_j − p_j ℓ_j − p_k ℓ_k              (10.6)
                     = p_j (ℓ_k − ℓ_j) − p_k (ℓ_k − ℓ_j)                  (10.7)
                     = (p_j − p_k)(ℓ_k − ℓ_j),   where (p_j − p_k) > 0    (10.8)

Thus, ℓ_k − ℓ_j ≥ 0, i.e., ℓ_k ≥ ℓ_j when p_j > p_k, and the code satisfies property 1.

In fact, this property is true for all optimal codes (stronger than the “there exists” statement of the lemma).
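A tiny numeric illustration of the exchange argument, with made-up probabilities and lengths (a minimal Python sketch):

    # If a more probable symbol has a *longer* codeword, swapping the two codewords
    # can only shorten the expected length, so no optimal code can look like this.
    p   = [0.5, 0.3, 0.2]       # p_j > p_k for j < k
    ell = [3, 2, 1]             # violates property 1: the most probable symbol is longest

    def expected_length(p, ell):
        return sum(pi * li for pi, li in zip(p, ell))

    print(expected_length(p, ell))          # 2.3
    print(expected_length(p, [1, 2, 3]))    # 1.7 after swapping lengths 3 and 1
    # The drop, 0.6, equals (p_j - p_k)(l_j - l_k) = (0.5 - 0.2)*(3 - 1), matching (10.8).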

Property 2 (longest codewords have the same length).

If the two longest codewords are not the same length, then delete the last bit of the longer one. ⇒ we retain the prefix property, since the longest codeword is unique at that length and has no prefix that is a codeword.

[Figure: the resulting code tree in the two cases — if siblings after deletion; if not siblings after deletion.]

⇒ we have reduced the expected length. ⇒ an optimal code must have its two longest codewords of the same length.

Property 3 (the two longest codewords differ only in the last bit and correspond to the two least likely source symbols).

Due to property 1 (p_k < p_j ⇒ ℓ_k ≥ ℓ_j), if p_k is the smallest probability, then it must have a codeword length no less than that of any other j with p_j > p_k. Similarly, if p_k is the second least probable, then its codeword is no shorter than that of any more probable symbol.

Thus, the two longest codewords have the same length (property 2) and correspond to the two least likely source symbols.

If the two longest codewords are not siblings, we can swap them with other codewords of the same (maximal) length so that they become siblings. I.e., if p_1 ≥ p_2 ≥ · · · ≥ p_m then do the transformation:

[Figure: rearrange the deepest leaves so that the codewords for p_{m−1} and p_m become siblings, differing only in the last bit.]

This does not change the length L = Σ_i p_i ℓ_i.

Thus, if p_1 ≥ p_2 ≥ · · · ≥ p_m, there exists an optimal code with ℓ_1 ≤ ℓ_2 ≤ · · · ≤ ℓ_{m−1} = ℓ_m and where C(x_{m−1}) and C(x_m) differ only in the last bit.

So, next we demonstrate that Huffman is optimal by starting with a code and doing a Huffman operation to produce a new code, where the optimization of the original code depends on a (simpler) optimization of a shorter code.

We continue doing this until the optimal code is apparent.

Assume some (not necessarily optimal) code C_m on m symbols that satisfies the above properties. C_m has codewords {ω_i}_{i=1}^m.

Huffman turns code C_m into code C_{m−1} (with codewords {ω′_i}_{i=1}^{m−1}).

Indices m, m−1 have the least probability and the longest codewords.

    C_m        length     symb. prob
    ω_1        ℓ_1        p_1
    ω_2        ℓ_2        p_2
    ...        ...        ...
    ω_{m−2}    ℓ_{m−2}    p_{m−2}
    ω_{m−1}    ℓ_{m−1}    p_{m−1}
    ω_m        ℓ_m        p_m

Huffman builds the code backwards, taking the two smallest probabilities p_{m−1}, p_m, giving a bit (0 or 1) to each of their codewords, merging them, and passing the result back to another round of Huffman (a minimal recursive sketch of this follows).
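A minimal recursive sketch of this merge-and-backtrack procedure (plain Python, binary code; the function name and data layout are my own, not from the lecture):

    # Merge the two least likely symbols into a super-symbol, solve the smaller
    # problem, then expand the merged codeword by appending a 0 and a 1.
    def huffman(probs):
        """probs: list of (symbol, probability) pairs, at least two; returns {symbol: codeword}."""
        if len(probs) == 2:
            return {probs[0][0]: "0", probs[1][0]: "1"}
        probs = sorted(probs, key=lambda sp: sp[1], reverse=True)
        (sym1, p1), (sym2, p2) = probs[-2], probs[-1]     # the two least likely symbols
        merged = (sym1, sym2)                             # super-symbol handed to C_{m-1}
        code = huffman(probs[:-2] + [(merged, p1 + p2)])  # code C_{m-1}
        w = code.pop(merged)                              # w'_{m-1}
        code[sym1] = w + "0"                              # w_{m-1} = w'_{m-1} 0
        code[sym2] = w + "1"                              # w_m     = w'_{m-1} 1
        return code

    print(huffman([("a", 1/3), ("b", 1/3), ("c", 1/4), ("d", 1/12)]))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}: lengths (1, 2, 3, 3), E L = 2
    # (different tie-breaking could instead give lengths (2, 2, 2, 2), also with E L = 2)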

Huffman implicitly goes from the current code C_m to C_{m−1} as follows:

    prob. (C_{m−1})    C_{m−1} codeword    C_{m−1} length    codeword relationship    length relationship       prob. (C_m)
    p_1                ω′_1                ℓ′_1              ω_1 = ω′_1               ℓ_1 = ℓ′_1                p_1
    p_2                ω′_2                ℓ′_2              ω_2 = ω′_2               ℓ_2 = ℓ′_2                p_2
    ...                ...                 ...               ...                      ...                       ...
    p_{m−2}            ω′_{m−2}            ℓ′_{m−2}          ω_{m−2} = ω′_{m−2}       ℓ_{m−2} = ℓ′_{m−2}        p_{m−2}
    p_{m−1} + p_m      ω′_{m−1}            ℓ′_{m−1}          ω_{m−1} = ω′_{m−1}0      ℓ_{m−1} = ℓ′_{m−1} + 1    p_{m−1}
                                                             ω_m = ω′_{m−1}1          ℓ_m = ℓ′_{m−1} + 1        p_m

Again, ω_i are the C_m codewords and ω′_i are the C_{m−1} codewords (with lengths ℓ_i and ℓ′_i respectively).

Lengths are defined recursively at the time of the Huffman step. All Huffman knows is the relationship between the current lengths and codewords (at step m) and the next lengths and codewords (at step m−1). Huffman is lazy in this way.

We get the following:

L(C_m) = Σ_i p_i ℓ_i                                                          (10.9)
       = Σ_{i=1}^{m−2} p_i ℓ′_i + p_{m−1}(ℓ′_{m−1} + 1) + p_m(ℓ′_{m−1} + 1)   (10.10)
       = Σ_{i=1}^{m−2} p_i ℓ′_i + (p_{m−1} + p_m) ℓ′_{m−1} + p_{m−1} + p_m    (10.11)
       = Σ_{i=1}^{m−1} p′_i ℓ′_i + p_{m−1} + p_m                              (10.12)
       = L(C_{m−1}) + p_{m−1} + p_m,   where the last term doesn't involve the lengths   (10.13)

This reduces the number of length variables we need to optimize over (a numeric check of the recursion follows).
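A small numeric check of (10.13) (a minimal Python sketch; unrolling the recursion shows the Huffman expected length is just the running sum of the merge weights p_{m−1} + p_m):

    import heapq

    def huffman_expected_length(probs):
        """Expected codeword length of a binary Huffman code for the given pmf,
        accumulated as one (p_{m-1} + p_m) term per merge, as in (10.13)."""
        heap = list(probs)
        heapq.heapify(heap)
        total = 0.0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)   # two smallest probabilities
            total += a + b                                     # the p_{m-1} + p_m term
            heapq.heappush(heap, a + b)                        # the merged super-symbol
        return total

    print(huffman_expected_length([1/3, 1/3, 1/4, 1/12]))      # 2.0 (up to float rounding), matching E L_h above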

So the Huffman procedure implies that:

min_{ℓ_{1:m}} L(C_m) = const. + min_{ℓ_{1:m−1}} L(C_{m−1}) = . . .    (10.14)
                     = const. + min_{ℓ_{1:2}} L(C_2)                  (10.15)

where each min step is a Huffman merge, and each preserves the stated properties.

This reduces down to the simple case of a two-symbol code, which is obvious to optimize (use one bit for each source symbol); we then backtrack to construct the full code.

Optimality is preserved at each backtrack step: we kept the properties of the code and reduced the problem to one having only one (obvious) solution.

Theorem 10.3.2

The Huffman coding procedure produces a code of minimal expected length among all instantaneous codes with integer codeword lengths.
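A brute-force sanity check of the theorem on the earlier four-symbol example (my own sketch; by the Kraft inequality and its converse, the integer length vectors searched below are exactly those achievable by some instantaneous code):

    from itertools import product

    p = [1/3, 1/3, 1/4, 1/12]
    MAX_LEN = 6                                    # cap on codeword lengths for the search

    best = min(
        (sum(pi * li for pi, li in zip(p, lengths)), lengths)
        for lengths in product(range(1, MAX_LEN + 1), repeat=len(p))
        if sum(2.0 ** -l for l in lengths) <= 1.0  # Kraft inequality
    )
    print(best)   # (2.0, (1, 2, 3, 3)): no integer-length instantaneous code beats the Huffman value of 2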

Huffman Codes

Huffman coding is a symbol code: we code one symbol at a time.

Is Huffman optimal? But what does optimal mean?

In general, for a symbol code, each symbol in the source alphabet must use an integer number of codeword bits.

This is fine for D-adic distributions, but it can cost up to one extra bit per symbol on average.

Bad example: p(0) = 1 − p(1) = 0.999, so − log p(0) ≈ 0; we should be using close to zero bits per symbol to code this, but Huffman uses 1 (see the short check after this list).

Thus, we need a long block to get any benefit.

In practice, this means we need to store and be able to compute p(x_{1:n}). No problem, right?
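A quick check of the bad example (a minimal Python sketch):

    from math import log2

    p0 = 0.999
    H = -p0 * log2(p0) - (1 - p0) * log2(1 - p0)
    print(H)   # about 0.0114 bits/symbol of entropy...
    # ...yet any symbol code, Huffman included, must spend at least 1 bit per symbol here.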

Can we easily compute p(x_{1:n})?

If |A| is the alphabet size, we need a table of size |A|^n to store these probabilities.

Moreover, it is hard to estimate p(x_{1:n}) accurately. Given any amount of “training data” (to borrow a phrase from machine learning), it is hard to estimate this distribution: many of the possible strings will not occur in any finite sample (sparsity).

Example: how hard is it to find a short, grammatically valid English phrase never before written, using a web search engine? “dogs ate banks on the river” is not found as of Mon, Oct 28, 2013. On Oct 30th, 2019 it is found, but only on a site that sells slides from other university classes. “dogs drank banks on the river” is not found.

Smoothing models are required. This is similar to the language model problem in natural language processing.
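For a sense of scale (a back-of-the-envelope sketch; the alphabet size 27 is just an illustrative choice):

    A = 27                      # e.g., lowercase letters plus space
    for n in (5, 10, 20):
        print(n, A ** n)        # table sizes: about 1.4e7, 2.1e14, 4.2e28 entries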

Huffman has the property that

H(X) ≤ L(Huffman) ≤ H(X) + 1    (10.16)

Bigger block sizes help, but we get

H(X_{1:n}) ≤ L(Block Huffman) ≤ H(X_{1:n}) + 1    (10.17)

for the block, i.e., the extra bit is now at most one per block, or 1/n per symbol.

If H(X_{1:n}) is small (e.g., English text), then this extra bit can still be significant.

If the block gets too long, we have the estimation problem again (it is hard to compute p(x_{1:n})), and blocking also introduces latency (we need to encode and then wait for the end of a block before we can send any bits). A numerical illustration follows.
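A numerical illustration (my own sketch) of how blocking drives the per-symbol rate of Huffman toward the entropy for the p(0) = 0.999 source; the expected length is again computed as the sum of Huffman merge weights:

    import heapq
    from math import log2
    from itertools import product

    def huffman_expected_length(probs):
        """Expected length of a binary Huffman code, as the sum of merge weights."""
        heap = list(probs)
        heapq.heapify(heap)
        total = 0.0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            total += a + b
            heapq.heappush(heap, a + b)
        return total

    p0 = 0.999
    H = -p0 * log2(p0) - (1 - p0) * log2(1 - p0)            # about 0.0114 bits/symbol
    for n in (1, 2, 4, 8):
        block_probs = [p0 ** seq.count(0) * (1 - p0) ** seq.count(1)
                       for seq in product((0, 1), repeat=n)]
        rate = huffman_expected_length(block_probs) / n      # bits per source symbol
        print(n, round(rate, 4), round(H + 1 / n, 4))        # the rate always stays within the H + 1/n bound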

Shannon/Fano/Elias Coding

There are other good symbol coding schemes as well.

In Shannon/Fano/Elias coding, we use the cumulative distribution to compute the bits of the codewords.

Understanding this will be useful for understanding arithmetic coding.

Again, in this case, we have full access to p(x).

X = {1, 2, . . . , m} with p(x) > 0 for all x, so all probabilities are strictly positive (if not, remove the zero-probability symbols).

Define F(x) = ∑_{a≤x} p(a), the cumulative distribution function.

[Figure: staircase plot of F(x) for X = {1, 2, 3, 4}, with jumps of height p(1), p(2), p(3), p(4).]


Define

F̄(x) ≜ ∑_{a<x} p(a) + (1/2) p(x)   (10.18)
     = F(x) − (1/2) p(x)            (10.19)

[Figure: the same staircase F(x) for X = {1, 2, 3, 4}, with F̄(x) marked halfway up each jump.]

F̄(x) is the point midway between F(x−1) and F(x), so since p(x) > 0,

F(x−1) < F̄(x) < F(x)   (10.20)
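A minimal sketch (function and variable names are mine, assuming the pmf is given as a Python list p[0..m−1] for symbols 1..m) of computing F(x) and F̄(x):

    from itertools import accumulate

    def cdf_and_midpoints(p):
        """Return (F, Fbar): F[i] = F(i+1) = sum_{a <= i+1} p(a), and
        Fbar[i] = F(i+1) - p(i+1)/2, the midpoint of the (i+1)-th jump."""
        F = list(accumulate(p))
        Fbar = [F[i] - p[i] / 2 for i in range(len(p))]
        return F, Fbar

    # The dyadic example used later in this lecture:
    p = [0.25, 0.5, 0.125, 0.125]
    F, Fbar = cdf_and_midpoints(p)
    print(F)     # [0.25, 0.75, 0.875, 1.0]
    print(Fbar)  # [0.125, 0.5, 0.8125, 0.9375]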


Since p(x) > 0, a ≠ b ⇒ F(a) ≠ F(b) ⇔ F̄(a) ≠ F̄(b).

So we can use F̄(a) as a non-singular code for a (its binary expansion after the binary point, as we saw earlier in the proof of Kraft for countably infinite lengths). The code will be uniquely decodable (why?). But this code is long; some codewords could be infinitely long.

Hence, truncate F̄(x) to ℓ(x) bits, notated ⌊F̄(x)⌋_ℓ(x). E.g., if ℓ = 4 and F̄(x) = 0.01100100100…, then ⌊F̄(x)⌋_ℓ(x) = 0.0110.

How long must ℓ(x) be to retain unique decodability? Note that truncation to ℓ(x) bits loses less than the last retained bit position:

F̄(x) − ⌊F̄(x)⌋_ℓ(x) < 1/2^{ℓ(x)}   (10.21)

Example: when ℓ = 4,

  F̄(x)         = 0.xxxx xxxx…
− ⌊F̄(x)⌋_4     = 0.xxxx 0000
  difference    = 0.0000 xxxx… < 0.0001 0000 = 1/2^{ℓ(x)}
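A small sketch of the truncation ⌊F̄(x)⌋_ℓ (helper names are mine): keep the first ℓ binary digits after the point and check inequality (10.21):

    import math

    def truncate(value, ell):
        """floor(value * 2^ell) / 2^ell: the first ell binary digits of value in [0, 1)."""
        return math.floor(value * 2 ** ell) / 2 ** ell

    def binary_string(value, ell):
        """The first ell binary digits of value in [0, 1), e.g. '0.1001'."""
        return "0." + format(math.floor(value * 2 ** ell), f"0{ell}b")

    Fbar = 0.6   # an illustrative value of F̄(x); its binary expansion repeats forever
    ell = 4
    t = truncate(Fbar, ell)
    print(binary_string(Fbar, 12), "truncates to", binary_string(t, ell))
    assert 0 <= Fbar - t < 2 ** (-ell)   # inequality (10.21)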


If ℓ(x) = ⌈log 1/p(x)⌉ + 1, then

1/2^{ℓ(x)} = (1/2)·2^{−⌈log 1/p(x)⌉} ≤ (1/2)·2^{−log 1/p(x)} = p(x)/2   (10.22)
           = F̄(x) − F(x−1)                                              (10.23)

giving

F̄(x) − ⌊F̄(x)⌋_ℓ(x) < 1/2^{ℓ(x)} ≤ F̄(x) − F(x−1)   (10.25)
⇒ ⌊F̄(x)⌋_ℓ(x) > F(x−1)                              (10.26)
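A quick numerical check of the key step (10.22) on an assumed example pmf (the non-dyadic one used later): with ℓ(x) = ⌈log 1/p(x)⌉ + 1 we indeed get 2^{−ℓ(x)} ≤ p(x)/2 for every symbol.

    import math

    def sfe_length(px):
        """ell(x) = ceil(log2(1/p(x))) + 1."""
        return math.ceil(math.log2(1 / px)) + 1

    p = [0.25, 0.25, 0.2, 0.15, 0.15]   # assumed example pmf
    for px in p:
        ell = sfe_length(px)
        assert 2 ** (-ell) <= px / 2, (px, ell)   # inequality (10.22)
    print("2^{-ell(x)} <= p(x)/2 holds for every symbol")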


This gives

F(x−1) < ⌊F̄(x)⌋_ℓ(x) ≤ F̄(x) < F(x)   (10.27)

Thus ⌊F̄(x)⌋_ℓ(x) is sufficient to describe x unambiguously if we use ℓ(x) = ⌈log 1/p(x)⌉ + 1 bits.

Is this code prefix-free? Consider the codeword z1 z2 … zℓ to correspond to the half-open interval

[0.z1z2…zℓ , 0.z1z2…zℓ + 1/2^ℓ) = [0.z1z2…zℓ , 0.z1z2…zℓ + 0.00…01)   (10.28)–(10.30)

where 0.z1z2…zℓ = ⌊F̄(x)⌋_ℓ(x); this interval has length 1/2^ℓ (it consists of all binary numbers that start with 0.z1z2…zℓ).


Viewing F(x−1) < ⌊F̄(x)⌋_ℓ(x) ≤ F̄(x) < F(x) together with an interval of length 1/2^ℓ:

[Figure: the segment from F(x−1) to F(x) with F̄(x) marked; the possible values of the truncation ⌊F̄(x)⌋_ℓ(x) lie in (F(x−1), F̄(x)], and the codeword's half-open interval of length 1/2^ℓ starts at the truncation.]

That is, ⌊F̄(x)⌋_ℓ(x) ∈ (F(x−1), F̄(x)]: the truncation lives in this half-open interval.

But 2^{−ℓ(x)} ≤ p(x)/2 and F(x−1) < ⌊F̄(x)⌋_ℓ(x) ≤ F̄(x), so the codeword intervals of distinct symbols are disjoint even if ⌊F̄(x)⌋_ℓ(x) = F̄(x).

Thus, we have a prefix-free code (i.e., if ⌊F̄(x)⌋_ℓ(x) were a prefix of another codeword, that codeword would lie in ⌊F̄(x)⌋_ℓ(x)'s interval, but no other codeword does).
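A short sketch (helper names are mine) that checks this picture numerically: each codeword interval [⌊F̄(x)⌋_ℓ(x), ⌊F̄(x)⌋_ℓ(x) + 2^{−ℓ(x)}) should sit inside (F(x−1), F(x)], so the intervals of distinct symbols cannot overlap.

    import math
    from itertools import accumulate

    def codeword_intervals(p):
        """For each symbol, return (lower, upper) with
        lower = floor-truncation of F̄(x) to ell(x) bits and upper = lower + 2^{-ell(x)}."""
        F = [0.0] + list(accumulate(p))   # F[i] = F(i), with F[0] = 0
        out = []
        for i, px in enumerate(p, start=1):
            fbar = F[i] - px / 2
            ell = math.ceil(math.log2(1 / px)) + 1
            lower = math.floor(fbar * 2 ** ell) / 2 ** ell
            out.append((lower, lower + 2 ** (-ell)))
        return F, out

    p = [0.25, 0.25, 0.2, 0.15, 0.15]     # assumed example pmf
    F, intervals = codeword_intervals(p)
    for i, (lo, hi) in enumerate(intervals, start=1):
        assert F[i - 1] < lo and hi <= F[i]   # nested in (F(x-1), F(x)], hence disjoint
    print("all codeword intervals are disjoint")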


So ℓ(x) = ⌈log 1/p(x)⌉ + 1 suffices, and we have expected length

L = ∑_x p(x) ℓ(x) = ∑_x p(x)(⌈log 1/p(x)⌉ + 1) ≤ H(X) + 2   (10.31)

Ex: dyadic distribution

x    p(x)     F(x)     F̄(x)     F̄(x) in binary    ℓ(x)    codeword
1    0.25     0.25     0.125    0.001             3       001
2    0.5      0.75     0.5      0.10              2       10
3    0.125    0.875    0.8125   0.1101            4       1101
4    0.125    1.0      0.9375   0.1111            4       1111

Eℓ = 2.75 bits, while H = 1.75 bits.

On the other hand, Huffman achieves the entropy here: the Huffman tree (((3,4),1),2) gives Eℓ_huffman = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.75 bits.
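Putting the pieces together, a compact sketch of the whole Shannon/Fano/Elias encoder (function names are mine); running it on the pmf above reproduces the table, and it works the same way on the non-dyadic example that follows.

    import math
    from itertools import accumulate

    def sfe_code(p):
        """Shannon/Fano/Elias code for pmf p[0..m-1] (symbols 1..m).
        Returns a list of (Fbar, ell, codeword) triples."""
        F = [0.0] + list(accumulate(p))
        out = []
        for i, px in enumerate(p, start=1):
            fbar = F[i] - px / 2                        # F̄(x)
            ell = math.ceil(math.log2(1 / px)) + 1      # ell(x) = ceil(log 1/p(x)) + 1
            bits = math.floor(fbar * 2 ** ell)          # first ell binary digits of F̄(x)
            out.append((fbar, ell, format(bits, f"0{ell}b")))
        return out

    for fbar, ell, cw in sfe_code([0.25, 0.5, 0.125, 0.125]):
        print(fbar, ell, cw)
    # 0.125 3 001
    # 0.5 2 10
    # 0.8125 4 1101
    # 0.9375 4 1111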


Ex: non-dyadic. Now the binary expansion of F̄(x) can repeat forever (e.g., 0.01010101… repeats the block 01 forever).

x    p(x)    F(x)    F̄(x)     F̄(x) in binary    ℓ(x)    codeword
1    0.25    0.25    0.125    0.001             3       001
2    0.25    0.5     0.375    0.011             3       011
3    0.2     0.7     0.6      0.10011…          4       1001
4    0.15    0.85    0.775    0.1100011…        4       1100
5    0.15    1       0.925    0.1110110…        4       1110

Again, not optimal: H ≈ 2.285, Eℓ = 3.5, while Eℓ_huffman = 2.3, with Huffman tree ((1,(4,5)),(3,2)).
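For comparison with the Huffman numbers quoted above, a small sketch (standard library only, function name mine) that computes Huffman codeword lengths with a heap and checks the expected lengths 1.75 and 2.3:

    import heapq

    def huffman_lengths(p):
        """Binary Huffman codeword lengths for pmf p: repeatedly merge the two least
        probable subtrees; every symbol in a merged subtree gets one bit deeper."""
        heap = [(px, i, [i]) for i, px in enumerate(p)]   # (prob, tie-breaker, symbols)
        heapq.heapify(heap)
        lengths = [0] * len(p)
        counter = len(p)
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            for s in s1 + s2:
                lengths[s] += 1
            heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
            counter += 1
        return lengths

    for pmf in ([0.25, 0.5, 0.125, 0.125], [0.25, 0.25, 0.2, 0.15, 0.15]):
        L = huffman_lengths(pmf)
        print(L, round(sum(px * l for px, l in zip(pmf, L)), 3))
    # [2, 1, 3, 3] 1.75
    # [2, 2, 2, 3, 3] 2.3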


Competitive optimality of Shannon code

On a particular codeword, sometimes the Shannon length is better than the Huffman length and sometimes not (of course, on average Huffman is better).

Q: How likely is it that any other uniquely decodable code is shorter than the Shannon code on a particular codeword? (We analyze the Shannon code only because it is relatively easy to analyze, unlike Huffman lengths, which are defined algorithmically and are thus harder to bound.)

Theorem 10.4.1

Let ℓ(x) be the codeword lengths of the Shannon code and ℓ′(x) the codeword lengths of any other uniquely decodable code. Then

Pr(ℓ(X) ≥ ℓ′(X) + c) ≤ 1/2^{c−1}   (10.32)


Proof of Theorem 10.4.1.

Pr(ℓ(X) ≥ ℓ′(X) + c)
  = Pr(⌈log 1/p(X)⌉ ≥ ℓ′(X) + c)                                  (10.33)
  ≤ Pr(log 1/p(X) ≥ ℓ′(X) + c − 1)                                (10.34)
  = Pr(p(X) ≤ 2^{−ℓ′(X)−c+1})                                     (10.35)
  = ∑_{x: p(x) ≤ 2^{−ℓ′(x)−c+1}} p(x)                             (10.36)
  ≤ ∑_{x: p(x) ≤ 2^{−ℓ′(x)−c+1}} 2^{−ℓ′(x)−c+1}                   (10.37)
  ≤ ∑_x 2^{−ℓ′(x)} 2^{−(c−1)}                                     (10.38)
  ≤ 2^{−(c−1)},   since ∑_x 2^{−ℓ′(x)} ≤ 1 by Kraft               (10.39)

Thus, no other code does better than the Shannon code by c or more bits except with probability at most 2^{−(c−1)}; i.e., no code does much better than Shannon most of the time.
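A quick numerical illustration of the bound (the pmf is the assumed example used earlier; the competing uniquely decodable code is taken to be Huffman, whose lengths for this pmf are [2, 2, 2, 3, 3] as computed in the earlier sketch):

    import math

    def shannon_lengths(p):
        """Shannon code lengths: ell(x) = ceil(log2(1/p(x)))."""
        return [math.ceil(math.log2(1 / px)) for px in p]

    p = [0.25, 0.25, 0.2, 0.15, 0.15]
    ell = shannon_lengths(p)         # [2, 2, 3, 3, 3]
    ell_prime = [2, 2, 2, 3, 3]      # Huffman lengths for this pmf (earlier sketch)
    for c in [1, 2, 3]:
        prob = sum(px for px, l, lp in zip(p, ell, ell_prime) if l >= lp + c)
        bound = 2.0 ** (-(c - 1))
        print(f"c = {c}: Pr[ell(X) >= ell'(X) + c] = {prob:.2f} <= {bound:.2f}")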


But we'd also like to have that Shannon code lengths are shorter more (probabilistically) often, i.e., that

ℓ(x) < ℓ′(x) more often than ℓ(x) > ℓ′(x)   (10.40)

where ℓ(x) are the Shannon lengths and ℓ′(x) are the lengths of any other (prefix) code.

Q: can this be true for all distributions? A: No, since Huffman is better.

Shannon coding is optimal for dyadic distributions, since in that case log 1/p(x) is an integer. In fact, we have

Theorem 10.4.2

For dyadic p(x), ℓ(x) = log 1/p(x), and ℓ′(x) the lengths of any other prefix code,

Pr(ℓ(X) < ℓ′(X)) ≥ Pr(ℓ(X) > ℓ′(X))   (10.41)

with equality iff ℓ′(x) = ℓ(x) for all x.


Proof of Theorem 10.4.2.

Let sign(t) = 1 for t > 0, 0 for t = 0, and −1 for t < 0. Then sign(t) ≤ 2^t − 1 for t = 0, ±1, ±2, ±3, … (for t ≥ 1, 2^t − 1 ≥ 1; for t = 0 both sides are 0; for t ≤ −1, 2^t − 1 > −1). This gives:

Pr(ℓ′(X) < ℓ(X)) − Pr(ℓ′(X) > ℓ(X))                                 (10.42)
  = ∑_{x: ℓ′(x) < ℓ(x)} p(x) − ∑_{x: ℓ′(x) > ℓ(x)} p(x)             (10.43)
  = ∑_x p(x) sign(ℓ(x) − ℓ′(x))                                     (10.44)
  = E[sign(ℓ(X) − ℓ′(X))]                                            (10.45)
  ≤ ∑_x p(x) (2^{ℓ(x)−ℓ′(x)} − 1)                                   (10.46)


But since p(x) is dyadic, p(x) = 2^{−ℓ(x)}, the last bound (10.46) becomes

  ∑_x 2^{−ℓ(x)} (2^{ℓ(x)−ℓ′(x)} − 1)                               (10.47)
  = ∑_x 2^{−ℓ′(x)} − ∑_x 2^{−ℓ(x)}                                  (10.48)
  = ∑_x 2^{−ℓ′(x)} − 1                                              (10.49)
  ≤ 1 − 1,   since ℓ′(x) satisfies Kraft                            (10.50)
  = 0                                                               (10.51)

Thus, Pr(ℓ(X) < ℓ′(X)) ≥ Pr(ℓ(X) > ℓ′(X)), as desired.


Next time

Shannon games and stream codes
