EE514a – Information Theory I, Fall Quarter 2019
Prof. Jeff Bilmes
University of Washington, Seattle
Department of Electrical & Computer Engineering
Fall Quarter, 2019
https://class.ece.uw.edu/514/bilmes/ee514_fall_2019/
Lecture 10 - Oct 30th, 2019
Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F1/54(pg.1/220)
Logistics Review
Class Road Map - IT-I
L1 (9/25): Overview, Communications, Information, Entropy
L2 (9/30): Entropy, Mutual Information, KL-Divergence
L3 (10/2): More KL, Jensen, more Venn, Log Sum, Data Proc. Inequality
L4 (10/7): Data Proc. Ineq., thermodynamics, Stats, Fano
L5 (10/9): M. of Conv., AEP
L6 (10/14): AEP, Source Coding, Types
LX (10/16): Makeup
L7 (10/21): Types, Univ. Src Coding, Stoc. Procs, Entropy Rates
L8 (10/23): Entropy rates, HMMs, Coding
L9 (10/28): Kraft ineq., Shannon Codes, Kraft ineq. II, Huffman
L10 (10/30): Huffman, Shannon/Fano/Elias
L11 (11/4):
LXX (11/6): In-class midterm exam
L12 (11/11): Veterans Day (Makeup lecture)
L13 (11/13):
L14 (11/18):
L15 (11/20):
L16 (11/25):
L17 (11/27):
L18 (12/2):
L19 (12/4):
LXX (12/10): Final exam
Finals Week: December 9th–13th.
Cumulative Outstanding Reading
Read chapters 1 and 2 in our book (Cover & Thomas, “Information Theory”) (including Fano’s inequality).
Read chapters 3 and 4 in our book (Cover & Thomas, “Information Theory”).
Read sections 11.1 through 11.3 in our book (Cover & Thomas, “Information Theory”).
Read chapter 4 in our book (Cover & Thomas, “Information Theory”).
Homework
Homework 1, on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), was due Tuesday, Oct 8th, 11:55pm.
Homework 2, on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), due Friday 10/18/2019, 11:45pm.
Homework 3, on our assignment dropbox (https://canvas.uw.edu/courses/1319497/assignments), due Tuesday 10/29/2019, 11:45pm.
Kraft inequality
Theorem 10.2.1 (Kraft inequality)
For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths ℓ_1, ℓ_2, . . . , ℓ_m must satisfy

  ∑_i D^{-ℓ_i} ≤ 1   (10.1)

Conversely, given a set of codeword lengths satisfying the above inequality, there exists an instantaneous code with these word lengths.
Note: the converse says there exists a code with these lengths, not that all codes with these lengths will satisfy the inequality.
Key point: for ℓ_i satisfying Kraft, no further restriction is imposed by also wanting a prefix code, so we might as well use a prefix code (assuming it is easy to find given the lengths).
This connects code existence to a mathematical property of the lengths!
Given Kraft lengths, we can construct an instantaneous code (as we will see). Given lengths, we can compute E[ℓ] and compare with H.
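As a quick sanity check (a sketch, not from the slides), the Kraft sum is one line of Python; lengths whose sum is ≤ 1 admit a prefix code, lengths with sum > 1 do not:

```python
def kraft_sum(lengths, D=2):
    """Sum of D^{-l} over the codeword lengths; a D-ary prefix code exists iff <= 1."""
    return sum(D ** -l for l in lengths)

# Lengths of the binary prefix code {0, 10, 110, 111}: equality, a full code tree.
assert kraft_sum([1, 2, 3, 3]) == 1.0
# Lengths (1, 2, 3): strict inequality, slack left in the code tree.
assert kraft_sum([1, 2, 3]) == 0.875
# Lengths (1, 1, 2): no binary prefix code can have these lengths.
assert kraft_sum([1, 1, 2]) > 1
```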
Towards Optimal Codes
Summarizing: Prefix code ⇔ Kraft inequality.
Thus, we need only find lengths that satisfy Kraft to find a prefix code.
Goal: find a prefix code with minimum expected length

  L(C) = ∑_i p_i ℓ_i   (10.5)

This is a constrained optimization problem:

  minimize over {ℓ_{1:m}} ∈ Z^m_{++} of ∑_i p_i ℓ_i   (10.6)
  subject to ∑_i D^{-ℓ_i} ≤ 1

This integer program is an NP-complete optimization, not likely to be efficiently solvable (unless P=NP).
Towards Optimal Codes
Relax the integer constraints on ℓ_i for now, and consider the Lagrangian

  J = ∑_i p_i ℓ_i + λ (∑_i D^{-ℓ_i} − 1)   (10.5)

Taking derivatives and setting to 0,

  ∂J/∂ℓ_i = p_i − λ D^{-ℓ_i} ln D = 0   (10.6)
  ⇒ D^{-ℓ_i} = p_i / (λ ln D)   (10.7)
  ∂J/∂λ = ∑_i D^{-ℓ_i} − 1 = 0  ⇒  λ = 1/ln D   (10.8)
  ⇒ D^{-ℓ_i} = p_i, yielding ℓ*_i = −log_D p_i   (10.9)
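Numerically (a small sketch with a made-up distribution): the relaxed optimum ℓ*_i = −log_D p_i meets Kraft with equality, and its expected length equals the entropy H_D(X):

```python
import math

p = [0.6, 0.3, 0.1]   # hypothetical distribution; l* below is generally non-integer
D = 2
lstar = [-math.log(pi, D) for pi in p]                 # l*_i = -log_D p_i
assert math.isclose(sum(D ** -l for l in lstar), 1.0)  # Kraft holds with equality
H = -sum(pi * math.log(pi, D) for pi in p)             # entropy H_D(X)
E_l = sum(pi * l for pi, l in zip(p, lstar))           # expected (relaxed) length
assert math.isclose(E_l, H)                            # achieves the entropy
```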
Optimal Code Lengths
Theorem 10.2.2
Entropy is the minimum expected length. That is, the expected length L of any instantaneous D-ary code (which thus satisfies the Kraft inequality) for a r.v. X is such that

  L ≥ H_D(X)   (10.6)

with equality iff D^{-ℓ_i} = p_i.
Optimal Code Lengths
. . . Proof of Theorem 10.2.2.
So we have that L ≥ H_D(X).
Equality, L = H, is achieved iff p_i = D^{-ℓ_i} for all i ⇔ −log_D p_i is an integer . . .
. . . in which case c = ∑_i D^{-ℓ_i} = 1.
Definition 10.2.2 (D-adic)
A probability distribution is called D-adic w.r.t. D if each of the probabilities is = D^{-n} for some integer n.
Ex: when D = 2, the distribution [1/2, 1/4, 1/8, 1/8] = [2^{-1}, 2^{-2}, 2^{-3}, 2^{-3}] is 2-adic.
Thus, we have equality above iff the distribution is appropriately D-adic.
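For the 2-adic example above, the integer lengths ℓ_i = −log_2 p_i achieve L = H exactly (a small check, not from the slides):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]   # the 2-adic distribution from the example
lengths = [1, 2, 3, 3]          # l_i = -log2 p_i, all integers
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))
assert math.isclose(L, H) and math.isclose(L, 1.75)   # L = H = 1.75 bits
```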
Shannon Codes
L − H = D(p||r) + log_D 1/c, with c = ∑_i D^{-ℓ_i}.
Thus, to produce a code, we find the closest (in the KL sense) D-adic distribution to p, and then construct the code as in the proof of the Kraft inequality converse.
In general, however, unless P=NP, it is hard to find the KL-closest D-adic distribution (an integer programming problem).
Shannon codes: consider ℓ_i = ⌈log_D 1/p_i⌉ as the code lengths. Then

  ∑_i D^{-ℓ_i} = ∑_i D^{-⌈log 1/p_i⌉} ≤ ∑_i D^{-log 1/p_i} = ∑_i p_i = 1

This means the Kraft inequality holds for these lengths, so there is a prefix code (if the lengths were too short there might be a problem, but we’re rounding up).
Also, we have a bound on the lengths in terms of real numbers:

  log_D 1/p_i ≤ ℓ_i < log_D 1/p_i + 1   (10.12)
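A sketch of the Shannon lengths on a hypothetical three-symbol source: they satisfy Kraft, and summing the bound (10.12) against p gives H ≤ E[ℓ] < H + 1:

```python
import math

def shannon_lengths(p, D=2):
    """Shannon code lengths: l_i = ceil(log_D 1/p_i)."""
    return [math.ceil(-math.log(pi, D)) for pi in p]

p = [0.6, 0.3, 0.1]                     # hypothetical source distribution
ls = shannon_lengths(p)                 # -> [1, 2, 4]
assert sum(2 ** -l for l in ls) <= 1    # Kraft holds: a prefix code exists
H = -sum(pi * math.log2(pi) for pi in p)
E_l = sum(pi * li for pi, li in zip(p, ls))
assert H <= E_l < H + 1                 # within one bit of the entropy
```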
How bad is one bit?
How bad is this overhead?
Depends on H. Efficiency of a code:

  0 ≤ Efficiency ≜ H_D(X)/E[ℓ(X)] ≤ 1   (10.14)

If E[ℓ(X)] = H_D(X) + 1, then efficiency → 1 as H(X) → ∞.
Efficiency → 0 as H(X) → 0, so the entropy would need to be very large for this to be good.
For small alphabets (or low-entropy distributions, such as close-to-deterministic distributions), it is impossible to have good efficiency. E.g., if D = {0, 1} then max H(X) = 1, so the best possible efficiency is 50%.
Improving efficiency
Such symbol codes are inherently disadvantaged, unless their distributions are D-adic.
We can reduce overhead (improve efficiency) by coding > 1 symbol at a time (a block code, or a vector code, where the symbol is the vector).
Let L_n be the expected per-symbol length when encoding n symbols x_{1:n}:

  L_n = (1/n) ∑_{x_{1:n}} p(x_{1:n}) ℓ(x_{1:n}) = (1/n) E[ℓ(x_{1:n})]   (10.14)

Let’s use Shannon coding lengths: summing p_i times the bound (log 1/p_i ≤ ℓ_i < log 1/p_i + 1)   (10.15)

  ⇒ H(X_1, . . . , X_n) ≤ E[ℓ(X_{1:n})] < H(X_1, . . . , X_n) + 1   (10.16)
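The payoff can be seen by brute force (a sketch; the Bernoulli source here is chosen arbitrarily): Shannon-coding all |X|^n blocks of an iid source gives a per-symbol length within 1/n of the entropy:

```python
import math
from itertools import product

p = {0: 0.9, 1: 0.1}                           # hypothetical iid binary source
H = -sum(q * math.log2(q) for q in p.values())

def per_symbol_length(n):
    """Expected per-symbol Shannon-code length when coding blocks of n symbols."""
    total = 0.0
    for block in product(p, repeat=n):
        q = math.prod(p[s] for s in block)      # block probability p(x_{1:n})
        total += q * math.ceil(-math.log2(q))   # Shannon length for this block
    return total / n

for n in (1, 2, 4, 8):
    assert H <= per_symbol_length(n) < H + 1 / n   # overhead shrinks as 1/n
```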
Coding with the wrong distribution
Theorem 10.2.4
The expected length under p(x) of a code with ℓ(x) = ⌈log 1/q(x)⌉ satisfies

  H(p) + D(p||q) ≤ E_p[ℓ(X)] ≤ H(p) + D(p||q) + 1   (10.22)

The l.h.s. is the best we can do with the wrong distribution q when the true distribution is p.
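A numerical check of the theorem (made-up p and q): the Shannon code built for q, used when the truth is p, pays H(p) + D(p||q) up to one extra bit:

```python
import math

p = [0.6, 0.3, 0.1]      # true distribution (hypothetical)
q = [0.2, 0.3, 0.5]      # mismatched model used to build the code
lengths = [math.ceil(-math.log2(qi)) for qi in q]          # l(x) = ceil(log 1/q(x))
E_l = sum(pi * li for pi, li in zip(p, lengths))           # expected length under p
H = -sum(pi * math.log2(pi) for pi in p)                   # H(p)
KL = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))  # D(p||q)
assert H + KL <= E_l <= H + KL + 1                         # the bound in (10.22)
```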
Kraft revisited
We proved the Kraft inequality is true for instantaneous codes (and vice versa).
Could it be true for all uniquely decodable codes?
Could the larger class of codes have shorter expected codeword lengths?
Since it is larger, we might (naïvely) expect that we could do better.
Theorem 10.2.4
Codeword lengths of any uniquely decodable code (not nec. instantaneous) must satisfy the Kraft inequality ∑_i D^{-ℓ_i} ≤ 1. Conversely, given a set of codeword lengths that satisfy Kraft, it is possible to construct a uniquely decodable code.
Proof.
The converse we already saw before (given lengths, we can construct a prefix code, which is thus uniquely decodable). Thus we only need to prove the first part. . . .
Huffman coding
A procedure for finding the shortest expected length prefix code.
You’ve probably encountered it in computer science classes (a classic algorithm).
Here we analyze it armed with the tools of information theory.
Quest: given a p(x), find a code (bit strings and a set of lengths) that is as short as possible, and also an instantaneous code (prefix free).
We could do this greedily: start at the top and split the potential codewords into even probabilities (i.e., asking the question with highest entropy).
This is similar to the game of 20 questions. We have a set of objects, w.l.o.g. the set S = {1, 2, 3, 4, . . . , m}, that occur with frequency proportional to non-negative weights (w_1, w_2, . . . , w_m).
We wish to determine an object from this class asking as few questions as possible.
Supposing X ∈ S, each question can take the form “Is X ∈ A?” for some A ⊆ S.
20 Questions
Question tree. S = {x1, x2, x3, x4, x5}.
[Figure: a yes/no question tree over S with leaf probabilities (0.2, 0.2, 0.3, 0.15, 0.15); internal nodes ask questions such as “Is X ∈ {x2, x3}?”, “Is X ∈ {x1}?”, and “Is X ∈ {x4}?”, with Y/N branches leading to the leaves x1, . . . , x5.]
How do we construct such a tree? Charles Sanders Peirce, 1901, said: “Thus twenty skillful hypotheses will ascertain what two hundred thousand stupid ones might fail to do. The secret of the business lies in the caution which breaks a hypothesis up into its smallest logical components, and only risks one of them at a time.”
The Greedy Method for Finding a Code
Suggests a greedy method. “Do next whatever currently looks best.”
Consider the following table:

         a     b     c     d     e     f     g
    p  0.01  0.24  0.05  0.20  0.47  0.01  0.02

The question that looks best would infer the most about the distribution: one with the largest entropy.
H(X|Y1) = H(X, Y1) − H(Y1) = H(X) − H(Y1), so choosing a question Y1 with large entropy leads to the least “residual” uncertainty H(X|Y1) about X.
Identically, we choose the question Y1 with the greatest mutual information about X, since in this case I(Y1; X) = H(X) − H(X|Y1) = H(Y1).
Again, questions take the form “Is X ∈ A?” for some A ⊆ S, so choosing a yes/no (binary) question means choosing the set A.
Huffman Shannon/Fano/Elias Next
The Greedy Method
We’ll use greedy, and choose the question (set) with the greatest entropy.
If we consider the partition {a, b, c, d, e, f, g} = {a, b, c, d} ∪ {e, f, g}, the question “Is X ∈ {e, f, g}?” would have maximum entropy, since p(X ∈ {a, b, c, d}) = p(X ∈ {e, f, g}) = 0.5.
This question corresponds to the random variable Y1 = 1{X ∈ {e, f, g}}, so H(Y1) = 1, and this would be considered a good question (as good as it gets for a binary r.v.).
Since H(X|Y2, Y1) = H(X, Y2|Y1) − H(Y2|Y1) = H(X|Y1) − H(Y2|Y1) = H(X) − H(Y2|Y1) − H(Y1), we greedily find the next question Y2 that has maximum conditional entropy H(Y2|Y1), to minimize the remaining uncertainty about X.
Hence, the next question depends on the outcome of the first, and we have either Y1 = 0 (≡ X ∈ {a, b, c, d}) or Y1 = 1 (≡ X ∈ {e, f, g}).
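The greedy step can be brute-forced over all subsets A (a sketch, fine for 7 symbols): the winner is an even 0.5/0.5 split, so H(Y1) = 1 as claimed:

```python
import math
from itertools import combinations

p = {'a': .01, 'b': .24, 'c': .05, 'd': .20, 'e': .47, 'f': .01, 'g': .02}

def h2(x):
    """Entropy of a yes/no question answered 'yes' with probability x."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def best_question(symbols):
    """Greedy step: the subset A maximizing the entropy of 'Is X in A?'."""
    subsets = (frozenset(A) for r in range(1, len(symbols))
               for A in combinations(sorted(symbols), r))
    return max(subsets, key=lambda A: h2(sum(p[s] for s in A)))

A = best_question(p)
# A maximum-entropy question splits the mass evenly; {e, f, g} from the lecture
# is one such maximizer (ties exist: e.g. {a, e, g} also has mass 0.5).
assert math.isclose(sum(p[s] for s in A), 0.5)
assert math.isclose(h2(sum(p[s] for s in A)), 1.0)
```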
Huffman Shannon/Fano/Elias Next
The Greedy Tree
If Y1 = 0, then we can split to maximize entropy as follows: partition {a, b, c, d} = {a, b} ∪ {c, d}, since p({a, b}) = p({c, d}) = 1/4.
This question corresponds to the random variable Y2 = 1{X ∈ {c, d}}, so H(Y2|Y1 = 0) = 1, and this would also be considered a good question (as good as it gets).
If Y1 = 1, then we need to partition the set {e, f, g}. We can do this in one of three ways:

    case               I               II              III
    split        ({e}, {f, g})   ({e, f}, {g})   ({e, g}, {f})
    prob         (0.47, 0.03)    (0.48, 0.02)    (0.49, 0.01)
    H(Y2|Y1 = 1)    0.3274          0.2423          0.1414

Thus, we would choose case I for Y2, since that is the maximum entropy question. Thus, we get H(Y2|Y1 = 1) = 0.3274.
Recall: H(X|Y2, Y1) = H(X, Y2|Y1) − H(Y2|Y1) = H(X|Y1) − H(Y2|Y1) = H(X) − H(Y2|Y1) − H(Y1).
Huffman Shannon/Fano/Elias Next
The Greedy Tree
Once we get to sets of size 2, we only have one possible question. The greedy strategy always chooses what currently looks best, ignoring the future. Later questions must live with what is available.
Summarizing all questions/splits and their conditional entropies:

    set                    split                     probabilities   conditional entropy
    {a, b, c, d, e, f, g}  {a, b, c, d}, {e, f, g}   (0.5, 0.5)      H(Y1) = 1
    {a, b, c, d}           {a, b}, {c, d}            (0.25, 0.25)    H(Y2|Y1 = 0) = 1
    {e, f, g}              {e}, {f, g}               (0.47, 0.03)    H(Y2|Y1 = 1) = 0.3274
    {a, b}                 {a}, {b}                  (0.01, 0.24)    H(Y3|Y2 = 0, Y1 = 0) = 0.2423
    {c, d}                 {c}, {d}                  (0.05, 0.20)    H(Y3|Y2 = 1, Y1 = 0) = 0.7219
    {e}                    {e}                       (0.47)          H(Y3|Y2 = 0, Y1 = 1) = 0.0
    {f, g}                 {f}, {g}                  (0.01, 0.02)    H(Y3|Y2 = 1, Y1 = 1) = 0.9183

Also note, H(X) = H(Y1, Y2, Y3) = 1.9323, and recall

  H(Y1, Y2, Y3) = H(Y1) + H(Y2|Y1) + H(Y3|Y1, Y2)   (10.1)
                = H(Y1) + ∑_{i∈{0,1}} H(Y2|Y1 = i) p(Y1 = i)   (10.2)
                  + ∑_{i,j∈{0,1}} H(Y3|Y1 = i, Y2 = j) p(Y1 = i, Y2 = j)
Huffman Shannon/Fano/Elias Next
The Greedy Tree
         a     b     c     d     e     f     g
    p  0.01  0.24  0.05  0.20  0.47  0.01  0.02

This leads to the following (top-down greedily constructed) tree:

[Figure: the greedy question tree. The root asks “Is X ∈ {a, b, c, d}?” (each branch has probability 0.5); the left subtree splits {a, b} from {c, d} and then the singletons, giving a, b, c, d the codewords 000, 001, 010, 011; the right subtree splits {e} (codeword 10) from {f, g} (codewords 110, 111).]

The expected length of this code is E[ℓ] = 2.5300.
Entropy: H = 1.9323.
Code efficiency: H/E[ℓ] = 1.9323/2.5300 = 0.7638.
Can we do better?
Huffman Shannon/Fano/Elias Next
The Greedy Tree vs. Huffman Tree
Left is greedy, right is Huffman.

[Figure: the greedy tree (left, codewords 000, . . . , 111 as before) and the Huffman tree (right). Huffman merges the two smallest weights bottom-up: 0.01 + 0.01 → 0.02, 0.02 + 0.02 → 0.04, 0.04 + 0.05 → 0.09, 0.09 + 0.20 → 0.29, 0.29 + 0.24 → 0.53, 0.53 + 0.47 → 1, giving codewords e = 1, b = 01, d = 001, c = 0001, g = 00001, a = 000000, f = 000001.]

The Huffman lengths have E[ℓ_huffman] = 1.9700.
Efficiency of the Huffman code: H/E[ℓ_huffman] = 1.9323/1.9700 = 0.9809.
Key problem: the greedy procedure is not optimal in this case.
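The comparison can be reproduced in a few lines of Python (a sketch of the classic bottom-up merge; only the code lengths are tracked, not the codewords):

```python
import math
from heapq import heapify, heappop, heappush

p = {'a': .01, 'b': .24, 'c': .05, 'd': .20, 'e': .47, 'f': .01, 'g': .02}
H = -sum(q * math.log2(q) for q in p.values())

def huffman_lengths(probs):
    """Codeword lengths from repeatedly merging the two smallest weights."""
    heap = [(q, [s]) for s, q in probs.items()]
    heapify(heap)
    depth = dict.fromkeys(probs, 0)
    while len(heap) > 1:
        q1, s1 = heappop(heap)
        q2, s2 = heappop(heap)
        for s in s1 + s2:           # every merged symbol sinks one level deeper
            depth[s] += 1
        heappush(heap, (q1 + q2, s1 + s2))
    return depth

greedy = {'a': 3, 'b': 3, 'c': 3, 'd': 3, 'e': 2, 'f': 3, 'g': 3}  # tree above
L_greedy = sum(p[s] * l for s, l in greedy.items())
L_huff = sum(p[s] * l for s, l in huffman_lengths(p).items())
assert round(L_greedy, 2) == 2.53 and round(L_huff, 2) == 1.97
assert round(H / L_greedy, 4) == 0.7638 and round(H / L_huff, 4) == 0.9809
```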
Greedy

Why is starting from the top and splitting like this non-optimal? Where can it go wrong?
Ex: there may be many ways to get a ≈ 50% split (to achieve high entropy); once done, the split is irrevocable, and there is no way to know whether the consequences of that split might hurt further down the line.
Huffman

The Huffman code tree procedure:
1 Take the two least probable symbols in the alphabet.
2 These two will be given the longest codewords; they will have equal length and will differ only in the last digit.
3 Combine these two symbols into a joint symbol with probability equal to the sum of the two, add the joint symbol, remove the two original symbols, and repeat.

Note that it is bottom-up (agglomerative clustering) rather than top-down (greedy splitting).
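The three steps above can be sketched directly with a priority queue. This is a minimal illustration (not the course's code): each heap entry carries the symbols merged so far, and every merge increments their depths, which become the codeword lengths.

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Optimal integer codeword lengths, built bottom-up: repeatedly merge
    the two least probable nodes, incrementing the depth of every symbol
    contained in the merged node."""
    ids = count()  # tie-breaker so the heap never compares the dicts
    heap = [(p, next(ids), {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)  # least probable node
        p2, _, d2 = heapq.heappop(heap)  # second least probable node
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(ids), merged))
    depths = heap[0][2]
    return [depths[i] for i in range(len(probs))]

# On the 7-symbol example from the previous slides:
print(huffman_lengths([0.47, 0.24, 0.20, 0.05, 0.02, 0.01, 0.01]))  # [1, 2, 3, 4, 5, 6, 6]
```

Tracking only lengths (not bit strings) is enough here, since any assignment of 0/1 labels to the sibling pairs yields a valid prefix code with the same expected length.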
Huffman

Ex: X = {1, 2, 3, 4, 5} with probabilities {1/4, 1/4, 1/5, 3/20, 3/20}.
So symbols 4 and 5 should have the longest code length.
We build the tree from left to right: step 1 merges 0.15 + 0.15 → 0.3; step 2 merges 0.20 + 0.25 → 0.45; step 3 merges 0.25 + 0.3 → 0.55; step 4 merges 0.55 + 0.45 → 1.0.

X  prob  log 1/p(x)  length  codeword
1  0.25  2.0         2       00
2  0.25  2.0         2       10
3  0.20  2.3         2       11
4  0.15  2.7         3       010
5  0.15  2.7         3       011

So we have Eℓ = 2.3 bits and H = 2.2855 bits; as you can see, this code does pretty well (close to entropy).
Some code lengths are shorter/longer than I(x) = log 1/p(x).
Construction is similar for D > 2; in that case we might add dummy symbols to the alphabet X to allow a full D-ary tree.
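The numbers on this slide can be double-checked in a few lines (a sketch; the codeword lengths are taken from the table above):

```python
import math

probs = [0.25, 0.25, 0.20, 0.15, 0.15]
lengths = [2, 2, 2, 3, 3]  # lengths of codewords 00, 10, 11, 010, 011

H = sum(p * math.log2(1 / p) for p in probs)      # ~2.2855 bits
E = sum(p * l for p, l in zip(probs, lengths))    # 2.3 bits

# Per-symbol ideal lengths log 1/p(x): some codewords are shorter, some longer
ideal = [math.log2(1 / p) for p in probs]         # ~[2.0, 2.0, 2.3, 2.7, 2.7]
print(round(H, 4), round(E, 2))
```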
More Huffman vs. Shannon

The Shannon code lengths ℓi = ⌈log 1/pi⌉ we saw are not optimal. A more realistic example: a binary alphabet with probabilities p(a) = 0.9999 and p(b) = 1 − 0.9999 leads to lengths ℓa = 1 and ℓb = 14 bits, with Eℓ = 1.0013 > 1.
Optimal code lengths are not always ≤ ⌈log 1/pi⌉. Consider X with probabilities (1/3, 1/3, 1/4, 1/12), with H = 1.8554.
Huffman lengths are either Lh1 = (2, 2, 2, 2) or Lh2 = (1, 2, 3, 3) (with ELh1 = ELh2 = 2).
But ⌈log 1/p3⌉ = ⌈−log(1/4)⌉ = 2 < 3. Shannon lengths are Ls = (2, 2, 2, 4) with ELs = 2.1667 > 2.
In general, a particular codeword of the optimal code might be longer than Shannon's length, but of course this is not true on average.
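Both examples above are easy to verify numerically. A quick sketch (the Huffman lengths below use the slide's Lh1 = (2, 2, 2, 2)):

```python
import math

# Example 1: near-deterministic binary source
pa, pb = 0.9999, 1 - 0.9999
la = math.ceil(math.log2(1 / pa))           # 1
lb = math.ceil(math.log2(1 / pb))           # 14
E_binary = pa * la + pb * lb                # 1.0013 > 1

# Example 2: a Shannon length can exceed the optimal length for one symbol
probs = [1/3, 1/3, 1/4, 1/12]
shannon = [math.ceil(math.log2(1 / p)) for p in probs]        # [2, 2, 2, 4]
E_shannon = sum(p * l for p, l in zip(probs, shannon))        # 2.1667
E_huffman = sum(p * l for p, l in zip(probs, [2, 2, 2, 2]))   # 2.0

print((la, lb), round(E_binary, 4), shannon, round(E_shannon, 4))
```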
Optimality of Huffman

Huffman is optimal, i.e., ∑i pi ℓi is minimal, over integer lengths.
To show this:
1 First show a lemma that some optimal codes have certain properties (not all, but that ∃ an optimal code with these properties).
2 Given a code Cm for m symbols that has said properties, produce a new code that satisfies the lemma and is simpler to optimize.
3 Ultimately get down to the simple case of two symbols, which is obvious to optimize.
Lemma 10.3.1

For all distributions, ∃ an optimal instantaneous code (i.e., of minimal expected length) simultaneously satisfying:
1 If pj > pk then ℓj ≤ ℓk (i.e., the more probable symbol does not have a longer length).
2 The two longest codewords have the same length.
3 The two longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof.
Suppose Cm is an optimal code (so L(Cm) is minimal) and choose j, k such that pj > pk. We need to show ∃ a code with ℓj ≤ ℓk.
Consider C′m with codewords j and k swapped, meaning

ℓ′j = ℓk and ℓ′k = ℓj    (10.3)

which can only make the code longer, so L(C′m) ≥ L(Cm) . . .
. . . proof of lemma 10.3.1.

With this swap, since L(Cm) is minimal, we have

0 ≤ L(C′m) − L(Cm) = ∑i pi ℓ′i − ∑i pi ℓi           (10.4)
                   = pj ℓ′j + pk ℓ′k − pj ℓj − pk ℓk   (10.5)
                   = pj ℓk + pk ℓj − pj ℓj − pk ℓk     (10.6)
                   = pj (ℓk − ℓj) − pk (ℓk − ℓj)       (10.7)
                   = (pj − pk)(ℓk − ℓj) ≥ 0            (10.8)

Since pj − pk > 0 by assumption, (10.8) forces ℓk − ℓj ≥ 0.
Thus, ℓk ≥ ℓj when pj > pk, and the code satisfies property 1.
In fact, this property is true for all optimal codes (stronger than the "there exists" statement of the lemma).
. . .
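Steps (10.4)–(10.8) can be sanity-checked numerically: all terms except j and k cancel in the difference, and if the more probable symbol had the strictly longer codeword, the swap would strictly shorten the code, contradicting optimality. A tiny sketch with made-up values:

```python
# Hypothetical values with p_j > p_k and (to force a contradiction) l_j > l_k
pj, pk = 0.4, 0.1
lj, lk = 3, 2

# L(C'_m) - L(C_m): only the j and k terms survive the difference
delta = (pj * lk + pk * lj) - (pj * lj + pk * lk)

# Matches the factored form (10.8)
assert abs(delta - (pj - pk) * (lk - lj)) < 1e-12
assert delta < 0  # the swap shortens the code, so C_m was not optimal
print(delta)
```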
. . . proof of lemma 10.3.1.

Property 2 (the two longest codewords have the same length).
If the two longest codewords are not the same length, then delete the last bit of the longer one. ⇒ We retain the prefix property, since the longest codeword is unique in its length and no other codeword is a prefix of it.

[Figure: the two cases — whether or not the shortened codeword has a sibling after the deletion.]

⇒ We have reduced the expected length. ⇒ An optimal code must have its two longest codewords of the same length.
. . .
Optimality of Huffman
. . . proof of lemma 10.3.1.
Property 3 (the two longest codewords differ only in the last bit & correspond to the two least likely source symbols).
Due to property 1 (pk < pj ⇒ ℓk ≥ ℓj), if pk is the smallest probability, then its codeword length is no less than that of any other symbol j with pj > pk. Similarly, if pk is the second least probable, its codeword length is no less than that of any more probable symbol.
Thus, the two longest codewords have the same length (Property 2) and correspond to the two least likely source symbols.
If the two longest codewords are not siblings, we can swap them. I.e., if p1 ≥ p2 ≥ · · · ≥ pm, then do the transformation:
(Figure: the codewords for pm−1 and pm are swapped so that they become siblings, differing only in the last bit.)
. . .
This does not change the expected length L = ∑i pi ℓi.
Thus, if p1 ≥ p2 ≥ · · · ≥ pm, there exists an optimal code with ℓ1 ≤ ℓ2 ≤ · · · ≤ ℓm−1 = ℓm, where C(xm−1) and C(xm) differ only in the last bit.
So, next we demonstrate that Huffman is optimal by starting with a code and performing a Huffman operation to produce a new code, where optimizing the original code reduces to a (simpler) optimization of a shorter code.
We continue doing this until the optimal code is apparent.
Assume (some not necessarily optimal) code Cm (on m symbols) that satisfies the above properties. Cm has codewords {w1, . . . , wm}.
Huffman turns code Cm into code Cm−1 (with codewords {w′1, . . . , w′m−1}).
Indices m, m−1 have the least probability and the longest codewords.

    Cm      length   symb. prob
    w1      ℓ1       p1
    w2      ℓ2       p2
    ...     ...      ...
    wm−2    ℓm−2     pm−2
    wm−1    ℓm−1     pm−1
    wm      ℓm       pm
Huffman builds the code backwards: it takes the two smallest probabilities pm−1, pm, appends a bit (0 or 1) to each codeword, merges them, and passes the result back to another round of Huffman.
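The repeated merge can be sketched in Python (a minimal illustration with a hypothetical distribution; not code from the lecture):

```python
import heapq

def huffman_lengths(probs):
    """Codeword lengths via repeated merging of the two least likely symbols."""
    # Heap entries: (probability, tie-breaker, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)  # two smallest probabilities
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                # the merge prepends one bit to every
            lengths[i] += 1              # codeword in the two merged subtrees
        heapq.heappush(heap, (p1 + p2, len(probs) + len(heap), s1 + s2))
    return lengths

print(huffman_lengths([0.4, 0.3, 0.2, 0.1]))  # -> [1, 2, 3, 3]
```

Only the lengths are tracked here; the actual 0/1 bit assignments can be read off the merge tree afterwards, as the slides describe.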
Huffman implicitly goes from current code Cm to Cm−1 as follows:
    symb. prob. (Cm−1)   Cm−1 code   Cm−1 len.   code relationship   length relationship   symb. prob. (Cm)
    p1                   w′1         ℓ′1         w1 = w′1            ℓ1 = ℓ′1              p1
    p2                   w′2         ℓ′2         w2 = w′2            ℓ2 = ℓ′2              p2
    ...                  ...         ...         ...                 ...                   ...
    pm−2                 w′m−2       ℓ′m−2       wm−2 = w′m−2        ℓm−2 = ℓ′m−2          pm−2
    pm−1 + pm            w′m−1       ℓ′m−1       wm−1 = w′m−1 0      ℓm−1 = ℓ′m−1 + 1      pm−1
                                                 wm = w′m−1 1        ℓm = ℓ′m−1 + 1        pm
Again, wi are the Cm codewords and w′i are the Cm−1 codewords.
Lengths are defined recursively at the time of the Huffman step. All Huffman knows is the relationship between the current lengths and codewords (at step m) and the next lengths and codewords (at step m−1). Huffman is lazy in this way.
We get the following:
L(Cm) = ∑_i pi ℓi                                                  (10.9)
      = ∑_{i=1}^{m−2} pi ℓ′i + pm−1(ℓ′m−1 + 1) + pm(ℓ′m−1 + 1)     (10.10)
      = ∑_{i=1}^{m−2} pi ℓ′i + (pm−1 + pm) ℓ′m−1 + pm−1 + pm       (10.11)
      = ∑_{i=1}^{m−1} p′i ℓ′i + pm−1 + pm                          (10.12)
      = L(Cm−1) + pm−1 + pm                                        (10.13)

where in (10.13) the term pm−1 + pm doesn’t involve the lengths.
This reduces the number of variables we need to optimize over.
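Eq. (10.13) can be checked numerically; the distribution and the Cm−1 lengths below are hypothetical (any valid prefix-code lengths work):

```python
# Hypothetical distribution (sorted descending) and a valid set of
# C_{m-1} codeword lengths; the last two symbols of C_m are the merged pair.
p = [0.4, 0.3, 0.2, 0.06, 0.04]     # m = 5 symbols
len_m1 = [1, 2, 3, 3]               # lengths l'_i of C_{m-1} (m-1 = 4 words)

# C_m lengths per the merge rule: l_i = l'_i for i <= m-2,
# and l_{m-1} = l_m = l'_{m-1} + 1.
len_m = len_m1[:-1] + [len_m1[-1] + 1] * 2

p_m1 = p[:-2] + [p[-2] + p[-1]]     # merged probabilities p'_i
L_m = sum(pi * li for pi, li in zip(p, len_m))
L_m1 = sum(pi * li for pi, li in zip(p_m1, len_m1))
assert abs(L_m - (L_m1 + p[-2] + p[-1])) < 1e-9   # Eq. (10.13) holds
```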
So the Huffman procedure implies that:
min_{ℓ1:m} L(Cm) = const. + min_{ℓ1:m−1} L(Cm−1) = . . .    (10.14)
                 = const. + min_{ℓ1:2} L(C2)                (10.15)
where each min step is Huffman, and each preserves the stated properties.
This reduces down to a length-2 code, which is obvious to optimize (use one bit for each source symbol), and then we backtrack to construct the code.
Optimality is preserved at each backtrack step: we kept the properties of the code and reduced the problem to one having only one (obvious) solution.
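The recursion in Eqs. (10.14)–(10.15) can be written out directly as a toy Python sketch (the distribution in the usage line is hypothetical):

```python
def optimal_L(p):
    """Expected length via the Huffman recursion: merge the two least
    likely symbols, recurse, then add back p_{m-1} + p_m (Eq. 10.13)."""
    p = sorted(p, reverse=True)
    if len(p) == 2:
        return p[0] + p[1]               # L(C_2): one bit per symbol
    merged = p[:-2] + [p[-2] + p[-1]]    # Huffman merge step
    return optimal_L(merged) + p[-2] + p[-1]

print(round(optimal_L([0.4, 0.3, 0.2, 0.1]), 6))  # -> 1.9
```

The base case is the obvious two-symbol optimum; the constant added at each level is exactly the pm−1 + pm term from Eq. (10.13).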
Theorem 10.3.2
The Huffman coding procedure produces an optimal code among all codes with integer codeword lengths.
Huffman Codes
Huffman coding is a symbol code: we code one symbol at a time.
Is Huffman optimal?
But what does optimal mean?
In general, for a symbol code, each symbol in the source alphabet must use an integer number of codeword bits.
This is OK for D-adic distributions, but could use up to one extra bit per symbol on average.
Bad example: p(0) = 1 − p(1) = 0.999; then − log p(0) ≈ 0, so we should be using close to zero bits per symbol to code this, but Huffman uses 1.
Thus, we need a long block to get any benefit.
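For the bad example above, the gap is easy to quantify (a small Python check; the loop just evaluates the standard H + 1/n upper bound on expected bits per symbol when coding i.i.d. blocks of n symbols):

```python
from math import log2

p0 = 0.999
# Entropy of the source, in bits per symbol.
H = -p0 * log2(p0) - (1 - p0) * log2(1 - p0)
print(round(H, 4))  # -> 0.0114, yet a one-symbol Huffman code spends 1 bit

# Coding i.i.d. blocks of n symbols: expected bits per symbol < H + 1/n.
for n in (1, 10, 100, 1000):
    print(n, round(H + 1 / n, 4))
```

So the per-symbol overhead shrinks like 1/n, which is why long blocks are needed to approach the entropy.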
In practice, this means we need to store and be able to compute p(x1:n).
No problem, right?
Can we easily compute p(x1:n)?
If |A| is the alphabet size, we need a table of size |A|^n to store these probabilities.
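As a sense of scale (hypothetical numbers: a 27-symbol alphabet and blocks of length 20):

```python
A, n = 27, 20          # e.g. 26 letters plus space, blocks of 20 symbols
table_size = A ** n    # one probability entry per possible string
print(table_size)      # about 4.2e28 entries -- hopeless to store directly
```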
Moreover, it is hard to estimate p(x1:n) accurately. Given any amount of “training data” (to borrow a phrase from machine learning), it is hard to estimate this distribution: many of the possible strings in any finite sample will not occur (sparsity).
Example: how hard is it to find a short, grammatically valid English phrase never before written, using a web search engine? “dogs ate banks on the river” is not found as of Mon, Oct 28, 2013. On Oct 30th, 2019 it is found, but only on a site that sells slides from other university classes. “dogs drank banks on the river” is not found.
Smoothing models are required; this is similar to the language modeling problem in natural language processing.
Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F38/54(pg.132/220)
Huffman Shannon/Fano/Elias Next
Huffman Codes
Can we easily compute p(x1:n)?
If |A| is the alphabet size, we need a table of size |A|n to store theseprobabilities.
Moreover, it is hard to estimate p(x1:n) accurately. Given anamount of “training data” (to borrow a phrase from machinelearning), it is hard to estimate this distribution. Many of thepossible strings in any finite sample size will not occur (sparsity).
Example: how hard is it to find a short grammatically valid Englishprhase never before written using a web search engine?
“dogs atebanks on the river” is not found as of Mon, Oct 28, 2013.On Oct30th, 2019 it is found but only on a site that sells slides from otheruniversity classes. “dogs drank banks on the river” is not found
Smoothing models are required. Similar to the language modelproblem in natural language processing.
Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F38/54(pg.133/220)
Huffman Shannon/Fano/Elias Next
Huffman Codes
Can we easily compute p(x1:n)?
If |A| is the alphabet size, we need a table of size |A|n to store theseprobabilities.
Moreover, it is hard to estimate p(x1:n) accurately. Given anamount of “training data” (to borrow a phrase from machinelearning), it is hard to estimate this distribution. Many of thepossible strings in any finite sample size will not occur (sparsity).
Example: how hard is it to find a short grammatically valid Englishprhase never before written using a web search engine?
“dogs atebanks on the river” is not found as of Mon, Oct 28, 2013.On Oct30th, 2019 it is found but only on a site that sells slides from otheruniversity classes. “dogs drank banks on the river” is not found
Smoothing models are required. Similar to the language modelproblem in natural language processing.
Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F38/54(pg.134/220)
Huffman Codes

Huffman coding has the property that

H(X) ≤ L(Huffman) ≤ H(X) + 1   (10.16)

Bigger block sizes help, but we get

H(X_{1:n}) ≤ L(Block Huffman) ≤ H(X_{1:n}) + 1   (10.17)

for the block.

If H(X_{1:n}) is small (e.g., English text), then this extra bit can be significant.

If the block gets too long, we have the estimation problem again (it is hard to compute p(x_{1:n})), and it also introduces latency (we must encode and then wait for the end of a block before we can send any bits).
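To make the per-symbol bound concrete, here is a small sketch (my own, not from the lecture) that computes Huffman codeword lengths with Python's heapq and checks H(X) ≤ L < H(X) + 1 on a toy pmf:

```python
import heapq
from itertools import count
from math import log2

def huffman_lengths(p):
    """Optimal prefix-code lengths for pmf p, via Huffman's merging algorithm."""
    tie = count()  # unique tiebreaker so the heap never compares the symbol lists
    heap = [(pi, next(tie), [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # each merge deepens every merged symbol by one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return lengths

p = [0.5, 0.25, 0.125, 0.125]
ell = huffman_lengths(p)
H = sum(pi * log2(1 / pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, ell))
print(ell, H, L)  # lengths [1, 2, 3, 3]; H == L == 1.75 since the pmf is dyadic
```

For a dyadic pmf the bound is tight with equality; for non-dyadic pmfs, L sits strictly between H(X) and H(X) + 1.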
Shannon/Fano/Elias Coding

There are other good symbol coding schemes as well.

In Shannon/Fano/Elias coding, we use the cumulative distribution to compute the bits of the codewords.

Understanding this will be useful for understanding arithmetic coding.

Again, in this case, we have full access to p(x).

X = {1, 2, . . . , m} with p(x) > 0, so all probabilities are strictly positive (if not, remove the zero-probability symbols).
Shannon/Fano/Elias Coding

Define F(x) = ∑_{a≤x} p(a).

[Figure: the staircase CDF F(x) over symbols 1, 2, 3, 4, . . . , with steps of height p(1), p(2), p(3), p(4).]
Shannon/Fano/Elias Coding
Define

F̄(x) ≜ ∑_{a<x} p(a) + (1/2) p(x)   (10.18)
     = F(x) − (1/2) p(x)   (10.19)

[Figure: the staircase CDF over symbols 1, 2, 3, 4, . . . , marking both F(x) and the midpoint F̄(x).]

F̄(x) is the point midway between F(x−1) and F(x), so since p(x) > 0,

F(x−1) < F̄(x) < F(x)   (10.20)
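As a quick illustration (my own sketch, not part of the slides), F(x) and F̄(x) can be computed from a pmf with a running sum; the helper name cdf_and_midpoints is hypothetical:

```python
from itertools import accumulate

def cdf_and_midpoints(p):
    """Given a pmf p over symbols 1..m (0-indexed here), return (F, Fbar) where
    F[x] = sum_{a<=x} p(a)  and  Fbar[x] = F[x] - p[x]/2."""
    F = list(accumulate(p))
    Fbar = [Fi - pi / 2 for Fi, pi in zip(F, p)]
    return F, Fbar

p = [0.25, 0.5, 0.125, 0.125]
F, Fbar = cdf_and_midpoints(p)
print(F)     # [0.25, 0.75, 0.875, 1.0]
print(Fbar)  # [0.125, 0.5, 0.8125, 0.9375]
```

For this dyadic pmf all values are exactly representable in binary floating point, and each F̄(x) sits strictly between F(x−1) and F(x), matching (10.20).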
Shannon/Fano/Elias Coding

Since p(x) > 0, a ≠ b ⇒ F̄(a) ≠ F̄(b) ⇔ F(a) ≠ F(b).

So we can use F̄(a) as a non-singular code for a (the binary expansion after the binary point, as we saw earlier in the proof of Kraft for countably infinite lengths).

The code will be uniquely decodable (why?). But this code is long; some codewords could be of infinite length.

Hence, truncate F̄(x) to ℓ(x) bits, notated ⌊F̄(x)⌋_{ℓ(x)}. E.g., if ℓ = 4 and F̄(x) = 0.01100100100 . . . , then ⌊F̄(x)⌋_{ℓ(x)} = 0.0110.

How long must ℓ(x) be to retain unique decodability? Note that truncation loses less than one unit in the last kept bit position:

F̄(x) − ⌊F̄(x)⌋_{ℓ(x)} < 1/2^{ℓ(x)}   (10.21)

Example: when ℓ = 4, the truncation ⌊F̄(x)⌋_4 keeps 0.xxxx of 0.xxxx xxxx, and the difference 0.xxxx xxxx − 0.xxxx 0000 = 0.0000 xxxx < 0.0001 0000 = 1/2^{ℓ(x)}.
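Truncation to ℓ bits is just floor(F̄ · 2^ℓ)/2^ℓ; a minimal sketch (the helper names and the numeric stand-in for the slide's F̄(x) are my own):

```python
def truncate(fbar, ell):
    """Keep the first ell bits of fbar's binary expansion after the point."""
    return int(fbar * 2**ell) / 2**ell  # floor(fbar * 2^ell) / 2^ell for fbar >= 0

def bits(v, ell):
    """Render the ell truncated bits as a string, e.g. 0.0110 -> '0110'."""
    return format(int(v * 2**ell), f'0{ell}b')

fbar = 0.39325  # a stand-in value whose expansion begins 0.0110...
print(bits(truncate(fbar, 4), 4))  # '0110'
```

The truncation error is always below 2^{-ℓ}, which is exactly inequality (10.21).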
Shannon/Fano/Elias Coding

If ℓ(x) = ⌈log 1/p(x)⌉ + 1, then

1/2^{ℓ(x)} = (1/2) · 2^{−⌈log 1/p(x)⌉} ≤ (1/2) · 2^{−log 1/p(x)} = p(x)/2   (10.22)
                                                               = F̄(x) − F(x−1)   (10.23)

giving

F̄(x) − ⌊F̄(x)⌋_{ℓ(x)} < 1/2^{ℓ(x)} ≤ F̄(x) − F(x−1)   (10.25)
⇒ ⌊F̄(x)⌋_{ℓ(x)} > F(x−1)   (10.26)
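The key inequality (10.22) is easy to sanity-check numerically; a small sketch of my own (the helper name sfe_length is hypothetical):

```python
from math import ceil, log2

def sfe_length(px):
    """Shannon/Fano/Elias codeword length: ceil(log2 1/p(x)) + 1 bits."""
    return ceil(log2(1 / px)) + 1

for px in [0.5, 0.25, 0.2, 0.15, 0.125]:
    ell = sfe_length(px)
    # inequality (10.22): 1/2^ell <= p(x)/2, with equality when p(x) is dyadic
    assert 2**-ell <= px / 2
    print(px, ell)
```

Equality holds exactly for the dyadic probabilities (0.5, 0.25, 0.125); for the others the bound is strict.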
Shannon/Fano/Elias Coding

This gives

F(x−1) < ⌊F̄(x)⌋_{ℓ(x)} ≤ F̄(x) < F(x)   (10.27)

and thus the ℓ(x) = ⌈log 1/p(x)⌉ + 1 bits of ⌊F̄(x)⌋_{ℓ(x)} suffice to describe x unambiguously.

Is this code prefix free? Consider codeword z₁z₂ . . . z_ℓ to correspond to the half-open interval

[0.z₁z₂ . . . z_ℓ, 0.z₁z₂ . . . z_ℓ + 1/2^ℓ)   (10.28)

where both endpoints are built from ⌊F̄(x)⌋_{ℓ(x)}; the interval has length 1/2^ℓ (all binary numbers that start with 0.z₁z₂ . . . z_ℓ).
Shannon/Fano/Elias Coding

Viewing F(x−1) < ⌊F̄(x)⌋_{ℓ(x)} ≤ F̄(x) < F(x) along with the interval of length 1/2^{ℓ(x)}:

[Figure: the points F(x−1), F̄(x), F(x) on a line, showing the possible values of the truncation ⌊F̄(x)⌋_{ℓ(x)} and the half-open interval of length 1/2^{ℓ(x)}.]

That is, ⌊F̄(x)⌋_{ℓ(x)} ∈ (F(x−1), F̄(x)] lives in the half-open interval.

But 2^{−ℓ(x)} ≤ p(x)/2 and F(x−1) < ⌊F̄(x)⌋_{ℓ(x)} ≤ F̄(x), so the intervals for distinct symbols are disjoint, even if ⌊F̄(x)⌋_{ℓ(x)} = F̄(x).

Thus, we have a prefix-free code (i.e., if ⌊F̄(x)⌋_{ℓ(x)} were a prefix of another codeword, that codeword would live in ⌊F̄(x)⌋_{ℓ(x)}'s interval, but no other codeword does).
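One way to see the disjointness concretely is to compute each symbol's half-open interval [⌊F̄(x)⌋_{ℓ(x)}, ⌊F̄(x)⌋_{ℓ(x)} + 2^{−ℓ(x)}) and check that consecutive intervals do not overlap. A sketch under the slide's setup (the helper name sfe_intervals is my own):

```python
from math import ceil, log2

def sfe_intervals(p):
    """Half-open interval [t, t + 2^-ell) per symbol, t = F̄(x) truncated to ell bits."""
    F, ivs = 0.0, []
    for px in p:
        fbar = F + px / 2                     # F̄(x) = F(x-1) + p(x)/2
        ell = ceil(log2(1 / px)) + 1
        t = int(fbar * 2**ell) / 2**ell       # truncate to ell bits
        ivs.append((t, t + 2**-ell))
        F += px
    return ivs

ivs = sfe_intervals([0.25, 0.5, 0.125, 0.125])
# consecutive intervals must not overlap: that is the prefix-free property
assert all(b0 <= a1 for (_, b0), (a1, _) in zip(ivs, ivs[1:]))
print(ivs)
```

Since a codeword is a prefix of another exactly when its interval contains the other's, disjoint intervals imply the code is prefix free.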
Shannon/Fano/Elias Coding

So ℓ(x) = ⌈log 1/p(x)⌉ + 1 suffices, and we have expected length

L = ∑_x p(x) ℓ(x) = ∑_x p(x) (⌈log 1/p(x)⌉ + 1) ≤ H(X) + 2   (10.31)

Ex: dyadic
x   p(x)    F(x)    F̄(x)     F̄(x) binary   ℓ(x)   codeword
1   0.25    0.25    0.125    0.001         3      001
2   0.5     0.75    0.5      0.10          2      10
3   0.125   0.875   0.8125   0.1101        4      1101
4   0.125   1.0     0.9375   0.1111        4      1111

Eℓ = 2.75 bits, while H = 1.75 bits.

On the other hand, Huffman achieves the entropy here: the Huffman tree (((3,4),1),2) gets Eℓ_huffman = 0.5 × 1 + 0.25 × 2 + 0.125 × 3 + 0.125 × 3 = 1.75.
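The dyadic table can be reproduced mechanically; here is a small encoder sketch (my own, not from the lecture) that takes the first ℓ(x) bits of F̄(x) as the codeword:

```python
from math import ceil, log2

def sfe_encode(p):
    """Shannon/Fano/Elias codewords: the first ceil(log2 1/p(x))+1 bits of F̄(x)."""
    F, codes = 0.0, []
    for px in p:
        fbar = F + px / 2                     # F̄(x) = F(x-1) + p(x)/2
        ell = ceil(log2(1 / px)) + 1
        codes.append(format(int(fbar * 2**ell), f'0{ell}b'))
        F += px
    return codes

print(sfe_encode([0.25, 0.5, 0.125, 0.125]))  # ['001', '10', '1101', '1111']
```

The output matches the table's codeword column; running it on the non-dyadic pmf of the next example reproduces that table as well.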
Shannon/Fano/Elias Coding

Ex: non-dyadic (repeating binary expansions, e.g., 0.01010101 . . . , are shown truncated with an ellipsis)
x   p(x)   F(x)   F̄(x)    F̄(x) binary     ℓ(x)   codeword
1   0.25   0.25   0.125   0.001           3      001
2   0.25   0.5    0.375   0.011           3      011
3   0.2    0.7    0.6     0.10011...      4      1001
4   0.15   0.85   0.775   0.1100011...    4      1100
5   0.15   1      0.925   0.1110110...    4      1110

Again, not optimal: H ≈ 2.285, Eℓ = 3.5, while Eℓ_huffman = 2.3, with Huffman tree ((1,(4,5)),(3,2)).
Competitive optimality of Shannon code

On a particular codeword, sometimes the Shannon length is better than Huffman's and sometimes not (of course, on average Huffman is better).

Q: How likely is it that any other uniquely decodable code is shorter than the Shannon code on a particular codeword? (We use the Shannon code only because it is relatively easy to analyze, unlike Huffman lengths, which are defined algorithmically and are thus harder to bound.)

Theorem 10.4.1

Let ℓ(x) be the codeword lengths of the Shannon code and ℓ′(x) be the codeword lengths of any other uniquely decodable code. Then

Pr(ℓ(X) ≥ ℓ′(X) + c) ≤ 1/2^{c−1}   (10.32)
Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F49/54(pg.186/220)
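As a numeric sanity check of Theorem 10.4.1, the sketch below (all helper names are my own) builds Shannon lengths ⌈log₂ 1/p(x)⌉ for a random distribution, uses a Huffman code as the competing uniquely decodable code ℓ′, and verifies empirically that Pr(ℓ(X) ≥ ℓ′(X) + c) ≤ 2^(−(c−1)):

```python
import heapq
import math
import random

def shannon_lengths(p):
    # Shannon code lengths: l(x) = ceil(log2(1/p(x)))
    return [math.ceil(math.log2(1.0 / px)) for px in p]

def huffman_lengths(p):
    # Standard Huffman construction via a min-heap; returns the
    # codeword lengths (depths in the Huffman tree), one per symbol.
    heap = [(px, i, [i]) for i, px in enumerate(p)]
    heapq.heapify(heap)
    depth = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # every symbol under the merged node
            depth[i] += 1          # moves one level deeper
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return depth

random.seed(0)
w = [random.random() for _ in range(20)]
p = [wi / sum(w) for wi in w]

l_sh = shannon_lengths(p)
l_hu = huffman_lengths(p)   # plays the role of l'(x)

for c in (1, 2, 3):
    prob = sum(px for px, a, b in zip(p, l_sh, l_hu) if a >= b + c)
    bound = 2.0 ** (-(c - 1))
    assert prob <= bound        # the bound (10.32)
    print(f"c={c}: Pr(l >= l' + c) = {prob:.4f} <= {bound:.4f}")
```

Running this for several seeds, the empirical probability stays well under the bound, which is loose in practice.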
Competitive optimality of Shannon code

Proof of Theorem 10.4.1.

    Pr(ℓ(X) ≥ ℓ′(X) + c)
      = Pr(⌈log 1/p(X)⌉ ≥ ℓ′(X) + c)                          (10.33)
      ≤ Pr(log 1/p(X) ≥ ℓ′(X) + c − 1)                        (10.34)
      = Pr(p(X) ≤ 2^(−ℓ′(X)−c+1))                             (10.35)
      = Σ_{x : p(x) ≤ 2^(−ℓ′(x)−c+1)} p(x)                    (10.36)
      ≤ Σ_{x : p(x) ≤ 2^(−ℓ′(x)−c+1)} 2^(−ℓ′(x)−c+1)          (10.37)
      ≤ Σ_x 2^(−ℓ′(x)) 2^(−(c−1))                             (10.38)
      ≤ 2^(−(c−1))   since Σ_x 2^(−ℓ′(x)) ≤ 1 by Kraft        (10.39)

Thus, except with probability at most 2^(−(c−1)), no code does better than the Shannon code by c or more bits.

Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F50/54
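The chain (10.33)–(10.39) can be traced numerically. In the sketch below, the toy distribution and the alternative lengths ℓ′ are illustrative choices of mine (ℓ′ satisfies Kraft with equality); each intermediate quantity is computed and checked against the next:

```python
import math

# Toy distribution and an arbitrary Kraft-satisfying alternative l'
# (these specific numbers are illustrative, not from the lecture).
p       = [0.45, 0.25, 0.15, 0.10, 0.05]
l_sh    = [math.ceil(math.log2(1 / px)) for px in p]  # Shannon lengths
l_other = [1, 2, 3, 4, 4]                             # sum 2^-l' = 1

for c in (1, 2, 3):
    # (10.33): Pr(ceil(log 1/p) >= l' + c)
    step33 = sum(px for px, l, lp in zip(p, l_sh, l_other)
                 if l >= lp + c)
    # (10.35)/(10.36): Pr(p(X) <= 2^(-l'-c+1))
    step35 = sum(px for px, lp in zip(p, l_other)
                 if px <= 2.0 ** (-lp - c + 1))
    # (10.37): replace each p(x) in the restricted sum by its bound
    step37 = sum(2.0 ** (-lp - c + 1) for px, lp in zip(p, l_other)
                 if px <= 2.0 ** (-lp - c + 1))
    # (10.38): extend the sum to all x and factor out 2^-(c-1)
    step38 = sum(2.0 ** -lp for lp in l_other) * 2.0 ** -(c - 1)
    # (10.39): Kraft caps the whole chain at 2^-(c-1)
    assert step33 <= step35 <= step37 <= step38 <= 2.0 ** -(c - 1)
```

Each inequality in the chain corresponds to one comparison in the assert, so a failure would pinpoint which proof step broke.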
Competitive optimality of Shannon code

But we would also like the Shannon code lengths to be shorter more often (probabilistically), i.e., that

    ℓ(x) < ℓ′(x) more often than ℓ(x) > ℓ′(x)    (10.40)

where ℓ(x) are the Shannon lengths and ℓ′(x) are the lengths of any other (prefix) code.

Q: Can this be true for all distributions? A: No, since Huffman is better.

Shannon coding is optimal for dyadic distributions, since in that case log 1/p(x) is an integer. In fact, we have

Theorem 10.4.2

For dyadic p(x), with ℓ(x) = log 1/p(x) and ℓ′(x) the lengths of any other prefix code,

    Pr(ℓ(X) < ℓ′(X)) ≥ Pr(ℓ(X) > ℓ′(X))    (10.41)

with equality iff ℓ′(x) = ℓ(x) ∀x.

Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F51/54
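Theorem 10.4.2 can be checked exhaustively on a small dyadic source. The sketch below (the distribution and length cap are my own choices) enumerates every integer length assignment ℓ′ with lengths 1..5 that satisfies Kraft, hence is realizable as a prefix code, and verifies both the inequality (10.41) and the equality condition:

```python
import itertools
import math

# Dyadic source: every p(x) is a power of 2, so the Shannon lengths
# log2(1/p(x)) are integers (here ℓ = (1, 2, 3, 3)).
p = [0.5, 0.25, 0.125, 0.125]
l_sh = [int(math.log2(1 / px)) for px in p]

# Enumerate every competing length assignment l' in {1,...,5}^4 that
# satisfies Kraft (powers of two are exact in floats, so the
# comparisons below are exact).
for l_other in itertools.product(range(1, 6), repeat=len(p)):
    if sum(2.0 ** -l for l in l_other) > 1.0:
        continue  # not realizable as a prefix code
    p_lt = sum(px for px, a, b in zip(p, l_sh, l_other) if a < b)
    p_gt = sum(px for px, a, b in zip(p, l_sh, l_other) if a > b)
    assert p_lt >= p_gt                               # (10.41)
    assert (p_lt == p_gt) == (list(l_other) == l_sh)  # equality iff l' = l
```

So on this example, no prefix code beats the Shannon lengths more often than it loses to them, and ties occur only for the Shannon lengths themselves.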
Competitive optimality of Shannon code

. . . proof of Theorem 10.4.2.

Let

    sign(t) = { 1 if t > 0;  0 if t = 0;  −1 if t < 0 }.

Then sign(t) ≤ 2^t − 1 for t = 0, ±1, ±2, ±3, . . . . This gives:

    Pr(ℓ′(X) < ℓ(X)) − Pr(ℓ′(X) > ℓ(X))                          (10.42)
      = Σ_{x : ℓ′(x) < ℓ(x)} p(x) − Σ_{x : ℓ′(x) > ℓ(x)} p(x)    (10.43)
      = Σ_x p(x) sign(ℓ(x) − ℓ′(x))                              (10.44)
      = E[sign(ℓ(X) − ℓ′(X))]                                    (10.45)
      ≤ Σ_x p(x) (2^(ℓ(x)−ℓ′(x)) − 1)                            (10.46)

Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F52/54
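The pointwise bound sign(t) ≤ 2^t − 1 on the integers, and the step from (10.45) to (10.46) that it justifies, are easy to spot-check; the alternative lengths in this sketch are an arbitrary Kraft-satisfying choice of mine:

```python
def sign(t):
    # sign(t) = 1, 0, -1 for t > 0, t = 0, t < 0
    return (t > 0) - (t < 0)

# sign(t) <= 2^t - 1 for every integer t: for t >= 1, 2^t - 1 >= 1;
# for t = 0 both sides are 0; for t <= -1, 2^t - 1 lies in (-1, -1/2].
for t in range(-10, 11):
    assert sign(t) <= 2.0 ** t - 1

# One instance of (10.45) <= (10.46) on a dyadic source, with an
# arbitrary Kraft-satisfying alternative l' (sum 2^-l' = 7/8).
p       = [0.5, 0.25, 0.125, 0.125]
l_sh    = [1, 2, 3, 3]
l_other = [2, 2, 2, 3]
lhs = sum(px * sign(a - b) for px, a, b in zip(p, l_sh, l_other))
rhs = sum(px * (2.0 ** (a - b) - 1) for px, a, b in zip(p, l_sh, l_other))
assert lhs <= rhs <= 0   # E[sign(l - l')] <= sum p (2^(l-l') - 1) <= 0
```

The final `<= 0` is exactly what the dyadic step (10.47)–(10.51) on the next slide establishes in general.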
Competitive optimality of Shannon code

. . . proof of Theorem 10.4.2.

But since p(x) is dyadic, p(x) = 2^(−ℓ(x)), we get

      = Σ_x 2^(−ℓ(x)) (2^(ℓ(x)−ℓ′(x)) − 1)      (10.47)
      = Σ_x 2^(−ℓ′(x)) − Σ_x 2^(−ℓ(x))          (10.48)
      = Σ_x 2^(−ℓ′(x)) − 1                      (10.49)
      ≤ 1 − 1   since ℓ′(x) satisfies Kraft     (10.50)
      = 0                                       (10.51)

Thus, Pr(ℓ(X) < ℓ′(X)) ≥ Pr(ℓ(X) > ℓ′(X)), as desired.

Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F53/54
Next time

Shannon games and stream codes

Prof. Jeff Bilmes EE514a/Fall 2019/Info. Theory I – Lecture 10 - Oct 30th, 2019 L10 F54/54