Information Theory:
Source Coding
Dr. Satoshi Nakamura
Honorarprofessor, Karlsruhe Institute of Technology, Germany
Professor, Graduate School of Information Science, Nara Institute of Science and Technology, Japan
Invited Advisor, National Institute of Information and Communications Technology, Japan
Fellow, Advanced Telecommunication Research Institute International, Kyoto, Japan
Spoken Language
Translation
Research Laboratories
2012/10/24 Prof. Satoshi Nakamura 2
NAIST?
Keihanna Science City (NAIST, ATR, NICT, NTT, NEC…)
Fukushima Nuclear Power Plant: >700 km away
Acknowledgements and appreciation for all the help and support.
2012/10/24 Prof. Satoshi Nakamura 3
Daimler Special Purpose Car
(Total 4M Euro) 43 rescue members and 4 rescue dogs
Example: Deutsche Bank $2,640,000
and more.
Deeply thank you !
Congratulations, Shinya Yamanaka
on Nobel Prize in Physiology or Medicine
Education
He received his M.D. at Kobe University in 1987 and his Ph.D. at Osaka City University Graduate School in 1993.
Professional career Between 1987 and 1989, Yamanaka was a Resident in orthopedic surgery at the National
Osaka Hospital.
During 1993–1995, he was a Postdoctoral Fellow at the Gladstone Institute of Cardiovascular Disease, which is affiliated with the University of California, San Francisco.
During 1995–1996, he was a staff research investigator at the UCSF-affiliated Gladstone Institute of Cardiovascular Disease.
Between 1996 and 1999, he was an assistant professor at Osaka City University Medical School.
During 1999–2003, he was an associate professor at the Nara Institute of Science and Technology. During 2003–2005, he was a professor at the Nara Institute of Science and Technology. Between 2004 and 2010, Yamanaka was a professor at the Institute for Frontier Medical Sciences.[9]
Currently Yamanaka is the director and a professor at the Center for iPS Cell Research and Application in Kyoto University, Japan.
In 2006, he and his team generated Induced Pluripotent Stem Cells – pluripotent stem cells from adult mouse fibroblasts. In 2007, he and his team were able to generate Induced Pluripotent Stem Cells from human adult fibroblasts.[1][2][3]
24/10/2012 Satoshi Nakamura, NAIST, all rights reserved. 4
2012/10/24 Prof. Satoshi Nakamura 5
NAIST?
Keihanna Science City (NAIST, ATR, NICT, NTT, NEC…)
Fukushima Nuclear Power Plant: >700 km away
About NAIST?
Nara Institute of Science and Technology, Japan, established in 1991.
Japanese national university for basic research and higher education.
1st-ranked research evaluation among Japanese universities in #papers and #grants per faculty.
Three graduate schools (no undergraduate school):
Information Science
Biological Science: Prof. Yamanaka, iPS cells
Material Science
Sister school: JAIST, Japan Advanced Institute of Science and Technology
Graduate School of Information Science
20 laboratories
10 collaborative laboratories
(ATR, AIST, NEC, Panasonic, NTT, NICT, Fujitsu, Docomo, OMRON)
2012/10/24 Prof. Satoshi Nakamura 6
NAIST Ranking
Overall: ranked 1st
Highest-evaluated university in Japan, based on data in Thomson Reuters' "Essential Science Indicators" and published in "University Ranking 2010" by the leading Japanese newspaper "Asahi Shimbun"
In the top 5%: A+
Three research areas in the Graduate School of Information Science received top scores in a survey conducted by the Ministry of Economy, Trade and Industry
Ranked 1st in "Research" and "Education" among all national universities in Japan, published in the weekly magazine "Toyo Keizai".
Number of Grants-in-Aid for scientific research: ranked 1st per faculty member*
2012/10/24 Prof. Satoshi Nakamura 7
2012/10/24 Prof. Satoshi Nakamura 8
About NICT?
National Institute of Information and Communications Technology (NICT), established in 2004.
History (from the organization chart):
1952: Radio Research Lab (national laboratory)
1988: Communications Research Lab
2001: Communications Research Laboratory (incorporated administrative agency)
1979: Telecommunications and Broadcasting Satellite Organization (certified institution)
1992: Telecommunications Advancement Organization
2004: Merged into the National Institute of Information and Communications Technology
2012/10/24 9 KIIT & CDAC Noida
Locations
2012/10/24 10 Prof. Satoshi Nakamura
NICT Keihanna Research Laboratories
(Opened) 1 April 2008
(Location) Kansai Science City, between Kyoto, Nara, and Osaka; a part of ATR
(Number of staff) about 160
2012/10/24 Prof. Satoshi Nakamura 11
Overcome the barriers in the ICT society:
(I) Barriers of language: R&D on multi-lingual technology
(II) Barriers of ability: spoken language and nonverbal interaction technology
(III) Barriers of information quality: information analysis with information credibility criteria
(IV) Barriers between the real and the cyber world: natural, real-time connections between the two worlds
(V) Barriers of distance: ultra-realistic communications to provide the feeling of "being there" via all five senses, etc.
2012/10/24 12 Prof. Satoshi Nakamura
About ATR?
ATR: Advanced Telecommunication Research Institute International
ATR was founded in March 1986.
2012/10/24 Prof. Satoshi Nakamura 13
ATR Laboratories
Brain Information Communication Research Labs Group
Computational Neuroscience Lab.
Cognitive Mechanisms Lab.
Neural Information Analysis Lab.
Social Media Research Labs. Group
Intelligent Robotics and Communication Lab.
Hiroshi Ishiguro Lab.
Adaptive Communications Research Lab.
Wave Engineering Lab.
Spoken Language Communication Research Labs.
Media Information Science Labs
2012/10/24 Prof. Satoshi Nakamura 14
History of Speech Translation Research
10/24/2012 15
(Timeline in the figure: 1986, 1992, 1999, 2006, 2008, 2010; organizations/projects: ATR, NICT/ATR, NICT MASTAR, MIC & NICT & CSTP PJ; consortia: C-STAR (ATR, CMU, UKA, CLIPS, IRST, ETRI, CAS), A-STAR (→ U-STAR).)
Read Speech: syntactically correct, clear utterances, limited domain ("Conference Registration")
Daily Conversation: standard expressions, unclear utterances, limited domain ("Hotel Reservation")
Wider and Real Domain: wider and real domain ("International Travel"), realistic expressions, noisy speech, J-E and J-C speech translation
2012/10/24 Prof. Satoshi Nakamura 17
Source Coding
Contents of the lecture
Information Theory: Source Coding + Channel Coding + Encryption
Goal: Understanding of Source Coding by theory and application
Contents: Amount of information, modeling of information source
Zero-memory source, Markov source, hidden Markov source
Source coding theorem, compact codes
Universal coding, rate distortion theory
Source coding of analog signal, vector quantization
Modeling and coding of language and speech
2012/10/24 Prof. Satoshi Nakamura 18
Text book and references
Norman Abramson: “Information Theory and Coding”, McGraw-Hill,
1963
A.Gersho, R.M.Gray: “Vector Quantization and Signal Compression”,
Kluwer Academic Publisher
T.C.Bell, J.G.Cleary, I.H.Witten: “Text Compression”, Prentice Hall
2012/10/24 Prof. Satoshi Nakamura 19
Role of information theory
Information Theory: a measure for the amount of information and modeling of information sources.
Claude Shannon: "A Mathematical Theory of Communication" (1948), Bell System Technical Journal.
"Shannon entitles his theory a mathematical theory of communication: a theory of the carriers of information."
"A theory about the carriers of information (symbols), and not about the information itself."
"The semantic aspects of communication are irrelevant to the engineering problems."
2012/10/24 Prof. Satoshi Nakamura 20
Transmission model
Efficient usage of the transmission channel
Digital channel: reduction of transmission codes
Analog channel: reduction of transmission time and frequency bands
Improve reliability
Digital channel: reduction of transmission errors
Analog channel: improve the signal-to-noise ratio
(Block diagram:) Information Source --(message)--> Transmitter (Coder) --(code)--> Transmission Channel (with Noise Source) --(code)--> Receiver (Decoder) --(message)--> Decoded Information
2012/10/24 Prof. Satoshi Nakamura 21
Separate modeling
Separate optimization: Source coding + Channel coding
(Block diagram:) Information Source -> Source Coding -> Channel Coding -> Transmission Channel (with Noise Source) -> Channel Decoding -> Source Decoding -> Decoded Information
(Coding = Source Coding + Channel Coding; Decoding = Channel Decoding + Source Decoding)
2012/10/24 Prof. Satoshi Nakamura 22
Amount of information
Amount of information: defined by the statistical properties of an overall set, not by individual events.
Statistical structure, statistically definable sets => memoryless source, Markov source
Non-statistical sets and sets of unknown structure => universal coding:
Lempel-Ziv coding
Arithmetic coding
2012/10/24 Prof. Satoshi Nakamura 23
Hierarchical model of codes
(Diagram: both the information source/transmitter side and the receiver side are modeled at the levels signal; model; structure, symbol; concept, meaning, intention.)
Corresponding coding methods, from the lowest to the highest level: waveform coding, parametric coding, recognition-based coding, intelligent coding.
2012/10/24 Prof. Satoshi Nakamura 24
What is information?
Messages which reduce uncertainty.
Measurement of body temperature: prediction of whether he has caught a cold becomes possible.
Weather forecast: prediction of tomorrow's weather becomes possible.
Information theory:
Measurement of information
Higher efficiency and reliability of transmission
2012/10/24 Prof. Satoshi Nakamura 25
Properties of information
Non-negativity: the amount of information is non-negative. If the probability of an event equals 0 or 1, its amount of information is 0.
Events which surely happen, or surely do not happen, carry no additional information; their amount of information is 0.
Learning that an event with probability 0<p<1 has occurred brings a certain amount of information, since it reduces ambiguity.
Monotonic decrease: the smaller the probability of an event, the larger its amount of information.
The amount of information is larger if the event is unexpected.
2012/10/24 Prof. Satoshi Nakamura 26
Amount of information: Additivity
How much is the amount of information, I(pq), of a joint event of two independent events with probabilities p and q?
where
I(p): amount of information of an event with probability p
I(q): amount of information of an event with probability q
I(pq) = I(p) + I(q)
means that the amount of information is the same whether it is given all at once or piece by piece.
2012/10/24 Prof. Satoshi Nakamura 27
Amount of information
The only functional form which satisfies the above three properties is
I(p) = -log(p).
Now, I(p) is defined as the amount of information.
Units of the amount of information:
[bit]: I(p) = -log_2(p)
[nat]: I(p) = -log_e(p)
[dit] or [Hartley]: I(p) = -log_10(p)
If p = 0.5, I(p) is maximal -> only valid for the average case!
The amount of information in [bit] represents the average number of [yes/no] questions needed to know which event has happened.
2012/10/24 Prof. Satoshi Nakamura 28
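As a small illustration of these units, here is a Python sketch (not part of the original slides; the function name is illustrative only) that evaluates I(p) = -log(p) in bits, nats, and dits:

```python
import math

def self_information(p, base=2.0):
    """Amount of information I(p) = -log_base(p) for an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

p = 0.5
print(self_information(p, 2))       # 1.0 bit
print(self_information(p, math.e))  # ~0.693 nat
print(self_information(p, 10))      # ~0.301 dit (Hartley)
```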
What is coding?
Binary coding of the decimal digits.
Message Symbols:
0,1,…,9
Code word:
0000,0001,0010,…
Backward decoding is straightforward in this example.
Decimal number / Binary representation:
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001
2012/10/24 Prof. Satoshi Nakamura 29
What is coding ?
A binary code.
Backward decoding is NOT straightforward:
111001 can be generated both by "s4s3" and by "s4s1s2".
Message symbols / Binary representation:
s1 0
s2 01
s3 001
s4 111
2012/10/24 Prof. Satoshi Nakamura 30
What is coding ?
Another binary code
Use "0" as a separator.
Backward decoding is unique and straightforward.
Message symbols / Binary representation:
s1 0
s2 10
s3 110
s4 1110
2012/10/24 Prof. Satoshi Nakamura 31
One problem in coding
Weather in San Francisco
Code alpha: two binary digits are used.
"Sunny, Foggy, Foggy, Cloudy" comes to "00111101".
Two binary digits at a time are needed for backward decoding.
Message symbols / probabilities: Sunny 1/4, Cloudy 1/4, Rainy 1/4, Foggy 1/4
Message symbols / codes: Sunny 00, Cloudy 01, Rainy 10, Foggy 11
2012/10/24 Prof. Satoshi Nakamura 32
One problem in coding
Weather in Los Angeles
Code beta: the probabilities are non-uniform, and binary code words of different lengths are used.
"Sunny, Smoggy, Smoggy, Cloudy" comes to "1000110".
Waiting for a 0 is necessary for backward decoding.
Average code length = 1 7/8 < 2 binits.
Message symbols / probabilities: Sunny 1/4, Cloudy 1/8, Rainy 1/8, Smoggy 1/2
Message symbols / codes: Sunny 10, Cloudy 110, Rainy 1110, Smoggy 0
L = 2 Pr(Sunny) + 3 Pr(Cloudy) + 4 Pr(Rainy) + 1 Pr(Smoggy)
  = 2 (1/4) + 3 (1/8) + 4 (1/8) + 1 (1/2) = 1 7/8 binits/message
2012/10/24 Prof. Satoshi Nakamura 33
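A quick Python check of the average code length computed above (a sketch, not from the slides; the symbol names and probabilities follow the table):

```python
# Code beta for Los Angeles weather: symbol -> (probability, code word)
code_beta = {
    "Sunny":  (1/4, "10"),
    "Cloudy": (1/8, "110"),
    "Rainy":  (1/8, "1110"),
    "Smoggy": (1/2, "0"),
}

# Average code length L = sum_i P_i * l_i
L = sum(p * len(word) for p, word in code_beta.values())
print(L)  # 1.875 = 1 7/8 binits/message, less than the 2 binits of code alpha
```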
Amount of information
TV: black, white, and gray dots, with roughly 500 rows and 600 columns.
Namely, 500x600 = 300,000 dots, each of which may take on any one of 10 distinguishable brightness levels (p = (1/10)^300,000).
Radio: an announcer with a 10,000-word vocabulary selects 1,000 words randomly (p = (1/10,000)^1,000).
A TV picture is worth more than 1,000 words:
I(E) = 300,000 log 10 ≈ 10^6 bits
I(E) = 1,000 log 10,000 ≈ 1.3 x 10^4 bits
2012/10/24 Prof. Satoshi Nakamura 34
Average amount of information
The amount of information is defined by
I(p) = -log(p).
The average amount of information of the information source A is defined by
H(A) = Σ_{i=1}^{n} P(e_i) I(e_i) = -Σ_{i=1}^{n} P(e_i) log_2 P(e_i),
and H(A) satisfies
0 ≤ H(A) ≤ log_2 n (bit).
Entropy = average amount of information.
2012/10/24 Prof. Satoshi Nakamura 35
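A minimal Python sketch of this definition (not from the slides) that also illustrates the bounds 0 ≤ H(A) ≤ log_2 n:

```python
import math

def entropy(probs, base=2.0):
    """Average amount of information H(A) = -sum_i P(e_i) log P(e_i)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.25] * 4))                 # 2.0 bits = log2(4), the upper bound
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([1.0]))                      # 0.0 bits: a certain event carries no information
```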
Entropy
Entropy represents ambiguity of the information source.
When one message ei is received, ambiguity of the information
H(A) is decreased.
This amount of decrease is equivalent to the amount of
information of the message ei.
2012/10/24 Prof. Satoshi Nakamura 36
Properties of Entropy
Now we have source alphabet {0,1}, and
P(0) = ω, P(1) = 1 - ω.
The entropy function H(ω) = -ω log ω - (1-ω) log(1-ω) is plotted in the figure.
2012/10/24 Prof. Satoshi Nakamura 37
Amount of information for multiple events
The amount of information for multiple events can be defined by the decrease of the entropy. Now let P(a_i) be the prior probability of a message a_i, and P(a_i|b_j) the posterior probability of a_i given a message b_j. The prior entropy of the information source A is defined by
H(A) = Σ_A P(a_i) log [1/P(a_i)],
and the posterior entropy of the information source A given a message b_j is defined by
H(A|b_j) = Σ_A P(a_i|b_j) log [1/P(a_i|b_j)].
Therefore, the amount of information of the multiple events is
H(A) - H(A|b_j).
2012/10/24 Prof. Satoshi Nakamura 38
Conditional Entropy
Conditional entropy is the expectation of H(A|b_j):
H(A|B) = Σ_{j=1}^{m} P(b_j) H(A|b_j) = Σ_{i=1}^{n} Σ_{j=1}^{m} P(a_i, b_j) log [1/P(a_i|b_j)].
The following inequalities hold:
0 ≤ H(A|B) ≤ H(A), and H(A,B) ≤ H(A) + H(B).
2012/10/24 Prof. Satoshi Nakamura 39
Mutual Information
Amount of information of multiple events: H(A) - H(A|b_j).
What is the amount of information if we know the whole information source B, not just a single message b_j of B?
I(A;B) = H(A) - H(A|B)
       = Σ_A P(a_i) log [1/P(a_i)] - Σ_{A,B} P(a_i, b_j) log [1/P(a_i|b_j)]
       = Σ_{A,B} P(a_i, b_j) log [ P(a_i, b_j) / (P(a_i) P(b_j)) ].
I(A;B) is called the "Mutual Information".
2012/10/24 Prof. Satoshi Nakamura 40
Joint Entropy
The entropy of the joint information source A and B is defined by
H(A,B) = Σ_{i=1}^{n} Σ_{j=1}^{m} P(a_i, b_j) log [1/P(a_i, b_j)].
2012/10/24 Prof. Satoshi Nakamura 41
Mutual Information
Mutual information I(A;B) satisfies
0 ≤ I(A;B) ≤ H(A),
I(A;B) = I(B;A) = H(B) - H(B|A) = H(A) + H(B) - H(A,B).
2012/10/24 Prof. Satoshi Nakamura 42
Mutual Information (example)
A \ B: Play (b1) / Not Play (b2) / P(a_i)
Win (a1): 0.42 (0.6) / 0.28 (0.93) / 0.7
Lose (a2): 0.28 (0.4) / 0.02 (0.07) / 0.3
P(b_j): 0.7 / 0.3 / 1.0
(Entries are joint probabilities P(a_i, b_j), with the conditional probabilities P(a_i|b_j) in parentheses.)
The initial entropy of A, H(A), is
H(A) = -0.7 log 0.7 - 0.3 log 0.3 = 0.88.
The winning rate after we know he plays a game becomes 0.6. The entropy H(A|b_1) is
H(A|b_1) = -0.6 log 0.6 - 0.4 log 0.4 = 0.97.
Entropy increases by knowing the information b_1.
If we know he doesn't play, the winning rate is 0.93. This time, entropy decreases:
H(A|b_2) = -0.93 log 0.93 - 0.07 log 0.07 = 0.37.
Now the mutual information is
I(A;B) = H(A) - H(A|B) = 0.88 - (0.7 x 0.97 + 0.3 x 0.37) = 0.88 - 0.79 = 0.09.
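The same numbers can be checked directly from the joint table with a short Python sketch (not from the slides; the labels are illustrative):

```python
import math

log2 = lambda x: math.log(x, 2)

# Joint probabilities P(a_i, b_j) from the example table
P = {("win", "play"): 0.42, ("win", "not_play"): 0.28,
     ("lose", "play"): 0.28, ("lose", "not_play"): 0.02}

Pa = {a: sum(p for (x, _), p in P.items() if x == a) for a in ("win", "lose")}
Pb = {b: sum(p for (_, y), p in P.items() if y == b) for b in ("play", "not_play")}

H_A = -sum(p * log2(p) for p in Pa.values())                        # ~0.88
H_A_given_B = -sum(p * log2(p / Pb[b]) for (a, b), p in P.items())  # ~0.79
I_AB = sum(p * log2(p / (Pa[a] * Pb[b])) for (a, b), p in P.items())

print(H_A, H_A_given_B, I_AB)  # I(A;B) = H(A) - H(A|B) ~ 0.09
```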
Goal of 1st day
2012/10/24 Prof. Satoshi Nakamura 43
ROLE OF SOCIAL MEDIA
2012/10/24 Prof. Satoshi Nakamura 44
Credibility Increased information Source
2012/10/24 Prof. Satoshi Nakamura 45
(Bar chart over information sources: NHK, portal sites, social media, academia, government, commercial TV, newspapers.)
Credibility Decreased Information Source
2012/10/24 Prof. Satoshi Nakamura 46
(Bar chart over the same information sources: NHK, portal sites, social media, academia, government, commercial TV, newspapers.)
Trends of Social Media Users
2012/10/24 Prof. Satoshi Nakamura 47
(Chart: number of users in thousands, weekly, from March through March '11.)
Weekly Trends of #users
2012/10/24 Prof. Satoshi Nakamura 48
2012/10/24 Prof. Satoshi Nakamura 49
Models for information sources
(Classification tree in the figure:) Information sources are classified into zero-memory information sources and information sources with memory; sources with memory are further divided into ergodic and non-ergodic information sources, and the classes are divided into stationary and non-stationary information sources.
2012/10/24 Prof. Satoshi Nakamura 50
Models for information sources
Zero-memory information source:
The source alphabets in S = {s1, s2, s3, …, sq} are mutually independent and independent of the alphabets in the history. A zero-memory information source is completely described by the source alphabet S and the probabilities P(s1), P(s2), .., P(sq).
Markov information source:
The probability of the source alphabet s_i depends on the previous m alphabets. If m = 1, it is called a 1st-order Markov model. The probabilities of the alphabets are described by
P(s_i | s_{j1}, s_{j2}, .., s_{jm}), i = 1,2,..,q; j_k = 1,2,..,q.
2012/10/24 Prof. Satoshi Nakamura 51
Models of information source
Stationary information source:
Probabilities of the specific source alphabets are invariant to time shift.
Ergodic information source:
Observed source alphabet sequence becomes same as a representative one
with probability 1, when we observe the source alphabet sequence for
long time.
2012/10/24 Prof. Satoshi Nakamura 52
Zero-memory information source
Zero-memory information source: successive symbols emitted from the source are statistically independent; the source is described by the source alphabet S and the probabilities with which the symbols occur:
P(s_1), P(s_2), …, P(s_q).
The amount of information of one symbol s_i is
I(s_i) = log [1/P(s_i)] bits.
The average amount of information for the information source S, i.e. the entropy H(S) of the zero-memory source, is
H(S) = Σ_S P(s_i) I(s_i) = Σ_S P(s_i) log [1/P(s_i)].
Source: s_i, s_j, …
2012/10/24 Prof. Satoshi Nakamura 53
Examples
Source S: S = {s_1, s_2, s_3}, P(s_1) = 1/2, P(s_2) = P(s_3) = 1/4.
H(S) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 3/2 bits.
If I(s_i) is measured in r-ary units, we have
H_r(S) = Σ_S P(s_i) log_r [1/P(s_i)]  (r-ary units),
H_r(S) = H(S) / log r.
2012/10/24 Prof. Satoshi Nakamura 54
Some properties of Entropy
The line y = x - 1 lies above y = ln x:
ln x ≤ x - 1, with equality if, and only if, x = 1.
Now take x_i ≥ 0 and y_i ≥ 0 for all i and j, with Σ_{i=1}^{q} x_i = Σ_{j=1}^{q} y_j = 1. Then
Σ_{i=1}^{q} x_i log (y_i / x_i) = (1 / ln 2) Σ_{i=1}^{q} x_i ln (y_i / x_i).
2012/10/24 Prof. Satoshi Nakamura 55
Some properties of Entropy
Σ_{i=1}^{q} x_i log (y_i / x_i) = (1 / ln 2) Σ_{i=1}^{q} x_i ln (y_i / x_i)
    ≤ (1 / ln 2) Σ_{i=1}^{q} x_i (y_i / x_i - 1) = (1 / ln 2) (Σ_{i=1}^{q} y_i - Σ_{i=1}^{q} x_i) = 0,
or
Σ_{i=1}^{q} x_i log (1 / x_i) ≤ Σ_{i=1}^{q} x_i log (1 / y_i),
with equality if, and only if, x_i = y_i for all i.
This is called Jensen's inequality.
2012/10/24 Prof. Satoshi Nakamura 56
Some properties of Entropy
H(S) = Σ_{i=1}^{q} P_i log (1 / P_i).
log q - H(S) = Σ_{i=1}^{q} P_i log q - Σ_{i=1}^{q} P_i log (1 / P_i) = Σ_{i=1}^{q} P_i log (q P_i)
             = (log e) Σ_{i=1}^{q} P_i ln (q P_i)
             ≥ (log e) Σ_{i=1}^{q} P_i (1 - 1 / (q P_i))
             = (log e) (Σ_{i=1}^{q} P_i - Σ_{i=1}^{q} 1/q) = 0.
H(S) is always less than or equal to log q. Equality holds if, and only if, P_i = 1/q for all i.
2012/10/24 Prof. Satoshi Nakamura 57
Properties of Entropy
Again, we have source alphabet {0,1}, and P(0) = ω, P(1) = 1 - ω.
The entropy function H(ω) = -ω log ω - (1-ω) log(1-ω) is plotted in the figure.
2012/10/24 Prof. Satoshi Nakamura 58
Extensions of a zero-memory source
Extension to blocks of symbols. For instance, for two binary source alphabets the blocks are 00, 01, 10, and 11.
Definition: Let S be a zero-memory information source with source alphabet {s1,s2,…,sq} and with the probability of s_i equal to P_i. Then the n-th extension of S, S^n, is a zero-memory source with q^n symbols σ_1, σ_2, …, σ_{q^n}. Each σ_i corresponds to some sequence of n of the s_i. P(σ_i), the probability of σ_i, is just the probability of the corresponding sequence of s_i's. That is, if σ_i corresponds to (s_{i1}, s_{i2}, .., s_{in}), then P(σ_i) = P_{i1} P_{i2} … P_{in}.
2012/10/24 Prof. Satoshi Nakamura 59
Extension of zero-memory source
Entropy:
H(S^n) = Σ_{S^n} P(σ_i) log [1 / P(σ_i)].
Since P(σ_i) = P_{i1} P_{i2} … P_{in},
log [1 / P(σ_i)] = log [1 / P_{i1}] + log [1 / P_{i2}] + … + log [1 / P_{in}],
so
H(S^n) = Σ_{S^n} P(σ_i) log [1 / P_{i1}] + Σ_{S^n} P(σ_i) log [1 / P_{i2}] + … + Σ_{S^n} P(σ_i) log [1 / P_{in}].
Each term reduces to a sum over S alone, since summing over the remaining symbols gives 1; e.g.
Σ_{S^n} P(σ_i) log [1 / P_{i1}] = Σ_{i1=1}^{q} P_{i1} log [1 / P_{i1}] = H(S).
Therefore
H(S^n) = n H(S).
2012/10/24 Prof. Satoshi Nakamura 60
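The identity H(S^n) = n H(S) can be verified numerically with a short Python sketch (not from the slides; it reuses the earlier three-symbol example source):

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Zero-memory source S with P(s1)=1/2, P(s2)=P(s3)=1/4 (the earlier example)
P = [0.5, 0.25, 0.25]
n = 3

# n-th extension: q^n block symbols, P(sigma) = product of the component probabilities
P_ext = [math.prod(block) for block in product(P, repeat=n)]

print(entropy(P))      # H(S)   = 1.5 bits
print(entropy(P_ext))  # H(S^n) = 4.5 bits
print(n * entropy(P))  # n H(S) = 4.5 bits, as the theorem states
```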
Markov Information Source
A more general type of information source with q symbols than the zero-memory source is one in which the occurrence of a source symbol s_i may depend on a finite number m of preceding symbols. Such a source, an mth-order Markov source, is defined by giving the source alphabet S and the set of conditional probabilities
P(s_i | s_{j1}, s_{j2}, …, s_{jm}) for i = 1,2,…,q; j_k = 1,2,…,q.
State: the probability of emitting a given symbol is known if we know the m preceding symbols. We call the m preceding symbols the state of the mth-order Markov source.
2012/10/24 Prof. Satoshi Nakamura 61
Markov information source
(Figures: state diagrams of an ergodic Markov source, a non-ergodic Markov source, and a non-stationary source.)
2012/10/24 Prof. Satoshi Nakamura 62
Entropy for Markov source
If we are in the state specified by (s_{j1}, s_{j2}, …, s_{jm}), then the conditional probability of receiving symbol s_i is P(s_i | s_{j1}, s_{j2}, .., s_{jm}). The information we obtain if s_i occurs while we are in state (s_{j1}, s_{j2}, .., s_{jm}) is
I(s_i | s_{j1}, …, s_{jm}) = log [1 / P(s_i | s_{j1}, …, s_{jm})].
The average amount of information per symbol while we are in state (s_{j1}, …, s_{jm}) is given by
H(S | s_{j1}, …, s_{jm}) = Σ_S P(s_i | s_{j1}, …, s_{jm}) I(s_i | s_{j1}, …, s_{jm}).
If we average this quantity over the q^m possible states, weighting each state by its steady-state probability P(s_{j1}, …, s_{jm}), we obtain the entropy of the mth-order Markov source S:
H(S) = Σ_{S^m} P(s_{j1}, …, s_{jm}) H(S | s_{j1}, …, s_{jm}).
2012/10/24 Prof. Satoshi Nakamura 63
Entropy of Markov source
The entropy of an mth-order Markov source is given by
H(S) = Σ_{S^m} P(s_{j1}, …, s_{jm}) Σ_S P(s_i | s_{j1}, …, s_{jm}) log [1 / P(s_i | s_{j1}, …, s_{jm})]
     = Σ_{S^{m+1}} P(s_{j1}, …, s_{jm}, s_i) log [1 / P(s_i | s_{j1}, …, s_{jm})].
If S is a zero-memory source,
P(s_i | s_{j1}, …, s_{jm}) = P(s_i).
2012/10/24 Prof. Satoshi Nakamura 64
Example
Probabilities for the Markov source
(Probability tables from the figure; the listed values are 4/14, 1/14, and 5/14.)
2012/10/24 Prof. Satoshi Nakamura 65
Adjoint source
Definition: Let S = {s1,s2,…,sq} be the source alphabet of an mth-order Markov source, and let P1,P2,…,Pq be the first-order symbol probabilities of the source. The adjoint source to S, written S̄, is the zero-memory information source with source alphabet identical with that of S, and with symbol probabilities P1,P2,…,Pq.
Here the following relationship holds:
H(S) ≤ H(S̄).
2012/10/24 Prof. Satoshi Nakamura 66
Adjoint source
Let S be a 1st-order Markov source. By applying Jensen's inequality (Σ_i x_i log(1/x_i) ≤ Σ_i x_i log(1/y_i)) with x = P(s_i, s_j) and y = P(s_i) P(s_j),
Σ_{S^2} P(s_i, s_j) log [ P(s_i) P(s_j) / P(s_i, s_j) ] ≤ 0,
that is,
Σ_{S^2} P(s_i, s_j) log [ P(s_j) / P(s_j | s_i) ] ≤ 0.
Therefore
H(S) = Σ_{S^2} P(s_i, s_j) log [1 / P(s_j | s_i)]
     ≤ Σ_{S^2} P(s_i, s_j) log [1 / P(s_j)]
     = Σ_{s_j} P(s_j) log [1 / P(s_j)]
     = H(S̄).
2012/10/24 Prof. Satoshi Nakamura 67
Extension of a Markov source
Definition: Let S be an mth-order Markov information source with source alphabet (s1,s2,…,sq) and conditional symbol probabilities P(s_i | s_{j1}, s_{j2}, …, s_{jm}). Then the nth extension of S, S^n, is a μth-order Markov source with q^n symbols σ_1, σ_2, …, σ_{q^n}. Each σ_i corresponds to some sequence of n of the s_i, and the conditional symbol probabilities of S^n are P(σ_i | σ_{j1}, σ_{j2}, …, σ_{jμ}). μ is given by μ = [m/n], the smallest integer not less than m/n. The entropy is given by
H(S^n) = Σ_{S^n} Σ_{S^n} P(σ_j, σ_i) log [1 / P(σ_i | σ_j)],
H(S^n) = n H(S).
2012/10/24 Prof. Satoshi Nakamura 68
Extension of a Markov source
Example: m = 1, n = 3, μ = [m/n] = 1.
P(σ_i | σ_j) = P(s_{t1}, s_{t2}, s_{t3} | s_{t-3}, s_{t-2}, s_{t-1}) = P(s_{t1}, s_{t2}, s_{t3} | s_{t-1}).
For a first-order source,
P(σ_i | σ_j) = P(s_{i1}, s_{i2}, …, s_{in} | σ_j) = P(s_{i1}, …, s_{in}, σ_j) / P(σ_j)
             = P(s_{i1} | s_j) P(s_{i2} | s_{i1}) … P(s_{in} | s_{in-1}).
Hence
H(S^n) = Σ_{S^n} Σ_{S^n} P(σ_j, σ_i) log [1 / P(σ_i | σ_j)]
       = Σ P(σ_j, σ_i) log [1 / P(s_{i1} | s_j)] + Σ P(σ_j, σ_i) log [1 / P(s_{i2} | s_{i1})] + … + Σ P(σ_j, σ_i) log [1 / P(s_{in} | s_{in-1})]
       = n H(S).
2012/10/24 Prof. Satoshi Nakamura 69
Adjoint source of extended Markov source
Adjoint source of the extended Markov source, (S^n)‾.
Let P(σ_1), P(σ_2), …, P(σ_{q^n}) be the first-order symbol probabilities of the symbols σ_i of the nth extension of the first-order Markov source. Since σ_i corresponds to the sequence (s_{i1}, s_{i2}, …, s_{in}), we see that P(σ_i) may also be thought of as the nth-order joint probability of the s_{ik}.
If S is a first-order Markov source,
H((S^n)‾) = Σ_{S^n} P(σ_i) log [1 / P(σ_i)] = Σ_{S^n} P(s_{i1}, …, s_{in}) log [1 / P(s_{i1}, …, s_{in})],
with P(s_{i1}, s_{i2}, …, s_{in}) = P(s_{i1}) P(s_{i2} | s_{i1}) P(s_{i3} | s_{i2}) … P(s_{in} | s_{in-1}).
Therefore
H((S^n)‾) = Σ_{S^n} P(s_{i1}, …, s_{in}) { log [1 / P(s_{i1})] + log [1 / P(s_{i2} | s_{i1})] + … + log [1 / P(s_{in} | s_{in-1})] }
          = (n-1) H(S) + H(S̄),
or
H((S^n)‾) = n H(S) + [H(S̄) - H(S)].
2012/10/24 Prof. Satoshi Nakamura 70
Adjoint source of extended Markov source
Since H(S̄) ≥ H(S),
H((S^n)‾) ≥ n H(S) = H(S^n),
and
H((S^n)‾) / n = H(S) + [H(S̄) - H(S)] / n, so lim_{n→∞} H((S^n)‾) / n = H(S).
This inequality becomes less important as n becomes larger: for larger n, the Markov constraints on the symbols from S^n become less and less important. The adjoint of the nth extension of S is not the same as the nth extension of the adjoint of S: (S^n)‾ ≠ (S̄)^n.
If S is a zero-memory source, (S^n)‾ = S^n and H((S^n)‾) = n H(S).
2012/10/24 Prof. Satoshi Nakamura 71
Example
Probabilities for the Markov source
(Probability tables from the figure, as on slide 64; values 4/14, 1/14, and 5/14.)
2012/10/24 Prof. Satoshi Nakamura 72
Examples
H(S) = 0.81 bit
H(S̄) = 1.00 bit
H((S^2)‾) = Σ_{S^2} P(s_j, s_k) log [1 / P(s_j, s_k)] = 1.86 bits
H(S^2) = 2 H(S) = 1.62 bits
H((S^3)‾) = 2.66 bits
H((S^4)‾) = 3.47 bits
Note how the sequence H((S^n)‾)/n approaches H(S):
H(S̄) = 1.00 bit, H((S^2)‾)/2 = 0.93 bit, H((S^3)‾)/3 = 0.89 bit, H((S^4)‾)/4 = 0.87 bit.
2012/10/24 Prof. Satoshi Nakamura 73
Example: English
27 symbols: 26 letters + space.
With equiprobable symbols:
H(S) = log 27 = 4.75 bits/symbol.
With the actual symbol probabilities (zero-memory source):
H(S) = Σ_S P_i log (1/P_i) = 4.03 bits/symbol.
2012/10/24 Prof. Satoshi Nakamura 74
Example: English
1st-order Markov source:
H(S) = Σ_{S^2} P(i,j) log [1 / P(j|i)] = 3.32 bits/symbol.
2nd-order Markov source: (value given in the figure).
2012/10/24 Prof. Satoshi Nakamura 75
Example: English
Word-based zero-memory source
Word-based 1st order Markov source
2012/10/24 Prof. Satoshi Nakamura 76
Estimation of parameters of Markov source
Estimation of P(si/sj) from samples emitted from
the information source.
Regular 1st order Markov source Non-regular 1st order Markov source
2012/10/24 Prof. Satoshi Nakamura 77
Estimation of parameters of Markov source
The state transition sequence of the Markov source associated with the emitted output symbols is uniquely determined. We maximize the following probability P, the joint probability of the N observed samples:
P = ω_0 P_A(a)^{c1} P_A(b)^{c2} P_B(a)^{c3} P_B(b)^{c4} ω_F,
where ω_0 and ω_F are the initial and final state probabilities, respectively, P_A(a) = P(a|a) is the conditional probability of a state transition, and c1, …, c4 are the observed transition counts with Σ_i c_i = N.
Now we find the conditional probabilities which maximize log P under the constraints
P_A(a) + P_A(b) = 1, P_B(a) + P_B(b) = 1
by the Lagrangian method. The optimal conditional probabilities are given by
P_A(a) = c1 / (c1 + c2), P_A(b) = c2 / (c1 + c2),
P_B(a) = c3 / (c3 + c4), P_B(b) = c4 / (c3 + c4).
2012/10/24 Prof. Satoshi Nakamura 78
Estimation of parameters of Markov source
The Lagrangian method:
P = C ω_0 P_A(a)^{c1} P_A(b)^{c2} P_B(a)^{c3} P_B(b)^{c4} ω_F.
Our aim is to maximize this objective function under the constraints P_A(a) + P_A(b) = 1 and P_B(a) + P_B(b) = 1. For simplicity, we maximize Q = log P instead:
Q = log P + λ1 (1 - P_A(a) - P_A(b)) + λ2 (1 - P_B(a) - P_B(b)).
By taking the derivative with respect to each parameter, we have
∂Q/∂P_A(a) = c1/P_A(a) - λ1 = 0, ∂Q/∂P_A(b) = c2/P_A(b) - λ1 = 0,
∂Q/∂P_B(a) = c3/P_B(a) - λ2 = 0, ∂Q/∂P_B(b) = c4/P_B(b) - λ2 = 0,
∂Q/∂λ1 = 1 - P_A(a) - P_A(b) = 0, ∂Q/∂λ2 = 1 - P_B(a) - P_B(b) = 0.
Finally, we obtain
P_A(a) = c1/(c1+c2), P_A(b) = c2/(c1+c2), P_B(a) = c3/(c3+c4), P_B(b) = c4/(c3+c4).
2012/10/24 Prof. Satoshi Nakamura 79
Estimation of parameters of Markov source
P_A(a) = c1/(c1+c2), P_A(b) = c2/(c1+c2), P_B(a) = c3/(c3+c4), P_B(b) = c4/(c3+c4).
These are nothing but the relative frequencies of the symbol sequences observed along the state sequence. Now let N_A be the frequency of state A and N_A(b) the frequency of symbol b produced at state A. Then p(b|A) can be calculated by
p(b|A) = N_A(b) / N_A, and likewise p(a|A) = N_A(a) / N_A.
Let P(A,a) be the joint probability of symbol a being produced at state A, and P(A) the probability of state A. Then
p(a|A) = P(A,a) / P(A).
2012/10/24 Prof. Satoshi Nakamura 80
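A minimal Python sketch of this relative-frequency estimate for a 1st-order Markov source, where the state is simply the previous symbol (not from the slides; the sample sequence is made up):

```python
from collections import Counter

def estimate_transitions(symbols):
    """Relative-frequency estimate of P(next | current) for a 1st-order Markov source."""
    pair_counts = Counter(zip(symbols, symbols[1:]))   # N_state(symbol)
    state_counts = Counter(symbols[:-1])               # N_state
    return {(s, x): c / state_counts[s] for (s, x), c in pair_counts.items()}

seq = list("aabbbabbbbabbabb")
for (state, nxt), p in sorted(estimate_transitions(seq).items()):
    print(f"P({nxt}|{state}) = {p:.2f}")
```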
State transition matrix
Definition: matrix representation of the conditional probabilities.
Let P(a) and P(b) be the state transition matrices for symbols a and b,
P(a) = [ p(a|a) 0 ; p(a|b) 0 ],  P(b) = [ 0 p(b|a) ; 0 p(b|b) ],
and let W_0 = [1, 0] and W_F = [1, 1]^T be the initial state probability vector and the final state vector.
Now we can calculate the probability of an observed symbol sequence of arbitrary length:
M = W_0 P(S_{t1}) P(S_{t2}) … P(S_{tm}) W_F,  S_{ti} ∈ {a, b}.
The full state transition matrix is
P = P(a) + P(b) = [ p(a|a) p(b|a) ; p(a|b) p(b|b) ].
2012/10/24 Prof. Satoshi Nakamura 81
State transition matrix
Limit distribution: let W_0 be the initial state probability vector, with initial probability ω_i at state i at time n = 0, let P be the state transition matrix, and let
W^(n) = [ω_1^(n), ω_2^(n), …, ω_k^(n)] = W_0 P^n
be the state probability vector at time n. The limit distribution is given by
lim_{n→∞} W^(n) = lim_{n→∞} W_0 P^n.
2012/10/24 Prof. Satoshi Nakamura 82
Regular Markov source
Definition:
P^n converges to a unique matrix as n becomes large.
Each of its rows converges to a unique state probability vector W, whose elements are all positive.
The steady-state distribution exists uniquely and is equal to W.
The steady-state distribution Z = (z_1, z_2, …, z_k) satisfies
Z P = Z,  Σ_{i=1}^{k} z_i = 1.
Example:
P = [ 0.7 0.3 ; 0.2 0.8 ]
(z_1, z_2) P = (z_1, z_2):
0.7 z_1 + 0.2 z_2 = z_1,  0.3 z_1 + 0.8 z_2 = z_2,  z_1 + z_2 = 1.
The steady-state vector is Z_1 = 0.4, Z_2 = 0.6.
2012/10/24 Prof. Satoshi Nakamura 83
Example
2/1)|()|(,2/1)|()|(
,4/3)|()|(,4/1)|()|(
BbPbbPBaPbaP
AbPabPAaPaaP
12
1
4
1
2/12/1
4/34/1)()(
2/12/1
4/34/1
)|()|(
)|()|(
,
,,
ZZZZZ
ZZZZ
BbPBaP
AbPAaPP
babaa
baba
Now we have, .5
3,
5
2 BPZAPZ ba
92.0)|(
1log)()|(
)|(
1log)()|(
)|(
1log)()|(
)|(
1log)()|(
)|(
1log)()|()(
2
BbPBPBbP
BaPBPBaP
AbPAPAbP
AaPAPAaP
SSPSPSSPsH
ji
jj
s
iEntropy is given by,
2012/10/24 Prof. Satoshi Nakamura 84
Example
The entropy of the adjoint source S̄, using the stationary symbol probabilities 2/5 and 3/5, is
H(S̄) = (2/5) log [1 / (2/5)] + (3/5) log [1 / (3/5)] = 0.97 ≥ H(S) = 0.92.
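The two entropies of this worked example can be reproduced with a small Python sketch (not from the slides; labels follow the example):

```python
import math

# 1st-order Markov source of the example: the state is the last emitted symbol.
P_cond = {("a", "a"): 1/4, ("a", "b"): 3/4,   # P(next | state a)
          ("b", "a"): 1/2, ("b", "b"): 1/2}   # P(next | state b)
stationary = {"a": 2/5, "b": 3/5}

# H(S) = sum_state P(state) sum_next P(next|state) log 1/P(next|state)
H_S = sum(stationary[s] * p * math.log2(1/p) for (s, _), p in P_cond.items())

# The adjoint (zero-memory) source uses only the stationary symbol probabilities
H_adj = sum(p * math.log2(1/p) for p in stationary.values())

print(round(H_S, 2), round(H_adj, 2))  # 0.92 and 0.97: H(S) <= H(S adjoint)
```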
Goal of 2nd day
2012/10/24 Prof. Satoshi Nakamura 85
2012/10/24 Prof. Satoshi Nakamura 86
Hidden Markov information source
An information source with k symbols can be represented by an nth-order Markov source with k^n states.
If we merge states which have similar behavior, we obtain a non-deterministic automaton. This is called a hidden Markov source model.
The hidden Markov source model does not have a unique state sequence for an observed symbol sequence.
2012/10/24 Prof. Satoshi Nakamura 87
Hidden Markov source
Definition: a non-deterministic probabilistic automaton, or Markov source model, in which the state sequence cannot be recovered uniquely from the observed symbol sequence.
If we let the initial state be q1 and the final state be q3, the symbol sequence abab can be produced by the following state sequences:
Q1 = q1 q1 q1 q2 q3, Q2 = q1 q1 q2 q2 q3,
Q3 = q1 q1 q2 q3 q3, Q4 = q1 q2 q2 q2 q3,
Q5 = q1 q2 q2 q3 q3, Q6 = q1 q2 q3 q3 q3.
2012/10/24 Prof. Satoshi Nakamura 88
Hidden Markov source
P(Q1) can be calculated as the product of the transition and emission probabilities along the path:
P(Q1) = 0.3 x 0.7 x 0.3 x 0.3 x 0.7 x 0.8 x 0.5 x 0.6 = 0.0031752.
Now we have
P = Σ_i P(Q_i) = 0.238437.
[Forward calculation]: Now let the observed symbol sequence for the source be
X = x_1, x_2, …, x_I.
We try to estimate the probability P(X|M), assuming a hidden Markov information source M. The initial and final states satisfy
q^(0) ∈ I, q^(I) ∈ F,
where I and F are the initial and final state sets.
2012/10/24 Prof. Satoshi Nakamura 89
Probability of observed symbol sequence
The probability of the observed symbol sequence X given the model M is
P(X|M) = Σ_Q P(X, Q|M) = Σ_Q P(X|Q) P(Q).
Now we apply the 1st-order Markov assumption:
P(Q) = π(q^(0)) Π_{k=1}^{I} P(q^(k) | q^(k-1)),
P(X|Q) = Π_{k=1}^{I} P(x_k | q^(k-1), q^(k)).
Now,
P(X|M) = Σ_Q Π_{k=1}^{I} P(q^(k) | q^(k-1)) P(x_k | q^(k-1), q^(k))
       = Σ_Q π(q^(0)) Π_{k=1}^{I} a_{q^(k-1) q^(k)} b_{q^(k-1) q^(k)}(x_k),
where a_{ij} is the state transition probability and b_{ij}(x) is the emission probability.
2012/10/24 Prof. Satoshi Nakamura 90
Hidden Markov source
Now let π_i (Σ_i π_i = 1, q_i ∈ I) be the probabilities of the initial states. The probability of the observed symbol sequence given the model is
P(X|M) = Σ_Q π_{q^(0)} Π_{k=1}^{I} a_{q^(k-1) q^(k)} b_{q^(k-1) q^(k)}(x_k)  (over all Q).
If we define the forward probabilities
α(i, 0) = π_i  for i = 1, 2, …, S,
α(i, t) = Σ_j α(j, t-1) a_{ji} b_{ji}(x_t),
we get
P(X|M) = Σ_{i ∈ F} α(i, I).
2012/10/24 Prof. Satoshi Nakamura 91
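A compact Python sketch of this forward recursion (not from the slides). Emissions are attached to transitions as above; the parameter values are an assumption, chosen to be consistent with the worked example later in the deck, where P("ba"|M) = 0.426:

```python
def forward(pi, a, b, x, final_states):
    """alpha(i, t) recursion and P(X|M) for an HMM with transition-attached emissions."""
    states = list(pi)
    alpha = {i: pi[i] for i in states}           # alpha(i, 0) = pi_i
    for x_t in x:
        alpha = {i: sum(alpha[j] * a[j][i] * b[j][i][x_t] for j in states)
                 for i in states}
    return sum(alpha[i] for i in final_states)   # P(X|M) = sum over final states of alpha(i, I)

pi = {"A": 1.0, "B": 0.0}
a  = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.7, "B": 0.3}}
b  = {"A": {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}},
      "B": {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}}}
print(forward(pi, a, b, "ba", final_states=("A", "B")))  # 0.426
```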
Probability of observed symbol sequence
If we apply the forward probability α,
α(i, t) = Σ_j α(j, t-1) a_{ji} b_{ji}(x_t),  P(X|M) = Σ_{i ∈ F} α(i, I),
and if we apply the backward probability β,
β(i, t) = Σ_j a_{ij} b_{ij}(x_{t+1}) β(j, t+1),  P(X|M) = Σ_i π_i β(i, 0).
2012/10/24 Prof. Satoshi Nakamura 92
Trellis calculation
Three paths: abc + dec + dfg. (Trellis nodes are (state, time) pairs: (0,1), (1,1), (1,2), (2,2), (2,3).)
Node / accumulated path value:
N1: a
N2: d
N3: ab + de
N4: df
N5: (ab + de)c + dfg = abc + dec + dfg
2012/10/24 Prof. Satoshi Nakamura 93
Parameter estimation of HMM source
The state transition sequence cannot be determined uniquely in the HMM when only the symbol sequence is observed. Once the numbers of transitions between states are obtained, the state transition probabilities and emission probabilities can be estimated easily.
EM (Expectation-Maximization) algorithm: an iterative algorithm for parameter estimation.
Expectation step: find the state sequences for the observed sequence based on the assumed HMM model parameters.
Maximization step: estimate the HMM parameters along the state sequences so as to maximize the probability of the observed symbol sequence.
Here the HMM parameters include the state transition parameters and the emission parameters.
2012/10/24 Prof. Satoshi Nakamura 94
EM algorithm
Leonard Baum proved the following important inequality:
P_λ̂(X) ≥ P_λ(X), with equality if, and only if, λ̂ = λ,   (1)
where λ is the assumed HMM parameter set and λ̂ is the HMM parameter set estimated by the EM algorithm.
Let A = {a_i} be a state sequence estimated for the observed symbol sequence. We modify the objective function as follows:
P_λ̂(X) = P_λ̂(A, X) / P_λ̂(A | X).   (2)
By taking the logarithm,
log P_λ̂(X) = log P_λ̂(A, X) - log P_λ̂(A | X).   (3)
Now we take the expectation E_{A|X}[ ] over the estimated state sequences:
E_{A|X}[log P_λ̂(X)] = Σ_i P_λ(a_i | X) log P_λ̂(X) = log P_λ̂(X).   (4)
2012/10/24 Prof. Satoshi Nakamura 95
EM algorithm
If we substitute (4) into (3),
log P_λ̂(X) = E_{A|X}[log P_λ̂(A, X)] - E_{A|X}[log P_λ̂(A | X)]
           = Σ_i P_λ(a_i | X) log P_λ̂(a_i, X) - Σ_i P_λ(a_i | X) log P_λ̂(a_i | X).   (5)
Now we recall Jensen's inequality:
∫_R f(x) log f(x) dx ≥ ∫_R f(x) log g(x) dx,
where f(x) and g(x) are probability density functions, with equality if, and only if, f(x) = g(x).
Applying Jensen's inequality to the second term on the right side,
Σ_i P_λ(a_i | X) log P_λ(a_i | X) ≥ Σ_i P_λ(a_i | X) log P_λ̂(a_i | X),   (6)
with equality if, and only if, P_λ̂(a_i | X) = P_λ(a_i | X), that is, λ̂ = λ.
2012/10/24 Prof. Satoshi Nakamura 96
EM algorithm
Now we have
log P_λ̂(X) ≥ Σ_i P_λ(a_i | X) log P_λ̂(a_i, X) - Σ_i P_λ(a_i | X) log P_λ(a_i | X).
If we choose λ̂ so that the first term on the right side satisfies
Σ_i P_λ(a_i | X) log P_λ̂(a_i, X) ≥ Σ_i P_λ(a_i | X) log P_λ(a_i, X),   (7)
namely
E_{A|X}[log P_λ̂(A, X)] ≥ E_{A|X}[log P_λ(A, X)],   (8)
then, together with equation (5), the following holds:
log P_λ̂(X) ≥ Σ_i P_λ(a_i | X) log P_λ(a_i, X) - Σ_i P_λ(a_i | X) log P_λ(a_i | X).   (9)
2012/10/24 Prof. Satoshi Nakamura 97
EM algorithm
In summary,
log P_λ̂(X) = Σ_i P_λ(a_i|X) log P_λ̂(a_i, X) - Σ_i P_λ(a_i|X) log P_λ̂(a_i|X)
           ≥ Σ_i P_λ(a_i|X) log P_λ̂(a_i, X) - Σ_i P_λ(a_i|X) log P_λ(a_i|X)
           ≥ Σ_i P_λ(a_i|X) log P_λ(a_i, X) - Σ_i P_λ(a_i|X) log P_λ(a_i|X)
           = log P_λ(X).
Thus, if equation (7) holds, we obtain parameters which satisfy
log P_λ̂(X) ≥ log P_λ(X).
2012/10/24 Prof. Satoshi Nakamura 98
Parameter estimation by EM algorithm
As in the previous slides, parameter estimation can be achieved by maximizing E_{A|X}[log P_λ̂(A, X)]:
E = E_{A|X}[log P_λ̂(A, X)] = Σ_k P_λ(a_k | X) log P_λ̂(a_k, X) = Σ_k [P_λ(a_k, X) / P_λ(X)] log P_λ̂(a_k, X),
where P_λ(a_k, X) / P_λ(X) can be calculated using the current parameter set λ.
The numerator of P(a_k, X)/P(X) is the joint probability of observing X together with the state sequence a_k.
The denominator of P(a_k, X)/P(X) is the probability of observing X based on the HMM.
2012/10/24 Prof. Satoshi Nakamura 99
Parameter estimation by EM algorithm
Now, by counting the state transitions along the state sequence a_k, we have
P_λ̂(a_k, X) = π^(k) Π_{i,j} a_ij^{c_ij^(k)} Π_{i,j,x} b_ij(x)^{d_ij^(k)(x)},
where c_ij^(k) and d_ij^(k)(x) are the counts of the state transition a_ij and of the emission b_ij(x), respectively, along a_k, and π^(k) is the initial-state factor.
Then E can be rewritten as
E = Σ_k [P(a_k, X) / P(X)] { Σ_{i,j} c_ij^(k) log a_ij + Σ_{i,j,x} d_ij^(k)(x) log b_ij(x) + log π^(k) }.
If we let
c'_ij = Σ_k c_ij^(k) P(a_k, X) / P(X),  d'_ij(x) = Σ_k d_ij^(k)(x) P(a_k, X) / P(X),
we have E as follows:
E = Σ_{i,j} c'_ij log a'_ij + Σ_{i,j,x} d'_ij(x) log b'_ij(x) + (initial-state terms).
2012/10/24 Prof. Satoshi Nakamura 100
Parameter estimation by EM algorithm
This is nothing but the probability function of a Markov source. Thus we can obtain the parameters by maximizing E, with ∂E/∂a_ij = 0.
For a_ij, c'_ij = Σ_k c_ij^(k) P(a_k, X) / P(X) can be thought of as the relative count of the state transitions from state i to state j. Thereby we have
â_ij = c'_ij / Σ_j c'_ij.
If we use
γ(i, j, t) = α(i, t-1) a_ij b_ij(x_t) β(j, t),
then c'_ij = Σ_t γ(i, j, t) and we have
â_ij = Σ_t γ(i, j, t) / Σ_j Σ_t γ(i, j, t).
2012/10/24 Prof. Satoshi Nakamura 101
Parameter estimation of HMM source
First we define the backward probability β(i, t), the probability of emitting x_{t+1}, x_{t+2}, …, x_I from state q_i at time t. This probability can be efficiently calculated starting from the final symbol.
Initial setting, for q = 1, …, n:
β(q, I) = 1.0 if q ∈ F, β(q, I) = 0.0 otherwise.
Iteration of the backward path, for t = I-1, …, 1, 0 and q_i = 1, …, n:
β(q_i, t) = Σ_{q_j} a_{q_i q_j} b_{q_i q_j}(x_{t+1}) β(q_j, t+1).
The following relationship holds:
P(X|M) = Σ_{q_i ∈ F} α(q_i, I) = Σ_{q_i ∈ I} π_i β(q_i, 0).
2012/10/24 Prof. Satoshi Nakamura 102
Parameter estimation of HMM source
Let γ(i, j, t) be the probability of producing x_t during a transition from state q_i to state q_j. Now γ(i, j, t) can be calculated using α(i, t-1) and β(j, t):
γ(i, j, t) = α(i, t-1) a_ij b_ij(x_t) β(j, t) / P(X|M).
Here γ(i, j, t) represents the probability (relative transition count) of producing x_t during a transition from state q_i to state q_j, assuming the HMM λ = {a_ij, b_ij(x), π_i}.
2012/10/24 Prof. Satoshi Nakamura 103
Parameter estimation of HMM source
Now we have the following estimation formulae:
â_ij = Σ_t γ(i, j, t) / Σ_j Σ_t γ(i, j, t)
     = Σ_t α(i, t-1) a_ij b_ij(x_t) β(j, t) / Σ_t α(i, t) β(i, t),
b̂_ij(k) = Σ_{t: x_t = k} γ(i, j, t) / Σ_t γ(i, j, t)
        = Σ_{t: x_t = k} α(i, t-1) a_ij b_ij(x_t) β(j, t) / Σ_t α(i, t-1) a_ij b_ij(x_t) β(j, t).
2012/10/24 Prof. Satoshi Nakamura 104
Parameter estimation of HMM
The calculations above are iterated until convergence. The parameter estimation is also applied not to a single observation but to many observed symbol sequences:
â_ij = Σ_{n ∈ N} Σ_t γ^(n)(i, j, t) / Σ_{n ∈ N} Σ_j Σ_t γ^(n)(i, j, t).
γ(i, j, t) represents the probability that the information source produces the symbol x_t during a state transition from state q_i to q_j, given that the symbol sequence x is observed, regardless of the state sequence.
Beyond this point, the calculation is the same as that for the Markov source model.
2012/10/24 Prof. Satoshi Nakamura 105
Entropy for HMM source
The entropy per symbol at a state q_j is given by
H(X | q_j) = -Σ_x p(x | q_j) log p(x | q_j),  where p(x | q_j) = Σ_k a_jk b_jk(x).
We obtain the entropy of the HMM by taking the expectation over all states:
H(X) = Σ_j ω(q_j) H(X | q_j),
where ω(q_j) are the steady-state probabilities of the HMM states.
2012/10/24 Prof. Satoshi Nakamura 106
An example
Estimate the HMM parameters based on the observed symbol sequence "ba".
Step 1: enumerate the state sequences and their joint probabilities P(a_i, X):
State seq. ("ba") : P(a_i, X)
AAA: 0.3 x 0.1 x 0.3 x 0.9 = 0.0081
AAB: 0.3 x 0.1 x 0.7 x 0.1 = 0.0021
ABA: 0.7 x 0.9 x 0.7 x 0.9 = 0.3969
ABB: 0.7 x 0.9 x 0.3 x 0.1 = 0.0189
Sum: 0.426
2012/10/24 Prof. Satoshi Nakamura 107
An example
Step 2:
P(a|A) = 0.3 x 0.9 + 0.7 x 0.1 = 0.34
P(b|A) = 0.3 x 0.1 + 0.7 x 0.9 = 0.66
P(a|B) = 0.7 x 0.9 + 0.3 x 0.1 = 0.66
P(b|B) = 0.7 x 0.1 + 0.3 x 0.9 = 0.34
H(X|A) = -0.34 log 0.34 - 0.66 log 0.66 = 0.9264
H(X|B) = -0.66 log 0.66 - 0.34 log 0.34 = 0.9264
(log 0.34 = -1.56, log 0.66 = -0.60)
P(A) = P(B) = 1/2
Now we have the entropy of the HMM:
H(X) = H(X|A) P(A) + H(X|B) P(B) = 0.9264.
2012/10/24 Prof. Satoshi Nakamura 108
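A brute-force Python check of Steps 1 and 2 (not from the slides). The transition and emission parameters are an assumption, reconstructed from the probability products shown in Step 1:

```python
import math
from itertools import product

# Reconstructed parameters: transitions a[i][j] and transition-attached emissions b[i][j][symbol]
a = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.7, "B": 0.3}}
b = {"A": {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}},
     "B": {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}}}
X = "ba"

# Step 1: joint probability of each state sequence with the observation
total = 0.0
for path in product("AB", repeat=len(X)):
    states = ("A",) + path                      # the example starts in state A
    p = math.prod(a[states[t]][states[t + 1]] * b[states[t]][states[t + 1]][X[t]]
                  for t in range(len(X)))
    print("".join(states), p)
    total += p
print("P(X) =", total)                          # 0.426

# Step 2: per-state symbol probabilities and per-state entropies
for s in "AB":
    p_a = sum(a[s][j] * b[s][j]["a"] for j in "AB")
    H = -(p_a * math.log2(p_a) + (1 - p_a) * math.log2(1 - p_a))
    print(f"P(a|{s}) = {p_a:.2f}, H(X|{s}) = {H:.4f}")  # 0.34 / 0.66, ~0.925 (0.9264 with the slide's rounded logs)
```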
An example
Step 3: parameter estimation of the HMM.
â_AA = (prob. of state sequences, weighted by the number of A->A transitions) / (prob. of state sequences, weighted by the number of transitions leaving A)
     = (2 x 0.0081/0.426 + 0.0021/0.426) / (2 x 0.0081/0.426 + 2 x 0.0021/0.426 + 0.3969/0.426 + 0.0189/0.426)
     = 0.042
b̂_AA(a) = (prob. of state sequences with an A->A transition producing symbol "a") / (prob. of state sequences, weighted by the number of A->A transitions)
        = 0.0081 / (2 x 0.0081 + 0.0021) = 0.4426
2012/10/24 Prof. Satoshi Nakamura 109
An example
There is another way of estimation using
.
time
symbol
,
021.0405.0
1.03.0)(9.07.0)(
1.07.0)()(9.03.0)()(2
63.003.0
9.07.0)()(1.03.0)()(1
0.0)(0.1)(0
Symbol
11
1212
0101
00
aBB
ABAA
bABAA
BA
Time
426.0)(2 iS
426.0
9.07.0)(0
1.03.0)()(
66.034.0
1.03.0)(1.07.0)(1
9.07.0)()(9.03.0)()(
0.1)(0.1)(2
1
10
22
2121
22
bB
AA
aBB
ABAA
BA
2012/10/24 Prof. Satoshi Nakamura 110
An example
ab
AxbaAAxbaA
AxbaAab
a
AAAA
AxbaAAxbaAa
ab
tAAAAtAAAA
tAAAAAA
tAAAAtAAAAAA
4426.0)()()()()()(
)()()()(ˆ
0420.0)()()()(
)()()()()()(ˆ
2110
21
1100
2110
2012/10/24 Prof. Satoshi Nakamura 111
An example
)21(0.0)|(
)20(0.10.19545.00.10455.0)|(
)19(9766.09950.09580.05574.00420.0)|(
)18(0234.00050.09580.04426.00420.0)|(
BbP
BaP
AbP
AaP
2012/10/24 Prof. Satoshi Nakamura 112
An example
)24(0)|(
)23(1601.0
)22(9766.0log9766.00234.0log0234.0)|(
BXH
AXH
1, iZZZP
)(0455.09545.0
9580.00420.0)( ,, BABA ZZZZ
501.0,499.0 BA ZZ
)26(0799.0499.01601.0
)25()()|()()|()(
BPBXHAPAXHXH
First goal of 3rd day
2012/10/24 Prof. Satoshi Nakamura 113
2012/10/24 Prof. Satoshi Nakamura 114
Some properties of codes
Definition: Let the set of symbols comprising a given alphabet be called S = {s1,s2,…,sq}. Then we define a code as a mapping of all possible sequences of symbols of S into sequences of symbols of some other alphabet X = {x1,x2,…,xr}. We call S the source alphabet and X the code alphabet.
(Diagram:) Information source -> source alphabet S = {s1, s2, …, sq} -> code alphabet X = {x1, x2, …, xr} -> code word X_i = (x_{i1}, x_{i2}, …, x_{ij}).
2012/10/24 Prof. Satoshi Nakamura 115
Classification of coding
Non-block
code
Block
code
Singular
code
Nonsingular
code
Uniquely
undecodable
code
Uniquely
decodable
code
Noninstantaneous
code
Instantaneous
code
2012/10/24 Prof. Satoshi Nakamura 116
Block code
Definition: A block code is a code which maps each of the symbols
of the source alphabet S into a fixed sequence of symbols of the
code alphabet X. These fixed sequences of the code alphabet
(sequences of xj) are called code words. We denote the code word
corresponding to the source symbol si by Xi. Note that Xi denotes a
sequence of xj’s.
Source symbols code
S1 0
S2 11
S3 00
S4 01
2012/10/24 Prof. Satoshi Nakamura 117
Nonsingular block code
Definition: A block code is said to be nonsingular if all the words of
the code are distinct.
It is still possible for a given sequence of code symbols to have an
ambiguous origin. For example, the sequence 0011 might represent
either s3s2 or s1s1s2.
Source symbols code
S1 0
S2 11
S3 00
S4 01
2012/10/24 Prof. Satoshi Nakamura 118
Extension of block code
Definition: The nth extension of a block code which maps the
symbols si into the code words Xi is the block code which maps the
sequences of source symbols (si1, si2, …, sin) into the sequences of
code words (Xi1,Xi2,…,Xin).
Source symbols code Source symbols code
S1S1 00 S3S1 000
S1S2 011 S3S2 0011
S1S3 000 S3S3 0000
S1S4 001 S3S4 0001
S2S1 110 S4S1 010
S2S2 1111 S4S2 0111
S2S3 1100 S4S3 0100
S2S4 1101 S4S4 0101
2012/10/24 Prof. Satoshi Nakamura 119
Uniquely decodable code
Definition: A block code is said to be uniquely decodable if, and only if, the nth extension of the code is nonsingular for every finite n.
If the code is uniquely decodable, any two distinct sequences of source symbols of the same length produce distinct sequences of code symbols.
Two source sequences of different lengths should also map to distinct code sequences if the code is uniquely decodable.
Suppose we have source symbol sequences S1 and S2, possibly of different lengths, which lead to the same sequence of code symbols X0. Now form two new sequences of source symbols, S1' = S1S2 and S2' = S2S1. Both S1' and S2' have the same length and are encoded as X0 followed by X0; thus the code does not satisfy the condition of unique decodability.
2012/10/24 Prof. Satoshi Nakamura 120
Instantaneous code
Code A: This code is uniquely decodable, since all code words have the same length and are distinct.
Code B : This code is also uniquely decodable, since it is non-singular. It is called “Comma code”, which separates code by comma, 0 in this example.
Code C : This code is also uniquely decodable. However, we are not able to decode the sequence, word by word, as it is received. We can decode only after receiving 0 of the next code word.
Source symbol Code A Code B Code C
S1 00 0 0
S2 01 10 01
S3 10 110 011
S4 11 1110 0111
2012/10/24 Prof. Satoshi Nakamura 121
Instantaneous code
Definition: A uniquely decodable code is said to be instantaneous if it is possible to decode each word in a sequence without reference to succeeding code symbols.
Code A and code B are instantaneous. However, code C is not instantaneous. A more general method to determine whether a code is instantaneous would be helpful.
Definition: Let X_i = x_{i1} x_{i2} … x_{im} be a word of some code. The sequence of code symbols (x_{i1} x_{i2} … x_{ij}), where j ≤ m, is called a prefix of the code word X_i.
Ex. 0, 01, 011, 0111 are prefixes of 0111.
2012/10/24 Prof. Satoshi Nakamura 122
Instantaneous code
A necessary and sufficient condition for a code to be instantaneous is that no complete word of the code be a prefix of some other code word.
Sufficient part:
If no word is the prefix of some other word, we may decode any
received sequence of code symbols comprised of code words in a
direct manner.
We scan the received sequence of code symbols until we come to a
subsequence which comprises a complete code word.
The subsequence must be this code word since by assumption it is
not the prefix of any other code word.
2012/10/24 Prof. Satoshi Nakamura 123
Instantaneous code
Necessary part:
We assume that there exists some word of our code, say Xi, which is
also a prefix of some other word Xj.
Now, if we scan a received sequence of code symbols and come upon
the subsequence Xi, this subsequence may be a complete word, or it
may be just the first part of word Xj.
We cannot possibly tell which of these alternatives is true, however,
until we examine more code symbols of the main sequence-thus the
code is not instantaneous.
(Classification diagram of codes, as on slide 116.)
2012/10/24 Prof. Satoshi Nakamura 124
Construction of an Instantaneous code
Example code synthesis:
Assign 0 to symbol s1: s1 = 0.
If we also assigned 1 to s2, no unused prefixes would remain for the other symbols, so we might instead set s2 = 10.
This, in turn, requires us to start the remaining code words with 11. If s3 = 110, then the only three-binit prefix still unused is 111, and we might set s4 = 1110 and s5 = 1111.
Other alternatives: if we synthesize another binary instantaneous code, we may set s1 = 00 and s2 = 01; we still have two prefixes of length 2 unused, so s3 = 10, s4 = 110, s5 = 111.
2012/10/24 Prof. Satoshi Nakamura 125
Kraft inequality
Constraints on the lengths of the words of an instantaneous code.
Consider an instantaneous code with source alphabet S = {s1, …, sq} and code alphabet X = {x1,x2,…,xr}. Let the code words be X1,X2,…,Xq and define the length (number of code symbols) of word X_i as l_i. It is often desirable that the lengths of the code words of our code be as small as possible. Necessary and sufficient conditions for the existence of an instantaneous code with word lengths l1,l2,…,lq are provided by the Kraft inequality.
Kraft inequality: A necessary and sufficient condition for the existence of an instantaneous code with word lengths l1,l2,…,lq is that
Σ_{i=1}^{q} r^{-l_i} ≤ 1,
where r is the number of different symbols in the code alphabet.
2012/10/24 Prof. Satoshi Nakamura 126
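The Kraft test is a one-liner in code. The following Python sketch (not from the slides) checks the length sets used in the examples on the next slides:

```python
def kraft_sum(lengths, r=2):
    """Left-hand side of the Kraft inequality for word lengths l_i and an r-symbol code alphabet."""
    return sum(r ** (-l) for l in lengths)

# Binary codes A-E of the following slide
for name, lengths in [("A", [2, 2, 2, 2]), ("B", [1, 3, 3, 3]),
                      ("C", [1, 2, 3, 3]), ("D", [1, 3, 3, 2]), ("E", [1, 2, 3, 2])]:
    s = kraft_sum(lengths)
    print(f"code {name}: sum = {s:.3f} -> {'possible' if s <= 1 else 'impossible'}")

# The trinary example: ten symbols fail, nine symbols succeed
print(kraft_sum([1] + [2] * 5 + [3] * 4, r=3))   # 28/27 > 1
print(kraft_sum([1] + [2] * 5 + [3] * 3, r=3))   # exactly 1
```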
Kraft inequality
For the binary case, the Kraft inequality tells us that the l_i must satisfy
Σ_{i=1}^{q} 2^{-l_i} ≤ 1.
Source symbols Code A Code B Code C Code D Code E
S1 00 0 0 0 0
S2 01 100 10 100 10
S3 10 110 110 110 110
S4 11 111 111 11 11
2012/10/24 Prof. Satoshi Nakamura 127
Kraft inequality
Code A (lengths 2,2,2,2): Σ_i 2^{-l_i} = 4 x 2^{-2} = 1.
The Kraft inequality does not tell us that code A is an instantaneous code. The inequality is merely a condition on the word lengths of the code and not on the words themselves.
Code B (lengths 1,3,3,3): Σ_i 2^{-l_i} = 2^{-1} + 3 x 2^{-3} = 7/8.
Code C (lengths 1,2,3,3): Σ_i 2^{-l_i} = 2^{-1} + 2^{-2} + 2 x 2^{-3} = 1.
Code D (lengths 1,3,3,2): Σ_i 2^{-l_i} = 2^{-1} + 2 x 2^{-3} + 2^{-2} = 1, yet code D is not an instantaneous code: its lengths would allow one, but its words violate the prefix condition.
Code E (lengths 1,2,3,2): Σ_i 2^{-l_i} = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-2} = 1 1/8 > 1, so code E cannot be an instantaneous code.
2012/10/24 Prof. Satoshi Nakamura 128
One more example
Suppose we wish to encode the outputs of a decimal source, S = {0,1,2,…,9}, into a binary instantaneous code, and suppose there is some reason for encoding the 0 and 1 symbols of the decimal source into relatively short binary code words:
0 -> 0, 1 -> 10.
If we require the remaining eight code words to be of the same length, say l, the Kraft inequality provides a direct answer. By assumption we have l0 = 1, l1 = 2, and l2 = l3 = … = l9 = l. Then
Σ_{i=0}^{9} 2^{-l_i} = 1/2 + 1/4 + 8 x 2^{-l} ≤ 1,
or l ≥ 5.
2012/10/24 Prof. Satoshi Nakamura 129
The Kraft inequality - Proof
First we prove that the inequality is sufficient for the existence of an instantaneous code by actually constructing an instantaneous code satisfying
Σ_{i=1}^{q} r^{-l_i} ≤ 1.   (1)
Let n_j be the number of words of length j, and let L be the largest of the l_i. Then (1) can be written as
Σ_{j=1}^{L} n_j r^{-j} ≤ 1.
On multiplying by r^L,
n_1 r^{L-1} + n_2 r^{L-2} + … + n_L ≤ r^L.
Rearranging terms,
n_L ≤ r^L - n_1 r^{L-1} - n_2 r^{L-2} - … - n_{L-1} r.
Since n_L ≥ 0, dividing by r and rearranging again,
n_{L-1} ≤ r^{L-1} - n_1 r^{L-2} - … - n_{L-2} r.
Iterating the operation,
n_3 ≤ r^3 - n_1 r^2 - n_2 r,
n_2 ≤ r^2 - n_1 r,
n_1 ≤ r.
2012/10/24 Prof. Satoshi Nakamura 130
The Kraft inequality - Proof
Steps:
We assign the n_1 words of length 1. There are r possible such words that we may form using an r-symbol code alphabet, and n_1 ≤ r, so we can select these n_1 code words arbitrarily. We are then left with r - n_1 permissible prefixes of length 1.
By adding one symbol to the end of each of these permissible prefixes, we may form as many as (r - n_1) r = r^2 - n_1 r words of length 2.
As before, we choose our n_2 words arbitrarily among our r^2 - n_1 r choices; we are left with (r^2 - n_1 r) - n_2 unused prefixes of length 2, from which we may form ((r^2 - n_1 r) - n_2) r = r^3 - n_1 r^2 - n_2 r permissible prefixes of length 3, and so on.
2012/10/24 Prof. Satoshi Nakamura 131
McMillan's inequality
Proof of the necessity of the condition for uniquely decodable codes:
Consider the quantity (Σ_{i=1}^{q} r^{-l_i})^n = Σ r^{-(l_{i1} + l_{i2} + … + l_{in})}; we have q^n terms, each of the form r^{-k} with k = l_{i1} + l_{i2} + … + l_{in}.
Let L be the maximum of the word lengths l_i, so n ≤ k ≤ nL. We define N_k as the number of terms of the form r^{-k}; then
(Σ_{i=1}^{q} r^{-l_i})^n = Σ_{k=n}^{nL} N_k r^{-k}.
N_k is also the number of strings of n code words that can be formed so that each string has a length of exactly k code symbols.
If the code is uniquely decodable, N_k must be no greater than r^k, the number of distinct r-ary sequences of length k. Thus we have
(Σ_{i=1}^{q} r^{-l_i})^n ≤ Σ_{k=n}^{nL} r^k r^{-k} ≤ nL.   (*)
Bernoulli's inequality: for x > 1 and n arbitrarily large, x^n > nL holds. Considering this inequality and equation (*), we can prove
Σ_{i=1}^{q} r^{-l_i} ≤ 1.
2012/10/24 Prof. Satoshi Nakamura 132
Example
Assume we wish to encode a source with 10 source symbols into a trinary instantaneous code with word lengths 1,2,2,2,2,2,3,3,3,3. Applying the test of the Kraft inequality, we have
Σ_i 3^{-l_i} = 1/3 + 5 x 1/9 + 4 x 1/27 = 28/27 > 1.
This does not satisfy the inequality.
Assume we wish to encode symbols from a source with nine symbols into a trinary instantaneous code with lengths 1,2,2,2,2,2,3,3,3. Applying the test of the Kraft inequality, we have
Σ_i 3^{-l_i} = 1/3 + 5 x 1/9 + 3 x 1/27 = 1.
An example of such a code:
s1 = 0, s2 = 10, s3 = 11,
s4 = 12, s5 = 20, s6 = 21,
s7 = 220, s8 = 221, s9 = 222.
2012/10/24 Prof. Satoshi Nakamura 133
Coding information sources
For a given source alphabet and a given code alphabet we can construct many instantaneous codes; this forces us to find a criterion by which we may choose among the codes.
Perhaps the most natural criterion for this selection, although by no means the only possibility, is length.
Definition: Let a block code transform the source symbols s1,s2,…,sq into the code words X1,X2,..,Xq. Let the probabilities of the source symbols be P1,P2,…,Pq, and let the lengths of the code words be l1,l2,…,lq. Then we define L, the average length of the code, by the equation
L = Σ_{i=1}^{q} P_i l_i.
2012/10/24 Prof. Satoshi Nakamura 134
Coding information source
Average length and entropy:
Definition: Consider an instantaneously decodable code which maps the symbols from a source S, s1,s2,…,sq with probabilities P1,P2,…,Pq, into code words composed of symbols from an r-ary code alphabet. We have the following relationship:
H_r(S) = H(S)/log r ≤ L.
Compact code:
Definition: Consider a uniquely decodable code which maps the symbols from a source S into code words composed of symbols from an r-ary code alphabet. This code will be called compact (for the source S) if its average length is less than or equal to the average length of all other uniquely decodable codes for the same source and the same code alphabet.
2012/10/24 Prof. Satoshi Nakamura 135
Compact code
Proof of the relationship:
Consider a zero-memory source S, with symbols s1,s2,…,sq and symbol probabilities P1,P2,…,Pq, respectively. Let a block code encode these symbols into a code alphabet of r symbols, and let the length of the word corresponding to s_i be l_i. Then the entropy of this zero-memory source is
H(S) = Σ_{i=1}^{q} P_i log (1/P_i).
Let Q1,Q2,…,Qq be any q numbers such that Q_i ≥ 0 for all i and Σ_i Q_i = 1. By Jensen's inequality, we know that
Σ_{i=1}^{q} P_i log (1/P_i) ≤ Σ_{i=1}^{q} P_i log (1/Q_i),
with equality if and only if P_i = Q_i for all i. Hence
H(S) ≤ Σ_{i=1}^{q} P_i log (1/Q_i).   (1)
2012/10/24 Prof. Satoshi Nakamura 136
Compact code
Equation (1) is valid for any set of nonnegative numbers Q_i which sum to 1. We may choose
Q_i = r^{-l_i} / Σ_{j=1}^{q} r^{-l_j}.
We obtain
H(S) ≤ Σ_{i=1}^{q} P_i log (1/Q_i) = Σ_{i=1}^{q} P_i l_i log r + log (Σ_{j=1}^{q} r^{-l_j})   (2)
     ≤ L log r   (since Σ_j r^{-l_j} ≤ 1, its logarithm is ≤ 0),
or
H(S)/log r = H_r(S) ≤ L.
2012/10/24 Prof. Satoshi Nakamura 137
Compact code
A method of encoding for special sources.
Considering eqns. (1)(2), a condition for equality in the last inequality is
Σ_{j=1}^{q} r^{-l_j} = 1.
Then we see that a necessary and sufficient condition for equality is
P_i = Q_i = r^{-l_i} / Σ_{j=1}^{q} r^{-l_j} = r^{-l_i}  for all i,
or
l_i = log_r (1/P_i)  for all i.   (4-9b)
2012/10/24 Prof. Satoshi Nakamura 138
Compact code
We may say that, for an instantaneous code and a zero-memory
source, L must be greater than or equal to Hr(S). Furthermore, L
can achieve this lower bound if and only if we can choose the word
lengths li equal to logr (1/Pi) for all i. For the equality, therefore,
log r (1/Pi) must be an integer for each i.
In other words, for the equality the symbol probabilities Pi must all
be of the form (1/r)ai, where ai is an integer.
Note that if these conditions are met, we have derived the word
lengths of a compact code. We simply choose li equal to ai.
2012/10/24 Prof. Satoshi Nakamura 139
Compact code
Source symbol / symbol prob. P_i / code / length l_i:
S1 1/2 0 (l=1)
S2 1/4 10 (l=2)
S3 1/8 110 (l=3)
S4 1/8 111 (l=3)
L = Σ_i P_i l_i = 1/2 x 1 + 1/4 x 2 + 1/8 x 3 + 1/8 x 3 = 1 3/4
H(S) = Σ_i P_i log (1/P_i) = 1 3/4
2012/10/24 Prof. Satoshi Nakamura 140
Example: Compact code
Source symbol Symbol prob. code
S1 1/4 00
S2 1/4 01
S3 1/4 10
S4 1/4 11
Source symbol Symbol prob. code
S1 1/2 0
S2 1/4 10
S3 1/8 110
S4 1/8 111
2012/10/24 Prof. Satoshi Nakamura 141
Example: Compact code
Source symbol Symbol prob. code
S1 1/3 0
S2 1/3 1
S3 1/9 20
S4 1/9 21
S5 1/27 220
S6 1/27 221
S7 1/27 222
2012/10/24 Prof. Satoshi Nakamura 142
Shannon's first theorem
We now turn to a zero-memory source with arbitrary symbol probabilities. Equation (4-9b) tells us that if log_r (1/P_i) is an integer, we should choose the word length l_i equal to this integer. If log_r (1/P_i) is not an integer, it might seem reasonable that a compact code could be found by selecting l_i as the first integer greater than this value. This tempting conjecture is, in fact, not valid, but we shall find that selecting l_i in this manner can lead to some important results.
Choose l_i such that
log_r (1/P_i) ≤ l_i < log_r (1/P_i) + 1.   (4-10)
First, we check that these word lengths satisfy the Kraft inequality. From the left inequality of (4-10),
r^{-l_i} ≤ P_i.   (4-11)
Summing (4-11) over all i, we obtain
Σ_{i=1}^{q} r^{-l_i} ≤ 1.
2012/10/24 Prof. Satoshi Nakamura 143
Shannon's first theorem
If we multiply (4-10) by P_i and sum over all i,
H_r(S) ≤ L < H_r(S) + 1.   (4-12)
In this way, if we construct the code as in (4-10), we have lower and upper bounds on L. This is valid for any zero-memory source, so we may apply it to the nth extension of our original source S:
H_r(S^n) ≤ L_n < H_r(S^n) + 1.   (4-13)
L_n represents the average length of the code words corresponding to symbols from the nth extension of the source S: if l_i is the length of the code word corresponding to symbol σ_i, and P(σ_i) is the probability of σ_i, then
L_n = Σ_{i=1}^{q^n} P(σ_i) l_i.   (4-14)
Since H_r(S^n) = n H_r(S), dividing (4-13) by n gives
H_r(S) ≤ L_n / n < H_r(S) + 1/n.   (4-15a)
L_n / n is the average number of code symbols used per single symbol from S.
2012/10/24 Prof. Satoshi Nakamura 144
Shannon's first theorem
It is possible to make L_n/n as close to H_r(S) as we wish by coding the nth extension of S rather than S:
lim_{n→∞} L_n / n = H_r(S).   (4-15b)
Equation (4-15a) is known as Shannon's first theorem, or the noiseless coding theorem. The price we pay for decreasing L_n/n is the increased coding complexity caused by the large number (q^n) of source symbols.
2012/10/24 Prof. Satoshi Nakamura 145
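The convergence of L_n/n toward H_r(S) can be observed numerically. The following Python sketch (not from the slides) codes the nth extension with the word lengths of (4-10), l_i = ceil(log2(1/P(σ_i))), for the three-symbol source used two slides below:

```python
import math
from itertools import product

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

P = [2/3, 2/9, 1/9]   # zero-memory source of the "coding without extensions" example

def shannon_length_per_symbol(P, n):
    """L_n / n when the n-th extension is coded with l_i = ceil(log2(1/P(sigma_i)))."""
    Ln = sum(math.prod(block) * math.ceil(math.log2(1 / math.prod(block)))
             for block in product(P, repeat=n))
    return Ln / n

print("H(S) =", entropy(P))
for n in (1, 2, 4, 8):
    print(n, shannon_length_per_symbol(P, n))   # approaches H(S) ~ 1.22 from above
```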
Shannon's first theorem for Markov sources
We define the first-order Markov source S, with source symbols s1,s2,…,sq and conditional symbol probabilities P(s_i / s_j). We also define S^n, the nth extension of S, with symbols σ_1, σ_2, …, σ_{q^n} and conditional symbol probabilities P(σ_i / σ_j). We refer to the first-order (unconditional) symbol probabilities of S and S^n as P_i and P(σ_i), respectively.
The process of encoding the symbols s1,s2,…,sq into an instantaneous block code is identical for the source S and its adjoint source S̄. If the length of the code word corresponding to s_i is l_i, the average length of the code is
L = Σ_{i=1}^{q} P_i l_i.
2012/10/24 Prof. Satoshi Nakamura 146
Shannon’s first theorem for Markov source
The average length is identical for S and S̄, since P_i, the first-order symbol probability of s_i, is the same for both sources. S̄ is a zero-memory source, so we have
H_r(S̄) ≤ L
This inequality may be augmented to read
H_r(S) ≤ H_r(S̄) ≤ L
and, for the extended source,
H_r(S^n) ≤ H_r(S̄^n) ≤ L_n
If we now select the l_i according to (4-10), we may bound L_n above and below as in (4-12):
H_r(S̄^n) ≤ L_n < H_r(S̄^n) + 1
Using (2-41) and dividing by n,
H_r(S) + [H_r(S̄) − H_r(S)]/n ≤ L_n/n < H_r(S) + [H_r(S̄) − H_r(S)]/n + 1/n
2012/10/24 Prof. Satoshi Nakamura 147
Coding without extensions
Shannon's theorem bounds L_n/n above and below by considering extensions of
the source. The theorem does not tell us what value of L (or L_n/n) we shall
actually obtain, and it does not guarantee that choosing the word lengths
according to (4-10) gives the smallest possible value of L (or L_n/n)
obtainable for that fixed n.
Source symbol Pi Log 1/Pi li Code A Code B
S1 2/3 0.58 1 0 0
S2 2/9 2.17 3 100 10
S3 1/9 3.17 4 1010 11
L_A = Σ_i P_i l_i = 1(2/3) + 3(2/9) + 4(1/9) = 1.78 binits/source symbol
H(S) = Σ_{i=1}^{3} P_i log(1/P_i) = 1.22 bits/source symbol
This satisfies H(S) ≤ L_A < H(S) + 1.
However, code B gives a shorter average length:
L_B = 1(2/3) + 2(2/9) + 2(1/9) = 1.33 binits/source symbol
2012/10/24 Prof. Satoshi Nakamura 148
Binary Compact Codes – Huffman Codes
A compact code for a source S is a code which has the smallest average length possible if we encode the symbols from S one at a time. We develop a method of constructing compact codes for the case of a binary code alphabet.
Consider the source S with symbols s_1, s_2, ..., s_q and symbol probabilities P_1, P_2, ..., P_q. Let the symbols be ordered so that P_1 ≥ P_2 ≥ ⋯ ≥ P_q. By regarding the last two symbols of S as combined into one symbol, we obtain a new source from S containing only q−1 symbols. We refer to this new source as a reduction of S.
The symbols of this reduction of S may be reordered, and again we may combine the two least probable symbols to form a reduction of this reduction of S. By proceeding in this manner, we construct a sequence of sources, each containing one fewer symbol than the previous one, until we arrive at a source with only two symbols.
2012/10/24 Prof. Satoshi Nakamura 149
Huffman codes
Original source and its successive reductions:
Symbols  Prob.   S1     S2     S3     S4
s1       0.4     0.4    0.4    0.4    0.6
s2       0.3     0.3    0.3    0.3    0.4
s3       0.1     0.1    0.2    0.3
s4       0.1     0.1    0.1
s5       0.06    0.1
s6       0.04
Construction of a sequence of reduced sources is the first step in the construction of a compact instantaneous code for the original source S.
The second step is merely the recognition that a binary compact instantaneous code for the last reduced source ( a source with only two symbols) is the trivial code with the two words 0 and 1.
The final step is to construct a compact instantaneous code for the source immediately preceding the reduced source in the sequence of reduced sources.
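The three steps can be turned into a short Python sketch (a standard heap-based Huffman construction written for this note, not code from the lecture); the six-symbol source of the table above is used as the input:

import heapq

def huffman_code(probs):
    # Binary Huffman code for a dict {symbol: probability}.
    # Repeatedly merges the two least probable entries (the reduction step),
    # prepending 0/1 to the code words of the merged groups.
    heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {sym: "" for sym in probs}
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)    # least probable
        p1, _, group1 = heapq.heappop(heap)    # second least probable
        for sym in group0:
            code[sym] = "0" + code[sym]
        for sym in group1:
            code[sym] = "1" + code[sym]
        heapq.heappush(heap, (p0 + p1, len(code) + len(heap), group0 + group1))
    return code

source = {"s1": 0.4, "s2": 0.3, "s3": 0.1, "s4": 0.1, "s5": 0.06, "s6": 0.04}
code = huffman_code(source)
L = sum(source[s] * len(code[s]) for s in source)
print(code, L)   # average length 2.2 binits/symbol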
2012/10/24 Prof. Satoshi Nakamura 150
Huffman codes
Huffman code for two symbols
Symbols  Prob.  Code
S1       3/4    0
S2       1/4    1
L = Σ_i P_i l_i = 1 binit/symbol
H(S) = Σ_i P_i log(1/P_i) = 0.8113 bits/symbol
2012/10/24 Prof. Satoshi Nakamura 151
Huffman codes
Each symbol of S_{j-1} other than s_{a0} and s_{a1} is assigned the code word
used by the corresponding symbol of S_j. The code words for s_{a0} and s_{a1}
are formed by adding a 0 and a 1, respectively, to the code word used for s_a.
There are other possible ways to decompose a reduced source; in this example
they occur at S3 and S1.
Synthesis of a compact code
Symbols  Prob.  Code    | S1        | S2        | S3       | S4
s1       0.4    1       | 0.4  1    | 0.4  1    | 0.4  1   | 0.6  0
s2       0.3    00      | 0.3  00   | 0.3  00   | 0.3  00  | 0.4  1
s3       0.1    011     | 0.1  011  | 0.2  010  | 0.3  01  |
s4       0.1    0100    | 0.1  0100 | 0.1  011  |          |
s5       0.06   01010   | 0.1  0101 |           |          |
s6       0.04   01011   |           |           |          |
2012/10/24 Prof. Satoshi Nakamura 152
Huffman codes
There are three choices of decomposition in S1. If we choose the first one,
we obtain a code with word lengths
1, 2, 4, 4, 4, 4.
If we choose the second or third, we obtain
1, 2, 3, 4, 5, 5.
Synthesis of compact codes (first choice, word lengths 1, 2, 4, 4, 4, 4)
Symbols  Prob.  Code   | S1        | S2        | S3       | S4
s1       0.4    1      | 0.4  1    | 0.4  1    | 0.4  1   | 0.6  0
s2       0.3    00     | 0.3  00   | 0.3  00   | 0.3  00  | 0.4  1
s3       0.1    0100   | 0.1  011  | 0.2  010  | 0.3  01  |
s4       0.1    0101   | 0.1  0100 | 0.1  011  |          |
s5       0.06   0110   | 0.1  0101 |           |          |
s6       0.04   0111   |           |           |          |
2012/10/24 Prof. Satoshi Nakamura 153
Huffman codes
The two codes have the same average code length; these are the shortest-average-length
codes that can be constructed.
L = 1(0.4) + 2(0.3) + 4(0.1) + 4(0.1) + 4(0.06) + 4(0.04) = 2.2 binits/symbol
L = 1(0.4) + 2(0.3) + 3(0.1) + 4(0.1) + 5(0.06) + 5(0.04) = 2.2 binits/symbol
H(S) = Σ_{i=1}^{6} P_i log(1/P_i) = 2.1435 bits/symbol
Synthesis of a compact code
Symbols  Prob.   Code   | S1          Code
s1       0.5     0      | 0.5         0
s2       0.25    10     | 0.25        10
s3       0.125   110    | 0.125       110
s4       0.100   1110   | 0.125       111
s5       0.025   1111   |
Compact code
L = Σ_{i=1}^{5} P_i l_i = 1.875 binits/symbol
H(S) = Σ_{i=1}^{5} P_i log(1/P_i) = 1.8402 bits/symbol
2012/10/24 Prof. Satoshi Nakamura 154
Proof of Huffman codes
Assume that we have found a compact code Cj for some reduction, say Sj, of
an original source S. Let the average length of this code be Lj.
One of the symbols of Sj, say sa, is formed from the two least probable
symbols of the preceding reduction Sj-1. Let these two symbols be sa0 and sa1,
and let their probabilities be Pa0 and Pa1, respectively.
The probability of sa is then Pa=Pa0+Pa1. Let the code for Sj-1 formed
according to rule (4-24) be called Cj-1, and let its average length be Lj-1.
Lj-1 is easily related to Lj since the words of Cj and Cj-1 are identical except
that the (two) words for sa0 and sa1 are one binit longer than the (one) word
for sa. Thus we know that
What we want to show is if Cj is compact, then Cj-1 must also be compact. In
other words, if Lj is the smallest possible average length of an instantaneous
code for Sj, then Lj-1 is the smallest possible average length for Sj-1.
L_{j-1} = L_j + P_{a0} + P_{a1}    (4-25)
2012/10/24 Prof. Satoshi Nakamura 155
Proof of Huffman codes
L_{j-1} = Σ_{i≠a} P_i l_i + P_{a0}(l_a + 1) + P_{a1}(l_a + 1)
        = Σ_i P_i l_i + P_{a0} + P_{a1}
        = L_j + P_{a0} + P_{a1}
where L_j = Σ_i P_i l_i and P_a = P_{a0} + P_{a1}.
2012/10/24 Prof. Satoshi Nakamura 156
Proof of Huffman codes
A proof by demonstrating that assuming the contrary leads to a contradiction.
Assume that we have found a compact code for S_{j-1} with average length L̃_{j-1} < L_{j-1}.
Let the words of the code be X̃_1, X̃_2, ..., X̃_{a1}, with lengths l̃_1, l̃_2, ..., l̃_{a1},
respectively. We assume that the subscripts are ordered in order of decreasing symbol
probabilities, so that l̃_1 ≤ l̃_2 ≤ ⋯ ≤ l̃_{a1}.
One of the words of this code (call it X̃_{a0}) must be identical with X̃_{a1} except in its
last digit. If this were not true, we could drop the last digit from X̃_{a1} and decrease
the average length of the code without destroying its instantaneous property.
Finally, we form C̃_j, a code for S_j, by combining X̃_{a0} and X̃_{a1} and dropping their
last binit while leaving all other words unchanged. This gives us an instantaneous
code for S_j with average length L̃_j, related by
L̃_j = L̃_{j-1} − P_{a0} − P_{a1},   where   L̃_{j-1} < L_{j-1}
2012/10/24 Prof. Satoshi Nakamura 157
Proof of Huffman codes
If we compare the last equation to (4-25), we see that our assumption L̃_{j-1} < L_{j-1}
implies that we may construct a code with average length L̃_j < L_j.
This is the contradiction we seek, since the code with average length L_j is compact.
Two properties of Huffman codes:
1. If the probabilities of the symbols of a source are ordered so that
P_1 ≥ P_2 ≥ ⋯ ≥ P_q, the lengths of the words assigned to these symbols will be ordered
so that l_1 ≤ l_2 ≤ ⋯ ≤ l_q.
2. The lengths of the last two words (in order of decreasing probability) of a compact
code are identical: l_q = l_{q-1}. If there are several symbols with probability P_q,
we may assign their subscripts so that the words assigned to the last two symbols differ
only in their last digit.
2012/10/24 Prof. Satoshi Nakamura 158
r-ary compact codes
We would like the last source in the sequence to have exactly r symbols. The last source will have r symbols if and only if the original source has r+a(r-1) symbols, where a is an integer. Therefore, if the original source doesn’t have r+a(r-1) symbols, we add “dummy symbols” with probability 0 to the source until this number is reached.
Synthesis of an r-ary compact code (r = 4). The original source has 11 symbols s1, ..., s11;
two dummy symbols s12 and s13 with probability 0 are added so that the source has
13 = 4 + 3(4−1) symbols, and the four least probable symbols are combined at each reduction.
(Table: symbols, probabilities, quaternary code words, and the successive reduced sources
S1, S2, S3.)
2012/10/24 Prof. Satoshi Nakamura 159
Code efficiency and redundancy
Shannon’s first theorem shows that there exists a common measure for any
information source. The value of a symbol from an information source S may be
measured in terms of an equivalent number of binary digits needed to represent
one symbol from that source.
Let the average length of a uniquely decodable r-ary code for the source S be L. L
cannot be less than H_r(S). Accordingly, we define the efficiency of the code by
η = H_r(S) / L
It is also possible to define the redundancy of a code:
Redundancy = 1 − η = (L − H_r(S)) / L
2012/10/24 Prof. Satoshi Nakamura 160
Example – nth extension
The average length of this code is 1 binit per symbol, so the efficiency is
H(S) = (3/4) log(4/3) + (1/4) log 4 = 0.811 bit/symbol,   η = H(S)/L = 0.811
Huffman code for two symbols
Symbols  Prob.  Code
S1       3/4    0
S2       1/4    1
To improve the efficiency, we might code S², the second extension of S:
Symbols  Prob.   Code
S1       9/16    0
S2       3/16    10
S3       3/16    110
S4       1/16    111
Here L_2/2 = 27/32 = 0.844, so η_2 = 0.811/0.844 = 0.961.
Extending to higher orders, η_3 = 0.985 and η_4 = 0.991.
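A small Python check of these efficiencies (a heap-based Huffman routine written for this note; it codes the nth extension of S = {3/4, 1/4} and prints L_n/n and the efficiency):

import heapq
from itertools import product
from math import log2, prod

def huffman_lengths(probs):
    # Huffman code-word lengths for a list of probabilities
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p0, g0 = heapq.heappop(heap)
        p1, g1 = heapq.heappop(heap)
        for i in g0 + g1:
            lengths[i] += 1          # one more binit for every merged symbol
        heapq.heappush(heap, (p0 + p1, g0 + g1))
    return lengths

p = [3/4, 1/4]
H = sum(q * log2(1/q) for q in p)                  # 0.8113 bit/symbol
for n in (1, 2, 3, 4):
    ext = [prod(t) for t in product(p, repeat=n)]  # probabilities of S^n
    Ln = sum(q * l for q, l in zip(ext, huffman_lengths(ext)))
    print(n, Ln / n, H / (Ln / n))                 # L_n/n and efficiency eta_n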
2012/10/24 Prof. Satoshi Nakamura 161
Example – nth extension
2012/10/24 Prof. Satoshi Nakamura 162
Compact codes: Elias codes
Elias code:
The Elias code is a non-block compact code, in contrast to the Huffman codes, which are
block codes. It is also called an arithmetic code.
The Elias code assigns a sequence of source symbols to a fractional number, obtained by
repeatedly dividing the number line according to the symbol probabilities.
For example, 011 is a point of the region [0.375, 0.50); the initial symbol is A.
0110 is a point of the region [0.375, 0.4375); the source symbols are AAB.
(Figure: binary (2-ary) Elias coding of the number line.)
2012/10/24 Prof. Satoshi Nakamura 163
Elias code
With Huffman codes it is necessary to code extensions of the source in order to
improve the efficiency; if the block size is large, this becomes difficult.
Also, with Huffman codes each code-word length must be an integer.
The Elias code assigns one code to a whole sequence of source symbols. It is not
necessary to compute all the probabilities of the nth extension of the source, and
the code can be decoded iteratively.
2012/10/24 Prof. Satoshi Nakamura 164
Elias code
Procedure:
Suppose we have binary source symbols s0 and s1 with probabilities P0 and P1.
Divide the region [0, 1) of the number line according to P0 : P1 into regions
A0 and A1, where A0 = [0, P0) and A1 = [P0, 1).
If the first source symbol S0 is s0, choose region A0; otherwise choose region A1.
If S0 = s0 and region A0 was selected, divide A0 according to P0 : P1 to obtain
regions A00 and A01. If the next symbol S1 = s0, choose region A00; otherwise
choose A01.
Iterate this procedure until the end of the source symbol sequence, and represent
the chosen region by a fractional number, the lower end point of the region.
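A minimal sketch of this procedure in Python (written for this note; floating-point arithmetic is assumed to be precise enough for short sequences, whereas a practical coder works with integer registers, as in the L-R code below):

def elias_encode(symbols, p0):
    # symbols: list of 0/1; p0 = P(s0).  Returns the final region [low, high).
    low, width = 0.0, 1.0
    for s in symbols:
        if s == 0:
            width = width * p0          # keep the lower sub-region
        else:
            low = low + width * p0      # keep the upper sub-region
            width = width * (1.0 - p0)
    return low, low + width

low, high = elias_encode([0, 1, 0, 0, 1, 1], p0=0.7)
print(low, high)   # any binary fraction in [low, high) identifies the sequence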
2012/10/24 Prof. Satoshi Nakamura 165
Elias code
Average code length: if the symbol sequence S^N contains N0 occurrences of s0 and
N1 occurrences of s1, the size of the region for the sequence becomes
P0^{N0} P1^{N1}
The resolution necessary to represent a point of this region as a binary fraction is
−N0 log2 P0 − N1 log2 P1  bits.
If we take a longer source symbol length N, then N0/N → P0 and N1/N → P1, so the
average code length per symbol approaches the entropy as N grows.
2012/10/24 Prof. Satoshi Nakamura 166
Elias code
Source symbol
This figure depicts the process where
the source symbol sequence 010011..
is encoded by the Elias code.
First a region [0,1) is divided into A0
A1 according to P0:P1.
A0 is chosen since a first symbol is 0.
In this way the subregion is divided
and chosen.
2012/10/24 Prof. Satoshi Nakamura 167
L-R arithmetic codes
Problems of the Elias code:
One multiplication by a probability is needed for each coded source symbol.
The required precision of the calculation increases with N.
The code cannot be output until the last source symbol has been received.
L-R arithmetic code: one approach to solving these problems for a binary code.
Approximate the inferior (less probable) symbol probability by 2^{-Q}.
Assign the lower value U of the region [U, V) to the symbol sequence.
Prevent carry propagation (bit reversal) into bits already output by bit-stuffing.
Average code length:
The average code length of the L-R code is given by
L = Q P1 − P0 log2(1 − 2^{-Q})
The coding efficiency becomes 1 if the inferior symbol probability is exactly 2^{-Q}.
2012/10/24 Prof. Satoshi Nakamura 168
L-R arithmetic code
Coding algorithm:
Initialization:
Prepare a register C and a register A of V bits.
C = 000...0 is the initial code and A = 111...1 is the initial value of the region.
Coding of source symbol Xi:
Divide register A into A0 and A1 according to the probabilities P0 : P1 of the superior
symbol "0" and the inferior symbol "1":
A1 = A × P1    (1)
A0 = A × P0    (2)
with P1 = 2^{-Q} (Q: integer, called the SKEW). (1) is calculated by a right shift of
Q bits, and (2) by A0 = A − A1.
2012/10/24 Prof. Satoshi Nakamura 169
L-R arithmetic code
The code is updated as follows:
If Xi = 0, C stays the same.
If Xi = 1, C ← C + A0.
Then update the region:
If Xi = 0, A ← A0.
If Xi = 1, A ← A1.
Here C represents the lower bound of the chosen region.
2012/10/24 Prof. Satoshi Nakamura 170
L-R arithmetic code
Decoding algorithm:
Initialization:
C is copied from the received code; A is set to the same initial value used by the
coding algorithm.
Decoding:
Each time a symbol is decoded, divide the region A:
A0 = A × P0,   A1 = A × P1
If C − A0 is negative, keep C as it is and output source symbol 0.
If C − A0 is non-negative, set C ← C − A0 and output source symbol 1.
Then update A: if Xi = 0, A ← A0; if Xi = 1, A ← A1.
2012/10/24 Prof. Satoshi Nakamura 171
Another advantage of L-R arithmetic code
We can change the inferior symbol probability (the SKEW) according to changes
in the symbol statistics. As long as the decoder uses the same SKEW, it can
decode in the same way.
2012/10/24 Prof. Satoshi Nakamura 172
Coding example by L-R code
Sym. A A0 A1 Code Output C
0 1111 1100 0011 0000
1 1100 1001 0011 1001
ren. 0011 Shift 2 bit 10 01
0 1100 1001 0011 10 0100
0 1001 0111 0010 10 0100
ren. 0111 Shift 1 bit 100 100
1 1110 1011 0011 101 0011
ren. 0011 Shift 2 bit 10100 11
1 1100 1001 0011 10101 0101
ren. 0011 Shift 2 bit 1010101 01
0 1100 1001 0011 1010101 0100
0 1001 0111 0010 1010101 0100
ren. 0111 Shift 1 bit 10101010 100
0 1110 1011 0011 10101010 1000
1 1011 1001 0010 10101011 0001
Code string = 101010110001
2012/10/24 Prof. Satoshi Nakamura 173
Decoding example of L-R code
A A0 A1 C Code String Sym.
1111 1100 0011 1010 10110001 0
1100 1001 0011 0001 10110001 1
0011 Shift 2 bit 0110 110001 ren.
1100 1001 0011 0110 110001 0
1001 0111 0010 0110 110001 0
0111 Shift 1 bit 1101 10001 ren.
1110 1011 0011 0010 10001 1
0011 Shift 2 bit 1010 001 ren.
1100 1001 0011 0001 001 1
0011 Shift 2 bit 0100 1 ren.
1100 1001 0011 0100 1 0
1001 0111 0010 0100 1 0
0111 Shift 1 bit 1001 ren.
1110 1011 0011 1001 0
1011 1001 0010 0000 1
Decoded symbol string=0100110001
2012/10/24 Prof. Satoshi Nakamura 174
Bit-stuffing- L-R code
(Figure) (a) Without bit-stuffing: a carry from the register update can propagate into
bits of the code that have already been output, reversing them. (b) With bit-stuffing:
a "0" bit is inserted into the output so that a later carry is absorbed and the
already-output bits are not affected.
2012/10/24 Prof. Satoshi Nakamura 175
Coding efficiency of L-R code
(Figure: coding efficiency of the L-R code as a function of the probability of the
inferior symbol.)
2012/10/24 Prof. Satoshi Nakamura 176
Universal code
What is a universal code?
A coding method that can compress source symbols belonging to a fixed class
optimally or very efficiently.
A coding algorithm that is independent of the a priori probabilities of the source
symbols, or a coding algorithm for source symbols whose probabilities vary.
Three coding algorithms:
Adaptive Huffman code
Context Modeling
Dictionary code
2012/10/24 Prof. Satoshi Nakamura 177
Adaptive Huffman code
Adaptive Huffman code (1)
Algorithm:
Every time we receive N source symbols (one block), update the probability table of
the source symbols, re-synthesize the Huffman code, and send it to the decoder.
Problems:
As the block size N grows, the relative overhead of the probability table becomes
small; however, no code can be sent until N source symbols have been received, and
it is very inefficient to re-synthesize the Huffman code every N symbols.
2012/10/24 Prof. Satoshi Nakamura 178
Adaptive Huffman code
Adaptive Huffman code (2) Algorithm:
Code each source symbol and send its code using the Huffman code designed from the symbol probabilities estimated so far.
Let the estimated probability of source symbol a_i after N−1 symbols be
P_{N-1}(a_i) = n_{N-1}(a_i) / (N − 1)
where n_{N-1}(a_i) is the number of times a_i has occurred. If the source symbol at time N is a, the code for that symbol is taken from the Huffman code based on these probabilities, and the estimates are then updated:
P_N(a) = n_N(a)/N = [(N − 1) P_{N-1}(a) + 1] / N
P_N(a_i) = n_N(a_i)/N = (N − 1) P_{N-1}(a_i) / N    (for a_i ≠ a)
This algorithm does not need to send a probability table, since the decoder can update the same table simultaneously.
Problems: in the worst case an update of the Huffman code is necessary for every symbol, and higher numerical precision is needed as N grows.
2012/10/24 Prof. Satoshi Nakamura 179
Adaptive Huffman code
Adaptive Huffman code (3)
Algorithm:
Update the Huffman code only when the Huffman code tree changes. The timing of the
update is determined not from the true symbol probabilities but from the following
approximate probabilities (the normalization is not actually applied in practice):
P(a_i) = w_i / Σ_{k=0}^{M-1} w_k
Initialization:
Synthesize the Huffman code and its tree according to the a priori source symbol
probabilities, and assign a weight w_i to each symbol according to its a priori
probability.
Update:
Increment w_i by one each time the corresponding source symbol is received.
If this violates the Huffman code property, re-synthesize the affected part of the
Huffman code tree until the property is satisfied again.
Huffman code property:
The nodes of the Huffman code tree form a list ordered by probability (weight).
This property is also called the "sibling property".
2012/10/24 Prof. Satoshi Nakamura 180
Adaptive Huffman code
Increment w_i when we receive a symbol from S = {s1, s2, s3, s4}. If the sibling
property no longer holds, re-synthesize the affected part of the Huffman code tree.
Probabilities P_i and codes (original and reduced sources):
s1  0.5 → 0      0.5 → 0     0.5 → 0
s2  0.3 → 10     0.3 → 10    0.5 → 1
s3  0.1 → 110    0.2 → 11
s4  0.1 → 111
Weights w_i and codes:
s1  50 → 0       50 → 0      50 → 0
s2  30 → 10      30 → 10     50 → 1
s3  10 → 110     20 → 11
s4  10 → 111
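A naive Python sketch of the count-based adaptive scheme (written for this note; it simply rebuilds the Huffman code from the running counts after every symbol, whereas the sibling-property update above avoids this full rebuild):

import heapq

def huffman_from_counts(counts):
    # Huffman code table built from the current symbol counts.
    heap = [(c, [s]) for s, c in counts.items()]
    heapq.heapify(heap)
    code = {s: "" for s in counts}
    while len(heap) > 1:
        c0, g0 = heapq.heappop(heap)
        c1, g1 = heapq.heappop(heap)
        for s in g0:
            code[s] = "0" + code[s]
        for s in g1:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (c0 + c1, g0 + g1))
    return code

def adaptive_encode(sequence, counts):
    # Encoder and decoder keep identical counts, so no table is transmitted.
    out = []
    for s in sequence:
        out.append(huffman_from_counts(counts)[s])  # code from the counts so far
        counts[s] += 1                              # then update the model
    return "".join(out)

print(adaptive_encode(["s1", "s2", "s1", "s4"], {"s1": 50, "s2": 30, "s3": 10, "s4": 10}))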
2012/10/24 Prof. Satoshi Nakamura 181
Adaptive Huffman codes
Symbol S1 S2 S3 S4
Frequency 10 7 5 3
If we receive S4 ten times in a row, how does the Huffman tree change?
Symbol Freq. Code Freq. Code Freq. Code
S1 10 1 10 1 16 0
S2 7 01 9 00 10 1
S3 5 000 7 01
S4 4 001
#S4 +1
Symbol Freq. Code Freq. Code Freq. Code
S1 10 1 10 1 17 0
S2 7 01 10 00 10 1
S3 5 000 7 01
S4 5 001
#S4 +2
2012/10/24 Prof. Satoshi Nakamura 182
Adaptive Huffman codes
Symbol Freq. Code Freq. Code Freq. Code
S1 10 00 11 1 17 0
S2 7 01 10 00 11 1
S4 6 10 7 01
S3 5 11
#S4 +3
Symbol Freq. Code Freq. Code Freq. Code
S1 10 00 12 1 17 0
S2 7 01 10 00 10 1
S4 7 10 7 01
S3 5 11
#S4 +4
2012/10/24 Prof. Satoshi Nakamura 183
Adaptive Huffman codes
Symbol Freq. Code Freq. Code Freq. Code
S1 10 00 12 1 18 0
S4 8 01 10 00 12 1
S2 7 10 8 01
S3 5 11
#S4 +5
Symbol Freq. Code Freq. Code Freq. Code
S1 10 00 12 1 19 0
S4 9 01 10 00 12 1
S2 7 10 9 01
S3 5 11
#S4 +6
2012/10/24 Prof. Satoshi Nakamura 184
Adaptive Huffman codes
Symbol Freq. Code Freq. Code Freq. Code
S1 10 00 12 1 20 0
S4 10 01 10 00 12 1
S2 7 10 10 01
S3 5 11
#S4 +7
Symbol Freq. Code Freq. Code Freq. Code
S4 11 00 12 1 21 0
S1 10 01 11 00 12 1
S2 7 10 10 01
S3 5 11
#S4 +8
2012/10/24 Prof. Satoshi Nakamura 185
Adaptive Huffman codes
Symbol Freq. Code Freq. Code Freq. Code
S4 12 1 12 1 22 0
S1 10 01 12 00 12 1
S2 7 000 10 01
S3 5 001
#S4 +9
symbol S4 S4 S4 S4 S4 S4 S4 S4 S4 S4
Code 001 001 001 001 10 10 01 01 00 1
2012/10/24 Prof. Satoshi Nakamura 186
Dictionary code
Lempel-Ziv coding:
A coding algorithm that uses a dictionary (code table) containing source symbol
sequences that have appeared so far.
It does not require a priori probability distributions of the source symbols.
It is a non-block code, like the arithmetic code.
It is asymptotically a compact code, like the arithmetic code.
In this method, coding from a source symbol sequence to a code sequence is obtained
by the following procedure:
1. Retrieval: look for the source symbol sequence in the dictionary.
2. Coding: code the source symbol sequence into a code sequence according to its
position in the dictionary.
3. Update: update the dictionary (on the decoding side as well).
2012/10/24 Prof. Satoshi Nakamura 187
LZ77 algorithm
Set the empty sequence into the reference buffer, and set the source symbol sequence
into the coding buffer.
Find the longest sub-sequence of the coding buffer that also occurs in the reference
buffer. Let U be the sub-sequence starting at the left-most position of the coding
buffer, let U' be the matching sub-sequence in the reference buffer, let u be the
symbol following U, let P be the starting-address pointer of U', and let l be the
length of U.
Encode this part of the source symbol sequence as (P, l, u).
Shift both buffers left by l + 1 symbols and repeat until no source symbols remain.
(Figure: reference buffer | coding buffer.)
2012/10/24 Prof. Satoshi Nakamura 188
LZ77 algorithm
source symbol sequence “abcabcdef”
source symbol code
a a
b b
c c
a (-3,3,d)
b
c
d
e e
f f
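A toy LZ77 encoder in Python that reproduces this example (written for this note; single unmatched symbols are sent as they are, as in the table above):

def lz77_encode(data, window=16):
    # Emit (P, l, u): P is the negative offset into the reference buffer,
    # l the match length, u the first non-matching symbol.
    out, pos = [], 0
    while pos < len(data):
        best_off, best_len = 0, 0
        for off in range(max(0, pos - window), pos):   # candidate match starts
            l = 0
            while pos + l < len(data) - 1 and data[off + l] == data[pos + l]:
                l += 1                                 # match may run into the coding buffer
            if l > best_len:
                best_off, best_len = off - pos, l
        if best_len > 0:
            out.append((best_off, best_len, data[pos + best_len]))
            pos += best_len + 1
        else:
            out.append(data[pos])
            pos += 1
    return out

print(lz77_encode("abcabcdef"))   # ['a', 'b', 'c', (-3, 3, 'd'), 'e', 'f']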
2012/10/24 Prof. Satoshi Nakamura 189
LZ77 algorithm
Properties of the LZ77 algorithm:
LZ77 approaches a compact code as the buffer lengths L and Ls become large.
Sending u as the first mismatched symbol is inefficient.
If l is very short, the code is longer than the original source symbols; in that
case we simply send the original source symbol sequence.
A fixed length of U can be used.
The address relative to the left-most side of the coding buffer, or the "recency
rank" (the number of different source symbols seen since the previous occurrence),
can be used instead of the relative address.
2012/10/24 Prof. Satoshi Nakamura 190
LZ78 algorithm
LZ78 algorithm:
Universal coding based on "incremental parsing".
Let the source symbol sequence be u = u_1, u_2, ..., u_T.
Incremental parsing decomposes u into the partial sequences U_0, U_1, ..., U_{t-1}
that satisfy the following:
U_0 is the empty sequence.
U_1, ..., U_{t-1} are all different from each other, except possibly the last one.
If we remove the last symbol u_m from U_m (1 ≤ m ≤ t−1), the result equals some
earlier phrase U_s (0 ≤ s < m).
2012/10/24 Prof. Satoshi Nakamura 191
Example
Each U_m satisfies the three properties, and U_m = U_s u_m. We can therefore code
the source symbol sequence into the pairs (s, u_m), where s (0 ≤ s < m) is the
dictionary index of U_s and u_m is the last symbol of U_m.
For example, the source sequence below is decomposed by incremental parsing into
the phrases 0, 1, 10, 01, 100, 101, 1000, 010, 011, each of which is an earlier
phrase extended by one symbol.
2012/10/24 Prof. Satoshi Nakamura 192
Example
Time In Out(s,um) Add to
Table Index
0 0 (-, 0) 0 0
1 1 (-, 1) 1 1
2 10 (1, 0) 10 2
3 01 (0, 1) 01 3
4 100 (2, 0) 100 4
5 101 (2, 1) 101 5
6 1000 (4, 0) 1000 6
7 010 (3, 0) 010 7
8 011 (3, 1) 011 8
Time In Out Add to Table Index
0 0 0 0 0
1 1 1 1 1
2 (1, 0) 10 10 2
3 (0, 1) 01 01 3
4 (2, 0) 100 100 4
5 (2, 1) 101 101 5
6 (4, 0) 1000 1000 6
7 (3, 0) 010 010 7
8 (3, 1) 011 011 8
Encoder Decoder
Source symbol sequence: 0 1 10 01 100 101 1000 010 011
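A short Python sketch of incremental parsing (written for this note; the dictionary here starts with the empty phrase at index 0, so the indices differ slightly from the table above, which pre-loads "0" and "1"):

def lz78_encode(bits):
    # Each phrase U_m = U_s + u is sent as the pair (s, u).
    table = {"": 0}
    out, phrase = [], ""
    for u in bits:
        if phrase + u in table:          # keep extending the current phrase
            phrase += u
        else:
            out.append((table[phrase], u))
            table[phrase + u] = len(table)
            phrase = ""
    if phrase:                           # trailing phrase may repeat an entry
        out.append((table[phrase[:-1]], phrase[-1]))
    return out

print(lz78_encode("0110011001011000010011"))
# parses the sequence into 0, 1, 10, 01, 100, 101, 1000, 010, 011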
2012/10/24 Prof. Satoshi Nakamura 193
Example
Time In Out(s) Add to
Table Index
0 01 0 01 2
1 11 1 11 3
2 10 1 10 4
3 00 0 00 5
4 011 2 011 6
5 100 4 100 7
6 010 2 010 8
7 011 2 011 9
Encoder
Time In Out Add to
Table Index
0 0 0(? 1 ) ? 01 2
1 1 1 (? 1 ) ? 11 3
2 1 1 (? 0) ? 10 4
3 0 0(? 0) ? 00 5
4 2 01(? 1 ) ? 011 6
5 4 10(? 0) ? 100 7
6 2 01(? 0) ? 010 8
7 2 01(? ) ? 9
Decoder
Input String Index
0 0
1 1
Initial code table
Move pointer to the position of next decomposed code -1.
2012/10/24 Prof. Satoshi Nakamura 194
Example
Time Send New Entry Index
0 1 (for 0) (a, 0) 3
1 0 (for 0) (0, 0) 4
2 4 (for 00) (0,0,b) 5
3 2 (for b) (b,0) 6
4 3 (for (a,0)) (a,0,a) 7
Ternary Encoder
In Out Reconstructed Sequence Add to Table
1 a (a)a,?
0 0 (a)a,0 (0)0,? (a, 0) (as 3)
4 ? (a)a,0 (0)0,? ?
Ternary Decoder
Input String Index
0 0
a 1
b 2
Initial code table
2012/10/24 Prof. Satoshi Nakamura 195
LZ78
Problems of LZ78
Coding by (s, u_m) is inefficient, since u_m is sent uncompressed. The solution is
to send only the index (s); this method (LZW) is used in the Unix compress command.
Incremental parsing stores every symbol sub-sequence in the dictionary and assigns
it an address, so the dictionary may overflow memory. In such a case we delete the
LRU (least recently used) entry from the dictionary, using a self-organizing list.
2012/10/24 Prof. Satoshi Nakamura 196
Other code
Run-length code:
abbbbbbbab: a(b,7)ab
2012/10/24 Prof. Satoshi Nakamura 197
Rate Distortion
Coding with distortion:
The average code length per source symbol can be reduced if we allow coding
distortion. Here, the distortion includes redundancy and errors which prevent
unique decodability.
Distortion measure:
Let x be a source information symbol and let y be the decoded output of the code.
The distance d(x, y) between x and y is called the distortion measure. We evaluate
the source coding efficiency by the average distortion
d̄ = Σ_x Σ_y d(x, y) p(x, y)
where p(x, y) is the joint probability distribution of the source symbol variable X
and the coded symbol variable Y.
2012/10/24 Prof. Satoshi Nakamura 198
Rate distortion
Mutual information:
For a channel without any distortion we can recover the source symbol x exactly from
the decoded output y, and the average amount of information is H(X). If there is
distortion, the average amount of information is
I(X; Y) = H(X) − H(X|Y)
Therefore the lower bound of the average code length is the mutual information I(X; Y).
The distortion can differ while the mutual information stays the same. We therefore
look for codes whose average distortion satisfies d̄ ≤ D, and under this condition we
look for codes that minimize I(X; Y):
R(D) = min_{d̄ ≤ D} I(X; Y)
This R(D) is called the rate-distortion function of the information source.
2012/10/24 Prof. Satoshi Nakamura 199
Rate distortion
Definition:
Under the condition that the average distortion is at most D, there exist codes whose
average code length per source symbol satisfies
R(D) ≤ L < R(D) + ε
for arbitrarily small ε > 0 (for sufficiently long blocks). But there is no code with
an average code length smaller than R(D).
2012/10/24 Prof. Satoshi Nakamura 203
Rate distortion
Derivation of the R(D) function: the mutual information I(X; Y) is written as follows,
given P_X(x) and the conditional probabilities P(y|x):
I(X; Y) = Σ_x Σ_y P(x) P(y|x) log [ P(y|x) / P(y) ]
where
P(y) = Σ_x P(x) P(y|x)
Next, the average distortion is written as
d̄ = Σ_x Σ_y P(x) P(y|x) d(x, y) ≤ D
and the probability constraints require
P(y|x) ≥ 0,   Σ_y P(y|x) = 1
What we need is to minimize I(X; Y) under the above constraints, by the Lagrangean method.
2012/10/24 Prof. Satoshi Nakamura 204
Rate distortion
Source coding with distortion:
Suppose we choose symbol sequences of length n,
x_i = (x_{i1}, x_{i2}, ..., x_{in}),   i = 1, 2, ..., k^n
from an information source S with k symbols. Now we choose m code words
C_D = { w_j = (w_{j1}, w_{j2}, ..., w_{jn}),  j = 1, 2, ..., m }
that give the minimum average distortion. Here the average distortion is
d̄_n = Σ_{i=1}^{k^n} d_n(x_i, w_{j(i)}) p(x_i, w_{j(i)})
where w_{j(i)} is the code word that minimizes { d_n(x_i, w_j); j = 1, 2, ..., m },
that is,
j(i) = argmin_j d_n(x_i, w_j)
(Figure: the information source emits x, which is coded to the code word w in C_D with
minimum d_n(x, w), followed by distortion-less source coding of w.)
2012/10/24 Prof. Satoshi Nakamura 205
Rate distortion
Then apply distortion-less source coding to the chosen code words. This method
achieves average distortion d̄_n per source symbol.
Decoding:
Decoding is obtained by finding the code word w_j that minimizes the distortion
to x_i.
Maximum-likelihood decoding:
Find x_m that satisfies
p(W_j | x_m) ≥ p(W_j | x_{m'})
for all m' except m.
2012/10/24 Prof. Satoshi Nakamura 206
Rate distortion
Maximum a posteriori probability decoding:
Find x_m which maximizes
p(x_m | w_j) = p(x_m) p(w_j | x_m) / p(w_j)
However, the a priori probability P(x_m) needs to be given. This method is
equivalent to maximizing the mutual information
I(x_m; w_j) = E[ log p(w_j | x_m) / p(w_j) ]
2012/10/24 Prof. Satoshi Nakamura 207
Binary source
Suppose we have a binary information source over {0, 1} with probabilities p and
1 − p, and let the bit error rate be the distortion measure:
d(x, y) = 0 if x = y,  1 if x ≠ y
This source coding problem can be viewed as a test transmission channel problem in
which the following mutual information is minimized under the distortion constraint d̄:
I(X; Y) = H(X) − H(X|Y)
(Test transmission channel: the source X with P_X(1) = p is added (XOR) to an error
source E with P_E(1) = d̄, giving Y = X ⊕ E.)
2012/10/24 Prof. Satoshi Nakamura 208
Rate distortion
Y can be thought of as a symbol obtained by adding an error symbol E to the source
symbol X with probability d̄. Since the addition is XOR, Y = X ⊕ E is equivalent to
X = Y ⊕ E, and therefore
H(X|Y) = H(E ⊕ Y | Y) = H(E | Y) ≤ H(E)
Furthermore, let H̃(p) denote the entropy function of a zero-memory binary source,
H̃(p) = −p log p − (1 − p) log(1 − p)
If the error source is a zero-memory source, H(E) = H̃(d̄) holds, and even if the
error source has memory, H(E) ≤ H̃(d̄) holds. Therefore
H(X|Y) ≤ H̃(d̄)
2012/10/24 Prof. Satoshi Nakamura 209
Rate distortion
If 0 ≤ D ≤ 0.5, the entropy function is monotone increasing over this range, so
d̄ ≤ D implies
H̃(d̄) ≤ H̃(D)
Then
I(X; Y) = H(X) − H(X|Y) ≥ H̃(p) − H̃(d̄) ≥ H̃(p) − H̃(D)
Finally, we have the rate-distortion function
R(D) = H̃(p) − H̃(D)
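A short numerical check of this rate-distortion function (Python, written for this note):

from math import log2

def Hb(p):   # binary entropy function H~(p)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def R_binary(p, D):
    # R(D) = H~(p) - H~(D) for 0 <= D <= min(p, 1-p); zero beyond that
    return max(Hb(p) - Hb(D), 0.0)

for D in (0.0, 0.05, 0.1, 0.25, 0.5):
    print(D, R_binary(0.5, D))   # 1.0, 0.714, 0.531, 0.189, 0.0 bits/symbol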
2012/10/24 Prof. Satoshi Nakamura 210
Rate distortion
RD function for a binary information source
2012/10/24 Prof. Satoshi Nakamura 211
Source coding of analog information
Analog source coding:
Here we treat analog source information that takes continuous values rather than
discrete symbols (e.g. speech, images, sensory inputs).
Analog signal → Sampling → Quantization
2012/10/24 Prof. Satoshi Nakamura 212
Sampling
If the frequency band is limited to 0-W[Hz], the function f(t) can be
written by,
X_k = f(k/2W),   k = 0, ±1, ±2, ...
f(t) = Σ_k X_k · sin π(2Wt − k) / [π(2Wt − k)]
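A small numerical illustration of this reconstruction formula (Python/NumPy, written for this note; the test signal and the bandwidth are arbitrary choices, and the ideal infinite sum is truncated):

import numpy as np

W  = 4.0                          # bandwidth [Hz]
fs = 2 * W                        # sampling rate 2W
k  = np.arange(-40, 41)           # finitely many samples X_k = f(k/2W)
f  = lambda t: np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.cos(2 * np.pi * 3.0 * t)
Xk = f(k / fs)

def reconstruct(t):
    # f(t) = sum_k X_k sinc(2W t - k);  np.sinc(x) = sin(pi x)/(pi x)
    return np.sum(Xk * np.sinc(fs * t - k))

t0 = 0.123
print(f(t0), reconstruct(t0))     # nearly equal away from the truncation edges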
2012/10/24 Prof. Satoshi Nakamura 213
Sampling
Let the spectrum of f(t) be F(w):
F(w) = ∫ f(t) e^{−iwt} dt    (1)
If F(w) is band limited to −2πW ≤ w ≤ 2πW, it can be expanded in a Fourier series:
F(w) = Σ_k a_k e^{−ikw/2W}    (2)
where
a_k = (1/4πW) ∫_{−2πW}^{2πW} F(w) e^{ikw/2W} dw    (3)
From (1),
f(t) = (1/2π) ∫ F(w) e^{iwt} dw = (1/2π) ∫_{−2πW}^{2πW} F(w) e^{iwt} dw    (4)
2012/10/24 Prof. Satoshi Nakamura 214
Sampling
Now we set t = k/2W in (4):
f(k/2W) = (1/2π) ∫_{−2πW}^{2πW} F(w) e^{ikw/2W} dw    (5)
Comparing (3) and (5),
a_k = (1/2W) f(k/2W)    (6)
Therefore
F(w) = (1/2W) Σ_k f(k/2W) e^{−ikw/2W}    (7)
Substituting (7) into (4),
f(t) = (1/4πW) Σ_k f(k/2W) ∫_{−2πW}^{2πW} e^{iw(t − k/2W)} dw
     = Σ_k f(k/2W) · sin π(2Wt − k) / [π(2Wt − k)]    (8)
2012/10/24 Prof. Satoshi Nakamura 215
Entropy of analog source
The entropy of a digital source is defined as
H = −Σ_i p_i log p_i
How can we define entropy for a stochastic variable x that takes continuous values?
Divide the range of x into small intervals of width Δx. The probability that x takes a
value between x_i and x_i + Δx can be approximated by p(x_i)Δx; the smaller Δx is, the
better the approximation. Then
H' = lim_{Δx→0} [ −Σ_i p(x_i)Δx log(p(x_i)Δx) ]
   = lim_{Δx→0} [ −Σ_i p(x_i)Δx log p(x_i) ] − lim_{Δx→0} log Δx
   = −∫ p(x) log p(x) dx − lim_{Δx→0} log Δx
2012/10/24 Prof. Satoshi Nakamura 216
Entropy of analog source
The second term goes to infinity. We use this entropy only to compare various analog
sources, so we define the entropy of an analog source by the first term alone:
H = −∫ p(x) log p(x) dx
Unit entropy:
If the analog source has n stochastic variables x_1, x_2, ..., x_n, we define the
entropy by
H_n = −∫...∫ p(x_1, ..., x_n) log p(x_1, ..., x_n) dx_1 ... dx_n
The entropy per variable is
H = lim_{n→∞} (1/n) H_n
This is called the unit entropy. The entropy normalized by the time T is called the
entropy per second, H'. Since n = 2TW, the relationship H' = 2WH holds.
2012/10/24 Prof. Satoshi Nakamura 217
Conditional Entropy
Definition:
H(Y|X) = ∫ p(x) H(Y|X = x) dx = −∫∫ p(x, y) log [ p(x, y) / p(x) ] dx dy
H(X|Y) = ∫ p(y) H(X|Y = y) dy = −∫∫ p(x, y) log [ p(x, y) / p(y) ] dx dy
where p(x) = ∫ p(x, y) dy and p(y) = ∫ p(x, y) dx are the marginal probability
distributions.
The following relationship holds, as for a digital information source:
H(X, Y) ≤ H(X) + H(Y)
with equality if and only if p(x, y) = p(x) p(y).
2012/10/24 Prof. Satoshi Nakamura 218
Entropy of Gaussian distribution
The probability density of the Gaussian (normal) distribution is
p(x) = (1/√(2πσ²)) exp(−x²/2σ²)
Its entropy is
H(X) = −∫ p(x) log p(x) dx = log √(2πσ²) + (1/2) log e = log √(2πeσ²)
2012/10/24 Prof. Satoshi Nakamura 219
Entropy of analog source
Gaussian process:
[Definition] Let probability distribution of variables Xt1, Xt2, …, Xtn at
time t1,t2,…,tn be P(Xt1,Xt2,…,Xtn). If P is subject to multi-dimensional
Gaussian distribution, we call this process as a Gaussian process. If this
process is subject to stationary Markov process, we call it a stationary
Markov process. If a power spectrum density n(w) of Gaussian process
has a constant value regardless to frequencies, we call it a white Gaussian
noise or process.
If a white Gaussian noise is band limited to the frequency range W, its power spectrum
density is
n(w) = N for |w| ≤ 2πW,   n(w) = 0 for |w| > 2πW
Furthermore, if this white Gaussian noise is limited to a time period T, the process is
determined by samples taken every 1/2W seconds, x_1, x_2, ..., x_{2TW}. Letting the
power at each sample be σ², the entropy at each sample point is
H = log √(2πeσ²)
2012/10/24 Prof. Satoshi Nakamura 220
Entropy of analog source
Therefore the entropy of all 2TW samples is
H_total = 2TW log √(2πeσ²)
2012/10/24 Prof. Satoshi Nakamura 221
Maximum Entropy
Distribution function with maximum Entropy:
Find the probability distribution function with maximum entropy under given conditions.
Suppose we have the following constraints:
∫_a^b φ_1(x, p(x)) dx = k_1
∫_a^b φ_2(x, p(x)) dx = k_2
...
∫_a^b φ_n(x, p(x)) dx = k_n
We find the p(x) that maximizes the objective function
I = ∫_a^b F(x, p(x)) dx
by the Lagrangean method.
2012/10/24 Prof. Satoshi Nakamura 222
Maximum Entropy
[Case: the average power of x is given]
Let the average power be σ². We maximize
H(X) = −∫ p(x) log p(x) dx
subject to
∫ x² p(x) dx = σ²,   ∫ p(x) dx = 1
The p(x) that maximizes H(X) is
p(x) = (1/√(2πσ²)) exp(−x²/2σ²)
and the entropy with this p(x) is
H(X) = −∫ p(x) log p(x) dx = log √(2πeσ²)
2012/10/24 Prof. Satoshi Nakamura 223
Maximum Entropy
Maximum entropy theorem:
The probability distribution function with average power σ² that has the maximum
entropy is the Gaussian distribution
p(x) = (1/√(2πσ²)) exp(−x²/2σ²)
The entropy of the Gaussian distribution is
H(X) = −∫ p(x) log p(x) dx = log √(2πeσ²)
2012/10/24 Prof. Satoshi Nakamura 224
Mutual information
Let the joint probability density be p(x, y). Divide the range of x into intervals Δx
and the range of y into intervals Δy. Then p(x_i)Δx, p(y_i)Δy and p(x_i, y_i)ΔxΔy are
the probabilities that x takes a value between x and x + Δx, that y takes a value
between y and y + Δy, and that x and y jointly take values in that region,
respectively. The mutual information is given by
I(x_i; y_i) = log [ p(x_i, y_i)ΔxΔy / (p(x_i)Δx · p(y_i)Δy) ]
            = log [ p(x_i, y_i) / (p(x_i) p(y_i)) ]
2012/10/24 Prof. Satoshi Nakamura 225
Mutual information
The average mutual information is
I(X; Y) = lim_{Δx,Δy→0} Σ_i Σ_j p(x_i, y_j)ΔxΔy log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]
        = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy
        = H(Y) − H(Y|X) = H(X) − H(X|Y)
I(X; Y) is non-negative,
I(X; Y) ≥ 0
with equality if and only if p(x, y) = p(x) p(y).
2012/10/24 Prof. Satoshi Nakamura 226
Rate distortion for analog source
Rate-distortion function:
Let the distortion measure be d(x, y); the average distortion is given by
d̄ = ∫∫ d(x, y) p(x, y) dx dy
where p(x, y) is the joint probability density of a source sample value x and its
decoded result y. We obtain a rate-distortion function in the same manner as in the
discrete-symbol case:
R(D) = min_{d̄ ≤ D} I(X; Y)   bit/sample
R(D) is the minimum mutual information of X and Y under the condition that the average
distortion is smaller than the threshold D. It provides the lower bound of the average
code length per source sample when we code the source with binary codes under the
condition that d̄ is smaller than D.
2012/10/24 Prof. Satoshi Nakamura 227
Rate distortion of Gaussian source
We use the mean square error as the distortion measure,
d(x, y) = (x − y)²
so the average distortion is the mean square error, with the condition
d̄ = ∫∫ (x − y)² p(x) p(y|x) dx dy ≤ D
Under this condition we minimize over P(y|x)
I(X; Y) = ∫∫ p(x) p(y|x) log [ p(y|x) / p(y) ] dx dy,   where  p(y) = ∫ p(x) p(y|x) dx
and
I(X; Y) = H(X) − H(X|Y)
If the information source is Gaussian, we can use H(X) = log √(2πeσ²), so we maximize
H(X|Y) instead of minimizing I(X; Y).
2012/10/24 Prof. Satoshi Nakamura 228
Rate distortion of Gaussian source
Let Z be the random variable Z = X − Y. Then
H(X|Y) = H(X − Y | Y) = H(Z | Y) ≤ H(Z)
with equality if and only if Z and Y are independent. The power of Z, σ_Z² = d̄, is
smaller than D, and by the maximum entropy theorem H(Z) is maximized when Z follows a
Gaussian distribution with mean 0 and variance D. Then
H(Z) ≤ log √(2πeD)
Therefore
I(X; Y) ≥ log √(2πeσ²) − log √(2πeD) = (1/2) log (σ²/D)
Finally, R(D) is given by
R(D) = (1/2) log (σ²/D)   bit/sample
2012/10/24 Prof. Satoshi Nakamura 229
Rate distortion of Gaussian source
When the source signal is band-limited to 0-W, we can have 2W
samples per second, the rate-distortion function per second is given
by,
R(D) = W log (σ²/D)   bit/second
2012/10/24 Prof. Satoshi Nakamura 230
Coding of analog signal
Scalar quantization:
Scalar quantization is the discretization of the value of a source sample; the result
is called a quantized sample. If we use a B-bit binary representation, a quantized
sample can take 2^B values. Therefore the information rate necessary for transmission
or storage is
I = B · F_s   bit/second
where F_s is the sampling frequency. This coding is called PCM (pulse code modulation).
The important thing is to reduce the necessary bit rate; therefore we exploit the
probability distribution of the signal amplitude. The quantization that minimizes the
mean square error for a fixed number of quantization levels N is called optimal
quantization.
2012/10/24 Prof. Satoshi Nakamura 231
Coding of analog source
Signal-to-noise ratio:
SNR = E[x²(n)] / E[e²(n)] = σ_x² / σ_e²
Let the peak-to-peak range of the target signal be 2X_max. The quantization step of a
B-bit quantizer is
Δ = 2X_max / 2^B
If we assume that the quantization noise amplitude is uniformly distributed over one
step, we get
E[e²(n)] = (1/Δ) ∫_{−Δ/2}^{Δ/2} x² dx = Δ²/12 = (1/3) X_max² 2^{−2B}
so the SNR is
SNR = 3 · 2^{2B} · σ_x² / X_max²
or, in dB,
SNR[dB] = 6B + 4.77 − 20 log_10 (X_max / σ_x)
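A quick simulation of this SNR formula (Python/NumPy, written for this note; a Gaussian input clipped at X_max = 4σ_x and a uniform quantizer are assumptions made here):

import numpy as np

rng = np.random.default_rng(0)
sigma, Xmax, B = 1.0, 4.0, 8
x = np.clip(rng.normal(0.0, sigma, 100_000), -Xmax, Xmax)

delta = 2 * Xmax / 2**B                      # quantization step
xq = delta * np.round(x / delta)             # uniform quantization
snr_db = 10 * np.log10(np.mean(x**2) / np.mean((x - xq)**2))
print(snr_db, 6.02 * B + 4.77 - 20 * np.log10(Xmax / sigma))   # both about 41 dB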
2012/10/24 Prof. Satoshi Nakamura 232
Coding of analog source
Transform coding:
Suppose two consecutive samples x1, x2 have the uniform joint probability distribution
depicted in the figure:
p(x1, x2) = 1/C_ab inside the region, 0 outside.
The values of x1 and x2 each range over an interval of width (a + b)/√2, so with
quantization step ε the numbers of quantization levels are
L1 = L2 = (a + b)/(√2 ε)
and we need
B_x = log2 L1 + log2 L2 = 2 log2 [(a + b)/(√2 ε)]
bits to quantize x = (x1, x2).
If we rotate the axes by 45 degrees to obtain the new basis (u1, u2), then u1 and u2 are
independent, and the necessary numbers of quantization levels are L1 = a/ε for u1 and
L2 = b/ε for u2.
2012/10/24 Prof. Satoshi Nakamura 233
Coding of analog source
Namely, we need
B_u = log2 L1 + log2 L2 = log2 (ab/ε²)
bits to quantize u = (u1, u2). For example, if a = 2b,
B_x − B_u = log2 [(a + b)²/(2ab)] = 1.17 bits
2012/10/24 Prof. Satoshi Nakamura 234
Vector quantization
A method that quantizes not a single sample but a set of n samples.
Suppose the source samples are independent of each other and uniformly distributed.
Quantizing two samples separately is equivalent to assigning each sample pair to the
center point of a square cell obtained by splitting the two-dimensional (x0, x1) plane
into squares. If the cell area is Δ², the quantization error per cell is Δ²/6 and the
average mean square error per sample is Δ²/12, the same as for scalar quantization.
If we change the shape of the cell to a regular hexagon with the same number of
representative points, a hexagon with circumradius δ has area (3√3/2)δ² and second
moment (5√3/8)δ⁴ about its center.
If we make the area of the hexagonal cell equal to that of the square cell, the average
quantization noise power of the hexagon becomes 5√3/9 ≈ 0.962 times that of the square.
2012/10/24 Prof. Satoshi Nakamura 235
Vector quantization
Representative points of a square Representative points of a hexagon
Vector quantization
2012/10/24 Prof. Satoshi Nakamura 236
Vector quantization
Vector quantization is a quantization method that codes a source vector
(x0, x1, ..., x_{n-1}), composed of n consecutive samples, into the closest
representative code chosen from a set of representative codes in the n-dimensional
sample space (X0, X1, ..., X_{n-1}).
If we apply vector quantization to the source so as to minimize the average distortion
and then apply distortion-less source coding, we obtain a code whose average length per
sample approaches the lower bound R(D) as n grows.
2012/10/24 Prof. Satoshi Nakamura 237
Vector quantization
Representative points are called code words or code vectors; a set of code words is
called a codebook.
Codebook design algorithm:
There is no optimal algorithm for codebook design. Here we introduce a semi-optimal
iterative codebook design algorithm.
Given k training samples x_1, x_2, ..., x_k, the centroid is defined by
Ĉ(x_1, x_2, ..., x_k) = argmin_x Σ_{i=1}^{k} d(x_i, x)
where argmin_n f(n) means the operation of finding the n that minimizes f(n).
2012/10/24 Prof. Satoshi Nakamura 238
Vector quantization
LBG (Linde, Buzo, Gray) algorithm
Initialization (Step 1):
Let the training sample set be x_j, j = 0, ..., n−1; let N be the codebook size; set
m = 0 and D_{-1} = ∞; let ε be the distortion threshold. Choose an initial codebook
A_N^{(0)} = {y_0^{(0)}, ..., y_{N-1}^{(0)}} randomly.
Partitioning (Step 2):
Cluster the x_j into N partial sets S_i, i = 0, ..., N−1, using A_N^{(m)}:
x_j ∈ S_î,   where   î(j) = argmin_i d(x_j, y_i^{(m)})
The average distortion is
D_m = (1/n) Σ_{j=0}^{n-1} min_i d(x_j, y_i^{(m)})
If (D_{m-1} − D_m)/D_m ≤ ε, stop and output A_N^{(m)} as the codebook. Otherwise
compute the new codebook
A_N^{(m+1)} = {y_0^{(m+1)}, ..., y_{N-1}^{(m+1)}},   y_i^{(m+1)} = Ĉ(S_i^{(m)})
set m ← m + 1, and go to Step 2.
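A compact Python/NumPy sketch of the LBG loop (written for this note; squared Euclidean distortion and a random initial codebook are assumptions made here):

import numpy as np

def lbg(train, N, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), N, replace=False)]
    prev = np.inf
    while True:
        d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)                      # partitioning step
        D = d[np.arange(len(train)), nearest].mean()    # average distortion D_m
        if (prev - D) / D <= eps:
            return codebook, D
        for i in range(N):                              # centroid update
            members = train[nearest == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
        prev = D

train = np.random.default_rng(1).normal(size=(2000, 2))
codebook, D = lbg(train, N=8)
print(codebook.shape, D)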
2012/10/24 Prof. Satoshi Nakamura 239
Vector quantization
Splitting algorithm:
(Step 1) Initialization:
A_{0,1} = Ĉ(x_1, x_2, ..., x_n), the centroid of all training data; M = 1; ε is an
arbitrary vector with a small norm.
(Step 2) Split each code vector into two neighbouring vectors:
from A_{0,M} = {y_0, y_1, ..., y_{M-1}}, form y_i + ε and y_i − ε, giving
A_{0,2M} = {y_0 + ε, y_0 − ε, ..., y_{M-1} + ε, y_{M-1} − ε}
(Step 3) Using A_{0,2M} as initial values, find a sub-optimal codebook of size 2M by
the LBG algorithm. If 2M = N, stop; otherwise set M ← 2M and go to Step 2.
The splitting and LBG algorithms together generate a codebook whose size is a power of 2.
2012/10/24 Prof. Satoshi Nakamura 240
D(R) function
Distortion-rate function:
Let x be N consecutive samples of x(n). Vector quantization that codes x into y with a
codebook of size L gives
D_N(R) = min E[d(x, y)]   subject to   (1/N) H(y) ≤ R
D(R) = lim_{N→∞} D_N(R)
D(R) represents the minimum average distortion for a given rate R. Conversely, R(D)
represents the minimum rate, or minimum average code length, for a given distortion D.
2012/10/24 Prof. Satoshi Nakamura 241
Vector quantization
Tree-search VQ:
Build a tree-structured codebook; each node of the tree represents a code vector
obtained in the splitting algorithm. With parameter dimension K, the computation of
tree-search VQ is proportional to K log2 N, compared with K·N for a full search; the
memory size roughly doubles.
(Figure: the input vector descends the tree of code vectors to produce the code output.)
2012/10/24 Prof. Satoshi Nakamura 242
Multi-step VQ
Combine multiple vector quantizers to reduce the amount of calculation; the code of
each quantizer is sent over the channel. The number of multiplications can be reduced
from K·N·M to K·(N + M).
(Figure: the input vector is quantized by a first vector quantizer with its codebook,
and the residual by a second vector quantizer with its codebook.)
2012/10/24 Prof. Satoshi Nakamura 243
Gain/Shape vector quantization
Gain/shape vector quantization:
The codebook is composed of products of N_g scalar gains g_1, g_2, ..., g_{N_g} and
N_s unit vectors u_1, u_2, ..., u_{N_s}:
{ g_i u_j ;  i = 1, 2, ..., N_g,  j = 1, 2, ..., N_s }
Here g_1, ..., g_{N_g} is called the gain codebook and u_1, ..., u_{N_s} the shape
codebook. The coding algorithm is as follows.
Shape quantization:
Compute the inner product between the input vector x and each u in the shape codebook
u_1, u_2, ..., u_{N_s}, and find the unit vector u_l that gives the maximum inner product.
Gain quantization:
Find the scalar value in the gain codebook g_1, g_2, ..., g_{N_g} closest to the maximum
inner product (x, u_l). Then g_k u_l is the quantization vector, chosen out of N_g·N_s
quantization points. The number of calculations is therefore reduced from K·N_g·N_s to
K·N_s, and the memory size from K·N_g·N_s to about K·N_s + N_g.
2012/10/24 Prof. Satoshi Nakamura 244
Gain/shape vector quantization
Code vector
2012/10/24 Prof. Satoshi Nakamura 245
Speech Coding
2012/10/24 Prof. Satoshi Nakamura 246
Waveform Coding
PCM (Pulse Code Modulation) used in CD, DAT
If the signal is band-limited to 0-W [Hz], the sampling interval T [s] must satisfy
T ≤ 1/(2W).
Concept of PCM
2012/10/24 Prof. Satoshi Nakamura 247
Waveform coding (PCM)
Quantization
Let the quantization step be Δ, the number of quantization bits be B, and the range of
the signal amplitude be L; then Δ = L / 2^B.
2012/10/24 Prof. Satoshi Nakamura 248
Waveform coding
Speech waveform
Non-uniform quantization (μ-law)
2012/10/24 Prof. Satoshi Nakamura 249
Waveform coding (μ-law)
μ-law companding is used for ISDN.
μ-law (μ = 255)
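A minimal sketch of μ-law companding in Python (written for this note): compress before uniform quantization, expand after decoding.

import numpy as np

MU = 255.0

def mu_compress(x):   # x normalized to [-1, 1]
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_expand(y):     # inverse of mu_compress
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

x = np.linspace(-1, 1, 5)
print(mu_expand(mu_compress(x)))   # recovers x (up to floating-point rounding)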
2012/10/24 Prof. Satoshi Nakamura 250
Waveform coding (DPCM)
DPCM (Differential PCM)
2012/10/24 Prof. Satoshi Nakamura 251
Waveform coding (DPCM)
If quantization step is 1, quantization bit B is 5.
2012/10/24 Prof. Satoshi Nakamura 252
Waveform coding (APCM)
APCM (Adaptive PCM)
2012/10/24 Prof. Satoshi Nakamura 253
Waveform coding (APCM)
APCM (Adaptive PCM)
Quant. Bits
2012/10/24 Prof. Satoshi Nakamura 254
Waveform coding (ADPCM)
ADPCM (Adaptive Differential PCM)
2012/10/24 Prof. Satoshi Nakamura 255
Waveform coding (ADPCM)
ADPCM (Adaptive Differential PCM)
Quant. Bits
2012/10/24 Prof. Satoshi Nakamura 256
Parametric speech coding
(Figure: speech production model. The speech waveform is generated by an excitation
signal (pitch frequency) passed through the vocal tract (phonetic content), modeled as
a resonance filter.)
2012/10/24 Prof. Satoshi Nakamura 257
Parametric speech coding
Speech waveform
Framing
Short term
predict.
Excitation Signal
Linear
Prediction Coeff.
Resonance filter
Linear
Prediction Coeff.
Codebook
Code
Code Pitch
Codebook
Pitch Interval
Approximation
2012/10/24 Prof. Satoshi Nakamura 258
Parametric speech coding
Points of the parametric speech model
Approximation of excitation signal by the Impulse sequence.
Bit rates can decrease.
However, speech quality degrades seriously.
2012/10/24 Prof. Satoshi Nakamura 259
Parametric speech coding (CELP)
CELP (Code-excited Linear Prediction): Cellular phones
Speech waveform
Framing
Short term
predict.
Linear
Prediction
Coeff.
Resonance filter
Excitation Signal
Long term
predict.
Residual Signal
Pitch Interval
Linear Prediction
Coeff. Codebook
Pitch
Codebook
Gain
Codebook
Res. Signal
Codebook
Code
Code
Code
Code
2012/10/24 Prof. Satoshi Nakamura 260
Parametric speech coding (CELP)
CELP (Code-excited Linear Prediction): Cellular phones
Speech waveform
Res. Signal
Codebook
Code
Gain
Code
Gain
Long
term
predict.
Short
term
predict.
Perceptual
Weighting
Filter MSE
2012/10/24 Prof. Satoshi Nakamura 261
Speech coding
Waveform coding
Hybrid coding
Parametric
coding
2012/10/24 Prof. Satoshi Nakamura 262
Music coding
Usage of auditory characteristics for coding not of source model.
(Figure: minimum audible limit in a quiet environment; sound pressure level versus
frequency, with audible and non-audible regions.)
2012/10/24 Prof. Satoshi Nakamura 263
Music coding
Frequency masking
(Figure: a masker raises the audibility threshold around its frequency, making nearby
weaker components non-audible; sound pressure level versus frequency.)
2012/10/24 Prof. Satoshi Nakamura 264
Music coding
Temporal masking
(Figure: temporal masking curve, showing backward masking before the masker and forward
masking after it; sound pressure level versus time.)
2012/10/24 Prof. Satoshi Nakamura 265
Music coding
(Block diagram: subband filter → selection of scale factor → quantization of the
subband samples, with a masking-threshold estimation controlling the quantization.)
2012/10/24 Prof. Satoshi Nakamura 266
Music coding
Permissible error estimation
MP3: MPEG-1/L3, MPEG-2
2012/10/24 Prof. Satoshi Nakamura 267
                   MPEG1/Audio       MPEG2/Audio (low sampling fq.)   MPEG2/Audio (multi-lingual, multi-channel)
Standard No.       11172.3           13818.3                          13818.3
IS year            1992              1994                             1994
Sampling fq. [kHz] 32, 44.1, 48      16, 22.05, 24                    32, 44.1, 48
Layer              I / II / III      I / II / III                     I / II / III
Bit rate min       32 / 32 / 32      32 / 8 / 8                       32 / 32 / 32
Bit rate max       448 / 384 / 320   256 / 160 / 160
Channels           1/0, 2/0          1/0, 2/0                         1/0, 2/0, 3/0, 2/1, 2/2, 3/1, 3/2