Lecture1
Page 1: Lecture1

INFORMATION THEORY 1

INFORMATION THEORY

Communication theory deals with systems for transmitting information from one point to another.

Information theory was born with the discovery of the fundamental laws of data compression and transmission.

Page 2: Lecture1

INFORMATION THEORY 2

Introduction

Information theory answers two fundamental questions:

• What is the ultimate data compression? Answer: the entropy H.
• What is the ultimate transmission rate? Answer: the channel capacity C.

But its reach extends beyond communication theory. In the early days it was thought that increasing the transmission rate over a channel increases the error rate. Shannon showed that this is not true as long as the rate is below the channel capacity.

Shannon further showed that random processes have an irreducible complexity below which they cannot be compressed.

Page 3: Lecture1

INFORMATION THEORY 3

Information Theory (IT) relates to other fields:

• Computer Science: shortest binary program for computing a string.

• Probability Theory: fundamental quantities of IT are used to estimate probabilities.

• Inference: an approach to predicting the digits of pi; inferring the behavior of the stock market.

• Computation vs. communication: computation is communication limited and vice-versa.

Page 4: Lecture1

INFORMATION THEORY 4

It has its beginnings at the start of the 20th century, but it really took off after WW II.

• Wiener: extracting signals of a known ensemble from noise of a predictable nature.

• Shannon: encoding messages chosen from a known ensemble so that they can be transmitted accurately and rapidly even in the presence of noise.

IT: The study of efficient encoding and its consequences in the form of speed of transmission and probability of error.

Page 5: Lecture1

INFORMATION THEORY 5

Historical Perspective

• Follows S. Verdú, “Fifty Years of Shannon Theory,” IEEE Trans. Information Theory, vol. IT-44, Oct. 1998, pp. 2057–2078.

• Shannon published “A Mathematical Theory of Communication” in 1948. It lays down the fundamental laws of data compression and transmission.

• Nyquist (1924): transmission rate is proportional to the log of the number of levels in a unit duration.

- Can transmission rate be improved by replacing Morse by an ‘optimum’ code?

• Whittaker (1929): lossless interpolation of bandlimited functions.

• Gabor (1946): time-frequency uncertainty principle.

Page 6: Lecture1

INFORMATION THEORY 6

• Hartley (1928): muses on the physical possibilities of transmission rates.

- Introduces a quantitative measure for the amount of information associated with n selections of states:

  H = n log s

  where s = number of symbols available in each selection and n = number of selections.

- Information = the outcome of a selection among a finite number of possibilities.

Page 7: Lecture1

INFORMATION THEORY 7

Data Compression

• Shannon uses the definition of entropy

  H = - Σ_i p_i log p_i

  as a measure of information. Rationale:
  (1) continuous in the probabilities;
  (2) increasing with n for equiprobable r.v.;
  (3) additive – the entropy of a collection of independent r.v. equals the sum of the individual entropies.

• For memoryless sources, entropy satisfies Shannon's Theorem 3: given any ε > 0 and δ > 0, we can find N0 such that sequences of any length N ≥ N0 fall into two classes:

  (1) a set whose total probability is less than ε;
  (2) the remainder set, all of whose members have probabilities p satisfying

      | (1/N) log(1/p) - H | < δ
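Shannon's Theorem 3 is easy to check numerically. The following short Python sketch (not part of the original lecture; the source distribution p is an arbitrary choice for illustration) draws a long i.i.d. sequence from a memoryless source and verifies that (1/N) log(1/p(x1...xN)) is close to H:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])               # assumed source distribution (illustrative only)
H = -np.sum(p * np.log2(p))                 # entropy in bits per symbol

N = 10_000
x = rng.choice(len(p), size=N, p=p)         # one long i.i.d. sequence from the source
per_symbol = -np.log2(p[x]).mean()          # (1/N) log2( 1 / p(x1...xN) )

print(f"H = {H:.4f} bits, (1/N) log(1/p) = {per_symbol:.4f} bits")
# For large N the two numbers agree to within a small delta with high probability.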

Page 8: Lecture1

INFORMATION THEORY 8

Reliable Communication

• Shannon: …..redundancy must be introduced to combat the particular noise structure involved … a delay is generally required to approach the ideal encoding.

• Defines the channel capacity

  C = max [ H(X) - H(X|Y) ]

  where the maximum is over the input distributions p(x).

• It is possible to send information at the rate C through the channel with as small a frequency of errors or equivocation as desired by proper encoding. This statement is not true for any rate greater than C.

• Defines the differential entropy of a continuous random variable as a formal analog to the entropy of a discrete random variable.

• Shannon obtains the formula for the capacity of a power-constrained white Gaussian channel with flat transfer function:

  C = W log( (P + N) / N )
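A one-line computation makes the Gaussian capacity formula concrete. This is a sketch, not part of the original lecture; the bandwidth and signal-to-noise values below are arbitrary assumptions:

import math

def awgn_capacity(W_hz, P_watts, N_watts):
    """Capacity C = W log2((P + N)/N) of a flat, power-constrained white Gaussian channel, in bits/s."""
    return W_hz * math.log2((P_watts + N_watts) / N_watts)

# Example: 3 kHz bandwidth, P/N = 1000 (30 dB SNR)  ->  about 29.9 kbit/s
print(awgn_capacity(3000.0, 1000.0, 1.0))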

Page 9: Lecture1

INFORMATION THEORY 9

• The minimum energy per bit necessary for reliable transmission is 1.6 dB below the noise power spectral density (i.e., Eb/N0 ≥ ln 2 ≈ -1.59 dB).

• Some interesting points about the capacity relation:
  - Since any strictly bandlimited signal has infinite duration, the rate of information of any finite codebook of bandlimited waveforms is equal to zero.
  - Transmitted signals must approximate the statistical properties of white noise.

• Generalization to dispersive/nonwhite Gaussian channels is given by Shannon's "water-filling" formula.

• Constraints other than power constraints are of interest:
  - Amplitude constraints
  - Quantized constraints
  - Specific modulations.
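A quick numerical check of the 1.6 dB figure (a sketch, not from the original slides): letting the bandwidth grow in C = W log2(1 + P/(N0 W)) gives C -> (P/N0) log2(e), so reliable transmission requires an energy per bit Eb/N0 of at least ln 2:

import math

eb_n0_min = math.log(2)                    # ln 2 ~ 0.693
print(10 * math.log10(eb_n0_min))          # ~ -1.59 dB, i.e. about 1.6 dB below the noise psd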

Page 10: Lecture1

INFORMATION THEORY 10

Zero-Error Channel Capacity

• Example of typing a text: with a non-zero probability of mistyping each letter, the probability of at least one error in the text tends to 1 as the length increases.

• By designing a code that takes into account the statistics of the typist’s mistakes, the prob. of error can be made 0.

• Example: consider mistakes made by mistyping neighboring letters. The alphabet {b, i, t, s} has no neighboring letters, hence it can be used with zero probability of error.

• Zero-error capacity: the rate at which information can be encoded with zero prob. of error.

Page 11: Lecture1

INFORMATION THEORY 11

Error Exponent

• Rather than focus on the channel capacity, study the error probability (EP) as a function of block length.

• The EP decreases exponentially as a function of block length in Gaussian and discrete memoryless channels.

• The exponent of the minimum achievable EP is a function of the rate referred to as reliability function.

• An important rate that serves as lower bound to the reliability function is the cutoff rate.

• The cutoff rate was long thought to be the "practical" limit to the transmission rate.

• Turbo codes refuted that notion.

Page 12: Lecture1

INFORMATION THEORY 12

ERROR CONTROL MECHANISMS

Error Control Strategies

• The goal of ‘error-control’ is to reduce the effect of noise in order to reduce or eliminate transmission errors.

• ‘Error-Control Coding’ refers to adding redundancy to the data. The redundant symbols are subsequently used to detect or correct erroneous data.

Page 13: Lecture1

INFORMATION THEORY 13

• Error control strategy depends on the channel and on the specific application.

- Error control for one-way channels is referred to as forward error control (FEC). It can be accomplished by:
  * error detection and correction – hard detection;
  * reducing the probability of an error – soft detection.
- For two-way channels: error detection is a simpler task than error correction. Retransmit the data only when an error is detected: automatic repeat request (ARQ).

• In this course we focus on wireless data communications, hence we will not delve into error concealment techniques such as interpolation, used in audio and video recording.

• Error control schemes may be priority based, i.e., provide more protection to certain types of data than to others. For example, in wireless cellular standards, the transmitted bits are divided into three classes: bits that get double code protection, bits that get single code protection, and bits that are not protected.

Page 14: Lecture1

INFORMATION THEORY 14

Block and Convolutional Codes

• Error control codes can be divided into two large classes: block codes and convolutional codes.

• Information bits are encoded using an alphabet Q of q distinct symbols.

• Designers of early digital communications system tried to improve reliability by increasing power or bandwidth.

Page 15: Lecture1

INFORMATION THEORY 15

• Shannon taught us how to buy performance with a less expensive resource: complexity.

• Formal definition of a code C: a set of 2^k n-tuples.

• Encoder: the set of 2^k pairs (m, c), where m is the data word and c is the code word.

• Linear code: the set of codewords is closed under modulo-2 addition.

• Error detection and correction correspond to the two terms in the Fano inequality

  H(X|Y) ≤ H(e) + P(e) log(2^k - 1)

  - Error detection reduces H(e).
  - Error correction reduces P(e) log(2^k - 1).
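The simplest possible linear code makes these ideas concrete. The following Python sketch (not from the original lecture) implements a (3,1) binary repetition code and shows closure under modulo-2 addition, error detection, and single-error correction:

import itertools

def encode(bit):
    return [bit, bit, bit]                      # k = 1 data bit -> n = 3 code bits

def detect_error(r):
    return len(set(r)) > 1                      # any disagreement flags an error

def correct(r):
    return int(sum(r) >= 2)                     # majority vote corrects any single bit error

# Linearity: the codeword set {000, 111} is closed under modulo-2 (XOR) addition.
codewords = [encode(0), encode(1)]
for c1, c2 in itertools.product(codewords, repeat=2):
    assert [a ^ b for a, b in zip(c1, c2)] in codewords

r = [1, 0, 1]                                   # received word with one flipped bit
print(detect_error(r), correct(r))              # True 1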

Page 16: Lecture1

INFORMATION THEORY 16

BASIC DEFINITIONS

Define Entropy, Relative Entropy, Mutual Information

Entropy, Mutual Information

Entropy is a measure of the uncertainty of a random variable. Let X be a discrete random variable (r.v.) with alphabet A and probability mass function p(x) = Pr{X = x}.

• (D1) The entropy H(X) of a discrete r.v. X is defined (in bits) by

  H(X) = - Σ_{x∈A} p(x) log p(x)

  where the log is to the base 2.

• Comments:
  (1) Simplest example: the entropy of a fair coin is 1 bit.
  (2) Adding terms of zero probability does not change the entropy (0 log 0 = 0).
  (3) Entropy depends only on the probabilities of X, not on the actual values.
  (4) Entropy is H(X) = E[ log 1/p(X) ].
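Entropy is a one-line computation. A minimal Python sketch (not part of the original lecture) matching definition (D1):

import math

def entropy(pmf):
    """H(X) in bits; terms with p = 0 contribute nothing (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))                   # fair coin: 1.0 bit
print(entropy([0.5, 0.25, 0.125, 0.125]))    # the example two slides ahead: 1.75 bits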

Page 17: Lecture1

INFORMATION THEORY 17

Properties of Entropy

• (P1) H(X) ≥ 0, since 0 ≤ p(x) ≤ 1 implies log[ 1/p(x) ] ≥ 0.

• [E] Binary r.v.: X = 1 with probability p, X = 0 with probability 1-p:

  H(X) = - p log p - (1-p) log(1-p) = H(p)

Page 18: Lecture1

INFORMATION THEORY 18

• [E] X takes the values a, b, c, d with probabilities 1/2, 1/4, 1/8, 1/8:

  H(X) = -½ log ½ - ¼ log ¼ - 1/8 log 1/8 - 1/8 log 1/8 = 1.75 bits

Another interpretation of entropy: the minimum expected number of binary questions needed to determine the value of X. Ask "Is X = a?"; if no, "Is X = b?"; if no, "Is X = c?".

It turns out that the expected number of binary questions is 1.75.

Page 19: Lecture1

INFORMATION THEORY 19

• (D2) The joint entropy H(X,Y) is defined by

  H(X,Y) = - Σ_{x∈A} Σ_{y∈B} p(x,y) log p(x,y) = - E[ log p(X,Y) ]

• (D3) The conditional entropy H(Y|X) is defined by

  H(Y|X) = Σ_{x∈A} p(x) H(Y|X=x)
         = - Σ_{x∈A} p(x) Σ_{y∈B} p(y|x) log p(y|x)
         = - Σ_{x∈A} Σ_{y∈B} p(x,y) log p(y|x)
         = - E[ log p(Y|X) ]
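These definitions translate directly into code. A minimal sketch (not from the original lecture; the 2x2 joint pmf is an arbitrary assumption) that computes H(X,Y), H(X) and H(Y|X) and checks the chain rule proved on the next slide:

import numpy as np

p_xy = np.array([[1/8, 3/8],                     # rows index x, columns index y (illustrative)
                 [1/4, 1/4]])

def H(p):
    """Entropy in bits of any pmf given as an array."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p_x = p_xy.sum(axis=1)                                       # marginal of X
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
H_Y_given_X = float(-(p_xy * np.log2(p_y_given_x)).sum())    # -E[ log p(Y|X) ]

print(H(p_xy), H(p_x), H_Y_given_X)
assert abs(H(p_xy) - (H(p_x) + H_Y_given_X)) < 1e-12         # chain rule: H(X,Y) = H(X) + H(Y|X)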

Page 20: Lecture1

INFORMATION THEORY 20

• (P2) Chain rule:

  H(X,Y) = H(X) + H(Y|X)

Entropy: a measure of the uncertainty of a r.v.; the amount of information required on the average to describe the r.v.
Relative entropy: a measure of the distance between two distributions.

Proof of the chain rule:

  H(X,Y) = - Σ_{x,y} p(x,y) log p(x,y)
         = - Σ_{x,y} p(x,y) log [ p(x) p(y|x) ]
         = - Σ_{x,y} p(x,y) log p(x) - Σ_{x,y} p(x,y) log p(y|x)
         = - Σ_x p(x) log p(x) - Σ_{x,y} p(x,y) log p(y|x)
         = H(X) + H(Y|X)

Page 21: Lecture1

INFORMATION THEORY 21

• (D4) The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined by

  D(p||q) = Σ_x p(x) log [ p(x) / q(x) ]

  Relative entropy is 0 iff p = q.

Mutual information: a measure of the amount of information one r.v. contains about another r.v.

• (D5) Given two r.v. X, Y with joint pmf p(x,y) and marginal distributions p(x), p(y), the mutual information is the relative entropy between the joint distribution p(x,y) and the product distribution p(x)p(y):

  I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / ( p(x) p(y) ) ] = D( p(x,y) || p(x) p(y) )
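Both quantities are easy to compute. A minimal Python sketch (not from the original lecture) for D(p||q) and for I(X;Y) obtained as the KL distance between the joint pmf and the product of its marginals:

import numpy as np

def kl(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

def mutual_information(p_xy):
    """I(X;Y) = D( p(x,y) || p(x)p(y) ), computed from a joint pmf matrix."""
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl(p_xy.ravel(), (p_x * p_y).ravel())

# Independent variables carry no information about each other (T7 later in the notes).
print(mutual_information(np.outer([0.5, 0.5], [0.25, 0.75])))   # 0.0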

Page 22: Lecture1

INFORMATION THEORY 22

• (E) Let A = {0,1}, with p(0) = 1-r, p(1) = r and q(0) = 1-s, q(1) = s. Then

  D(p||q) = (1-r) log[ (1-r)/(1-s) ] + r log( r/s )
  D(q||p) = (1-s) log[ (1-s)/(1-r) ] + s log( s/r )

  For r = 1/2, s = 1/4:

  D(p||q) = 0.2075 bits
  D(q||p) = 0.1887 bits

  In general D(p||q) ≠ D(q||p).

Properties of MI:

  I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / ( p(x) p(y) ) ]
         = Σ_{x,y} p(x,y) log [ p(x|y) / p(x) ]
         = - Σ_{x,y} p(x,y) log p(x) + Σ_{x,y} p(x,y) log p(x|y)
         = - Σ_x p(x) log p(x) - ( - Σ_{x,y} p(x,y) log p(x|y) )
         = H(X) - H(X|Y)
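A quick numerical check of the asymmetry example above (a sketch, not from the original lecture), with r = 1/2 and s = 1/4:

import math

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]                        # p(0) = 1 - r, p(1) = r
q = [0.75, 0.25]                      # q(0) = 1 - s, q(1) = s
print(round(kl_bits(p, q), 4))        # 0.2075 bits
print(round(kl_bits(q, p), 4))        # 0.1887 bits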

Page 23: Lecture1

INFORMATION THEORY 23

• (P1) I(X;Y) = H(X) - H(X|Y)

Interpretation: Mutual Information (MI) is the reduction in the uncertainty of X due to the knowledge of Y.

X says about Y as much as Y says about X:

• (P2) I(X,Y) = H(Y) – H(Y|X) = I(Y,X)

• (P3) I(X;X) = H(X) (no remaining uncertainty: H(X|X) = 0)

Since H(X,Y) = H(X) + H(Y|X) (chain rule), it follows that H(Y|X) = H(X,Y) – H(X), hence

• (P4) I(X,Y) = H(X) + H(Y) – H(X,Y)

Page 24: Lecture1

INFORMATION THEORY 24

Multiple Variables – Chain Rules

In this section, some of the results of the previous section are extended to multiple variables.

• (T1) Chain rule for entropy: let X1, X2, ..., Xn ~ p(x1, x2, ..., xn). Then

  H(X1, X2, ..., Xn) = Σ_{i=1}^{n} H(Xi | X_{i-1}, ..., X1)

  For example:
  H(X1, X2) = H(X1) + H(X2|X1)
  H(X1, X2, X3) = H(X1) + H(X2, X3 | X1)
               = H(X1) + H(X2|X1) + H(X3|X2, X1)

• (D6) The conditional mutual information of random variables X and Y given Z is defined by

  I(X;Y|Z) = H(X|Z) - H(X|Y,Z)

Page 25: Lecture1

INFORMATION THEORY 25

• (T2) Chain rule for mutual information:

  I(X1, X2; Y) = I(X1; Y) + I(X2; Y | X1)

  This can be generalized to arbitrary n:

  I(X1, X2, ..., Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | X_{i-1}, ..., X1)

  Proof (for n = 2):

  I(X1, X2; Y) = Σ_{x1,x2,y} p(x1,x2,y) log [ p(x1,x2,y) / ( p(x1,x2) p(y) ) ]

  Use p(x1,x2,y) = p(x1,x2|y) p(y); then

  I(X1, X2; Y) = Σ_{x1,x2,y} p(x1,x2,y) log [ p(x1,x2|y) / p(x1,x2) ]
             = - Σ_{x1,x2} p(x1,x2) log p(x1,x2) + Σ_{x1,x2,y} p(x1,x2,y) log p(x1,x2|y)
             = H(X1,X2) - H(X1,X2|Y)
             = H(X1) + H(X2|X1) - [ H(X1|Y) + H(X2|X1,Y) ]
             = I(X1;Y) + I(X2;Y|X1)

Page 26: Lecture1

INFORMATION THEORY 26

• (D7) The conditional relative entropy D(p(y|x)||q(y|x)) is the relative entropy between the corresponding conditional distributions averaged over x:

  D( p(y|x) || q(y|x) ) = Σ_x p(x) Σ_y p(y|x) log [ p(y|x) / q(y|x) ]

• (T3) Chain rule for relative entropy:

  D( p(x,y) || q(x,y) ) = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )

  Proof:

  D( p(x,y) || q(x,y) ) = Σ_{x,y} p(x,y) log [ p(x,y) / q(x,y) ]
                       = Σ_{x,y} p(x,y) log [ p(x) p(y|x) / ( q(x) q(y|x) ) ]
                       = Σ_{x,y} p(x,y) log [ p(x)/q(x) ] + Σ_{x,y} p(x,y) log [ p(y|x)/q(y|x) ]
                       = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )

Page 27: Lecture1

INFORMATION THEORY 27

Jensen’s Inequality

• (D8) f(x) is convex over an interval (a,b) if for every x1, x2 ∈ (a,b) and 0 ≤ λ ≤ 1,

  f( λ x1 + (1-λ) x2 ) ≤ λ f(x1) + (1-λ) f(x2)

  Strictly convex if the strict inequality holds.

• (D9) f(x) is concave if -f is convex.

A convex function always lies below any chord (the straight line connecting two points on the curve). Convex functions are very important in IT.

Page 28: Lecture1

INFORMATION THEORY 28

Simple results for convex functions:

• (T4) If f''(x) ≥ 0, the function is convex.

  Proof: Taylor expansion about x0:

  f(x) = f(x0) + f'(x0)(x - x0) + ½ f''(x*)(x - x0)^2

  where x* lies between x0 and x. Let x0 = λ x1 + (1-λ) x2.
  Since the last term in the Taylor expansion is non-negative,

  f(x1) ≥ f(x0) + f'(x0)(x1 - x0) = f(x0) + f'(x0)[ (1-λ)(x1 - x2) ]

  Similarly

  f(x2) ≥ f(x0) + f'(x0)(x2 - x0) = f(x0) + f'(x0)[ λ(x2 - x1) ]

  Multiplying the first inequality by λ, the second by (1-λ), and adding:

  λ f(x1) + (1-λ) f(x2) ≥ f(x0) = f( λ x1 + (1-λ) x2 )

  The relation meets the definition of a convex function.

Page 29: Lecture1

INFORMATION THEORY 29

• (T5) (Jensen's inequality)
  (1) If f is convex and X is a r.v., then E[f(X)] ≥ f(E[X]).
  (2) If f is strictly convex and E[f(X)] = f(E[X]), then X = E[X], i.e., X is a constant.

  Proof: For a two-point distribution p1, p2 the statement

  p1 f(x1) + p2 f(x2) ≥ f( p1 x1 + p2 x2 )

  is just the definition of convexity.

  Induction: suppose the theorem is true for k-1 points. Let p'_i = p_i / (1 - p_k) for i = 1, ..., k-1; this makes {p'_i} a set of probabilities. Then

  E[f(X)] = Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) Σ_{i=1}^{k-1} p'_i f(x_i)
          ≥ p_k f(x_k) + (1 - p_k) f( Σ_{i=1}^{k-1} p'_i x_i )
          ≥ f( p_k x_k + (1 - p_k) Σ_{i=1}^{k-1} p'_i x_i )
          = f( Σ_{i=1}^{k} p_i x_i ) = f(E[X])

From Jensen's inequality follow a number of fundamental IT theorems.
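A two-line numerical check of Jensen's inequality (a sketch, not from the original lecture) for the convex function f(x) = x^2 and an arbitrary pmf:

import numpy as np

x = np.array([-1.0, 0.0, 2.0, 5.0])        # support of X (illustrative)
p = np.array([0.2, 0.3, 0.4, 0.1])         # probabilities, summing to 1

E_f_X = float((p * x**2).sum())            # E[f(X)]
f_E_X = float((p * x).sum()) ** 2          # f(E[X])
print(E_f_X, f_E_X, E_f_X >= f_E_X)        # e.g. 4.3, 1.21, True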

Page 30: Lecture1

INFORMATION THEORY 30

• (T6) (Information inequality) For pmfs p(x), q(x):

  D(p||q) ≥ 0

  with equality iff p(x) = q(x) for all x.

  Proof:

  -D(p||q) = - Σ_{x∈A} p(x) log [ p(x)/q(x) ]
           = Σ_{x∈A} p(x) log [ q(x)/p(x) ]
           ≤ log Σ_{x∈A} p(x) [ q(x)/p(x) ]      (Jensen's inequality; log is concave)
           = log Σ_{x∈A} q(x)
           ≤ log 1 = 0

  If q(x) = p(x), equality is clearly obtained. If equality holds, it means that q(x) = p(x) (since the log is strictly concave).

Page 31: Lecture1

INFORMATION THEORY 31

• (T7) (Non-negativity of MI) For r.v. X, Y:  I(X;Y) ≥ 0, with I(X;Y) = 0 iff X, Y are independent.

  Proof: Follows from the relation

  I(X;Y) = D( p(x,y) || p(x) p(y) ) ≥ 0

  From the information inequality, the equality holds iff p(x,y) = p(x)p(y), i.e., X, Y are independent.

Let |A| be the number of elements in the set A.

• (T8) H(X) ≤ log|A|, with equality iff X has a uniform distribution over A.

  Proof: Let u(x) = 1/|A| be the uniform distribution. Then

  D(p||u) = Σ p(x) log [ p(x)/u(x) ] = Σ p(x) log( |A| p(x) ) = log|A| - H(X) ≥ 0

  Interpretation: the uniform distribution achieves maximum entropy.

Page 32: Lecture1

INFORMATION THEORY 32

• (T9) (Conditioning reduces entropy)

  H(X|Y) ≤ H(X)

  with equality H(X|Y) = H(X) iff X and Y are independent.

  Proof: It follows from 0 ≤ I(X;Y) = H(X) - H(X|Y).

  Interpretation: on the average, knowing about Y can only reduce the uncertainty about X.

• [E] For a particular joint pmf p(x,y):

  p(X=1) = Σ_y p(1,y) = 1/8,   p(X=2) = Σ_y p(2,y) = 7/8
  H(X) = H(1/8, 7/8) = 0.544 bits
  H(X|Y=1) = - Σ_x p(x|1) log p(x|1) = 0.3113 bits
  H(X|Y=2) = - Σ_x p(x|2) log p(x|2)
  H(X|Y) = (3/4) H(X|Y=1) + (1/4) H(X|Y=2) = 0.4210 bits

  The uncertainty of X is decreased if Y=1 is observed, it is increased if Y=2 is observed, and it is decreased on the average.

Page 33: Lecture1

INFORMATION THEORY 33

• (T10) (Independence bound for entropy) Let X1, X2, ..., Xn ~ p(x1, x2, ..., xn). Then

  H(X1, X2, ..., Xn) ≤ Σ_{i=1}^{n} H(Xi)

  with equality iff the Xi are independent.

  Proof (n = 2): by the chain rule and T9,

  H(X1, X2) = H(X1) + H(X2|X1) ≤ H(X1) + H(X2)

Page 34: Lecture1

INFORMATION THEORY 34

• (T11) (Log sum inequality) For non-negative numbers a1, ..., an and b1, ..., bn:

  Σ_{i=1}^{n} a_i log( a_i / b_i ) ≥ ( Σ_i a_i ) log( Σ_i a_i / Σ_i b_i )

  Equality iff a_i / b_i = const.

  Proof: f(t) = t log t is strictly convex, since its second derivative (log e)/t > 0 for t > 0; hence by Jensen's inequality

  Σ_i α_i f(t_i) ≥ f( Σ_i α_i t_i )    for α_i ≥ 0, Σ_i α_i = 1.

  Set α_i = b_i / Σ_j b_j and t_i = a_i / b_i; then

  Σ_i ( a_i / Σ_j b_j ) log( a_i / b_i ) ≥ ( Σ_i a_i / Σ_j b_j ) log( Σ_i a_i / Σ_j b_j )

  which is the log sum inequality.

Page 35: Lecture1

INFORMATION THEORY 35

• (T12) Convexity of relative entropy: D(p||q) is convex in the pair (p,q), i.e.

  D( λp1 + (1-λ)p2 || λq1 + (1-λ)q2 ) ≤ λ D(p1||q1) + (1-λ) D(p2||q2)

  Proof: apply the log sum inequality to each term of the left-hand side,

  [ λp1(x) + (1-λ)p2(x) ] log [ ( λp1(x) + (1-λ)p2(x) ) / ( λq1(x) + (1-λ)q2(x) ) ]
      ≤ λ p1(x) log [ p1(x)/q1(x) ] + (1-λ) p2(x) log [ p2(x)/q2(x) ]

  and sum over x.

• (T13) Concavity of entropy: H(p) is a concave function of p.

  Proof: H(p) = log|A| - D(p||u); since D is convex, H is concave.

• (D10) The r.v. X, Y, Z form a Markov chain X → Y → Z (Z is conditionally independent of X given Y) if

  p(x,y,z) = p(x) p(y|x) p(z|y)

Page 36: Lecture1

INFORMATION THEORY 36

• (T14) Data processing inequality: if X → Y → Z, then I(X;Y) ≥ I(X;Z).

  Proof: by the chain rule for information (T2), expand I(X; Y,Z) in two ways:

  I(X; Y,Z) = I(X;Z) + I(X;Y|Z)
            = I(X;Y) + I(X;Z|Y)

  Since X and Z are conditionally independent given Y, I(X;Z|Y) = 0. It follows that

  I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z)

  Equality holds iff I(X;Y|Z) = 0, i.e., X → Z → Y also forms a Markov chain.

  In particular, if Z = g(Y) we have I(X;Y) ≥ I(X; g(Y)).
  A function of the data Y cannot increase the information about X.
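The inequality is easy to observe numerically. A minimal sketch (not from the original lecture; the source distribution and the two transition matrices are arbitrary assumptions) that builds a Markov chain X -> Y -> Z and compares I(X;Y) with I(X;Z):

import numpy as np

def mi(p_ab):
    """Mutual information in bits from a joint pmf matrix."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / (p_a * p_b)[mask])).sum())

p_x = np.array([0.3, 0.7])                       # assumed source distribution
P_y_given_x = np.array([[0.9, 0.1],              # assumed channel X -> Y
                        [0.2, 0.8]])
P_z_given_y = np.array([[0.8, 0.2],              # assumed processing Y -> Z
                        [0.3, 0.7]])

p_xy = p_x[:, None] * P_y_given_x                # joint p(x,y)
p_xz = p_xy @ P_z_given_y                        # p(x,z) = sum_y p(x,y) p(z|y)
print(mi(p_xy), mi(p_xz))                        # I(X;Y) >= I(X;Z), as T14 guarantees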

Page 37: Lecture1

INFORMATION THEORY 37

Application – Sufficient Statistic

Use data processing inequality to clarify idea of sufficient statistic.

• (D11) A function T(X) is a sufficient statistic relative to the family {f_θ(x)} if X is independent of θ given T(X), i.e., θ → T(X) → X; T(X) provides all the information about θ.

  In general, we have a family of distributions {f_θ(x)} indexed by θ, X a sample from a distribution in the family, and T(X) a function of the sample. Then θ → X → T(X), hence by the data processing inequality

  I(θ; T(X)) ≤ I(θ; X)

  For a sufficient statistic, I(θ; T(X)) = I(θ; X), which means that the MI is preserved.

Page 38: Lecture1

INFORMATION THEORY 38

• Example: X1, ..., Xn, Xi ∈ {0,1}, i.i.d., with distribution parameter θ = Pr(Xi = 1). Define

  T(X1, ..., Xn) = Σ_{i=1}^{n} Xi

  How to show independence of X and θ given T? Show that, given T, all sequences with k ones are equally likely, independent of θ:

  Pr{ (X1,...,Xn) = (x1,...,xn) | Σ_i Xi = k } = 1 / (n choose k)   if Σ_i xi = k
                                              = 0                  otherwise

  i.e., the conditional probability of any particular sequence with k ones out of n does not depend on θ. Thus

  θ → T → (X1, ..., Xn)

  forms a Markov chain and T is a sufficient statistic.

Page 39: Lecture1

INFORMATION THEORY 39

Fano's Inequality

Suppose we know r.v. Y and wish to guess the value of a correlated r.v. X. Intuition says that if H(X|Y) = H(X), knowing Y will not help. Conversely, if H(X|Y) = 0, then X can be estimated with no error. We now consider all the cases in between.

Let X ~ p(x). Observe Y, related to X by p(y|x). From Y calculate the estimate X̂ = g(Y). Then X → Y → X̂ forms a Markov chain (X̂ is conditionally independent of X given Y). The probability of error is defined as

  Pe = Pr{ X̂ ≠ X }

Page 40: Lecture1

INFORMATION THEORY 40

• (T15) Fano's inequality:

  H(Pe) + Pe log( |A| - 1 ) ≥ H(X|Y)

  and the weaker inequalities

  1 + Pe log|A| ≥ H(X|Y)

  Pe ≥ ( H(X|Y) - 1 ) / log|A|

  where |A| is the alphabet size.

  Proof: define the error event E = 1 if X̂ ≠ X, E = 0 if X̂ = X. By the chain rule,

  H(E, X | Y) = H(X|Y) + H(E|X,Y) = H(X|Y)        (*)

  since H(E|X,Y) = 0: there is no uncertainty about the error event once X (and Y) is known.

Page 41: Lecture1

INFORMATION THEORY 41

Alternative Expansion

  H(E, X | Y) = H(E|Y) + H(X | E, Y)

  H(E|Y) ≤ H(E) = - Pe log Pe - (1-Pe) log(1-Pe) = H(Pe)        (conditioning reduces entropy)

  H(X | E, Y) = Pr(E=0) H(X | Y, E=0) + Pr(E=1) H(X | Y, E=1)
             ≤ (1-Pe)·0 + Pe log( |A| - 1 )

  (**) Given E = 1, X ≠ X̂, so H(X | Y, E=1) is bounded by the log of the number of remaining outcomes, log(|A| - 1) (by T8).

  From (*) and (**) we get Fano's inequality:

  H(X|Y) ≤ H(Pe) + Pe log( |A| - 1 )
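The weak form of Fano's inequality gives a computable lower bound on the error probability of any estimator. A minimal sketch (not from the original lecture; the numerical values are arbitrary assumptions):

import math

def fano_lower_bound(H_X_given_Y_bits, alphabet_size):
    """Lower bound on Pe from the weak form of Fano's inequality, Pe >= (H(X|Y) - 1)/log2|A| (clipped at 0)."""
    return max(0.0, (H_X_given_Y_bits - 1.0) / math.log2(alphabet_size))

# Example: 3 bits of residual uncertainty about a symbol from a 32-letter alphabet
print(fano_lower_bound(3.0, 32))    # 0.4 -> no estimator can achieve error probability below 0.4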

