Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen's inequality

Seungjin Choi

Department of Computer Science and Engineering
Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]

http://mlg.postech.ac.kr/~seungjin


Outline

- Entropy (Shannon entropy, differential entropy, and conditional entropy)
- Kullback-Leibler (KL) divergence
- Mutual information
- Jensen's inequality and Gibbs' inequality


Information Theory

- Information theory answers two fundamental questions in communication theory:
  - What is the ultimate data compression? → entropy H.
  - What is the ultimate transmission rate of communication? → channel capacity C.
- In the early 1940s, it was thought that increasing the transmission rate of information over a communication channel increased the probability of error. Shannon surprised the communication theory community by proving that this is not true as long as the communication rate is below the channel capacity.
- Although information theory was developed for communications, it is also important for explaining the ecological theory of sensory processing. Information theory plays a key role in elucidating the goal of unsupervised learning.


Shannon Entropy


Information and Entropy

- Information can be thought of as surprise, uncertainty, or unexpectedness. Mathematically it is defined by

      I = −log p_i,

  where p_i is the probability that the event labelled i occurs. A rare event gives large information; a frequent event produces small information.

- Entropy is the average information, i.e.,

      H = E[I] = −∑_{i=1}^{N} p_i log p_i.
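
As a quick numerical illustration (added here, not part of the original handout; it assumes Python with numpy), the sketch below evaluates I = −log_2 p for a rare and a frequent event, so information is measured in bits:

    import numpy as np

    def information(p, base=2.0):
        """Self-information -log_base(p) of an event with probability p."""
        return -np.log(p) / np.log(base)

    # A rare event is far more "surprising" than a frequent one.
    print(information(1.0 / 1024))  # 10.0 bits
    print(information(0.5))         # 1.0 bit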


Example: Horse Race

Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are

    (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

Suppose that we wish to send a message to another person indicating which horse won the race.

How many bits are required to describe this for each of the horses?

3 bits for any of the horses?


No! The win probabilities are not uniform. It makes sense to use shorter descriptions for the more probable horses and longer descriptions for the less probable ones, so that we achieve a lower average description length. For example, we can use the following strings to represent the eight horses:

    0, 10, 110, 1110, 111100, 111101, 111110, 111111.

The average description length in this case is 2 bits, as opposed to 3 bits for the uniform code. We calculate the entropy:

    H = −(1/2) log_2(1/2) − (1/4) log_2(1/4) − (1/8) log_2(1/8) − (1/16) log_2(1/16) − 4·(1/64) log_2(1/64) = 2 bits.

The entropy of a random variable is a lower bound on the average number of bits required to represent the random variable, and also on the average number of questions needed to identify the variable in a game of "twenty questions".
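
A small numerical check (my addition, assuming numpy is available): the snippet below recomputes the entropy of the win distribution and the expected length of the variable-length code above, and both come out to 2 bits.

    import numpy as np

    p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
    codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

    entropy = -np.sum(p * np.log2(p))              # H(X) in bits
    avg_len = np.sum(p * [len(c) for c in codes])  # expected description length

    print(entropy, avg_len)                        # both equal 2.0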


Shannon Entropy

Given a discrete random variable X with image 𝒳,

- (Shannon) Entropy is the average information (a measure of uncertainty), defined by

      H(X) = −∑_{x∈𝒳} p(x) log p(x) = E_p[−log p(x)].

- Properties
  - H(X) ≥ 0 (since each term in the summation is nonnegative).
  - H(X) = 0 if and only if P[X = x] = 1 for some x ∈ 𝒳.
  - The entropy is maximal if all the outcomes are equally likely.
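
A minimal helper (my sketch, not from the handout; numpy assumed) that also illustrates the properties above: a deterministic distribution has zero entropy, and the uniform distribution attains the maximum.

    import numpy as np

    def entropy(p, base=2.0):
        """Shannon entropy of a discrete distribution p (zero entries contribute 0)."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log(nz)) / np.log(base)

    print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits (no uncertainty)
    print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits
    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4), the maximum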


Differential Entropy

Given a continuous random variable X,

- Differential entropy is defined as

      H(p) = −∫ p(x) log p(x) dx = −E_p[log p(x)].

- Properties
  - It can be negative.
  - Given a fixed variance, the Gaussian distribution achieves the maximal differential entropy.
  - For x ∼ N(µ, σ^2), H(x) = (1/2) log(2πeσ^2).
  - For x ∼ N(µ, Σ), H(x) = (1/2) log det(2πeΣ).
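
To sanity-check the univariate Gaussian formula, the sketch below (my addition, assuming numpy and scipy) compares (1/2) log(2πeσ^2) with a numerical evaluation of −∫ p(x) log p(x) dx; both are in nats since natural logarithms are used.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    mu, sigma = 0.0, 2.0

    closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # in nats

    # Numerical differential entropy: -int p(x) log p(x) dx
    integrand = lambda x: -norm.pdf(x, mu, sigma) * norm.logpdf(x, mu, sigma)
    numerical, _ = quad(integrand, -np.inf, np.inf)

    print(closed_form, numerical)   # both ~ 2.112 nats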


Conditional Entropy

Given two discrete random variables X and Y with images 𝒳 and 𝒴, respectively, we expand the joint entropy:

    H(X,Y) = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x,y) log (1/p(x,y))
           = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x,y) log (1/(p(x)p(y|x)))
           = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x,y) log (1/p(x)) + ∑_{x∈𝒳} ∑_{y∈𝒴} p(x,y) log (1/p(y|x))
           = ∑_{x∈𝒳} p(x) log (1/p(x)) + ∑_{x∈𝒳} p(x) ∑_{y∈𝒴} p(y|x) log (1/p(y|x))
           = H(X) + ∑_{x∈𝒳} p(x) H(Y|X = x)
           = H(X) + H(Y|X).

Chain rule: H(X,Y) = H(X) + H(Y|X).
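
The chain rule is easy to verify numerically; the sketch below (an illustration I added, assuming numpy) builds a small joint table p(x, y) and checks that H(X,Y) = H(X) + H(Y|X).

    import numpy as np

    # Joint distribution p(x, y) over a 2 x 3 alphabet (rows: x, columns: y).
    pxy = np.array([[0.10, 0.30, 0.10],
                    [0.25, 0.05, 0.20]])

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    px = pxy.sum(axis=1)                       # marginal p(x)
    H_joint = H(pxy.ravel())                   # H(X, Y)
    H_x = H(px)                                # H(X)
    H_y_given_x = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))  # H(Y|X)

    print(H_joint, H_x + H_y_given_x)          # both ~2.37 bits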


- H(X,Y) ≤ H(X) + H(Y)
- H(Y|X) ≤ H(Y)

Try to prove these by yourself!
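
Before proving them, one can at least check the two inequalities on a randomly drawn joint table; the sketch below (my addition, assuming numpy) does so, using H(Y|X) = H(X,Y) − H(X).

    import numpy as np

    rng = np.random.default_rng(0)
    pxy = rng.random((4, 5))
    pxy /= pxy.sum()                 # random joint distribution p(x, y)

    def H(p):
        p = np.ravel(p)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    H_xy, H_x, H_y = H(pxy), H(px), H(py)

    print(H_xy <= H_x + H_y)         # True:  H(X,Y) <= H(X) + H(Y)
    print(H_xy - H_x <= H_y)         # True:  H(Y|X) <= H(Y)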


KL Divergence


Relative Entropy (Kullback-Leibler Divergence)

- Introduced by Solomon Kullback and Richard Leibler in 1951.
- A measure of how one probability distribution q diverges from another distribution p.
- D_KL[p‖q] denotes the KL divergence of q(x) from p(x).
- For discrete probability distributions p and q:

      D_KL[p‖q] = ∑_x p(x) log (p(x)/q(x)).

- For probability distributions p and q of continuous random variables:

      D_KL[p‖q] = ∫ p(x) log (p(x)/q(x)) dx.

- Properties of KL divergence
  - Divergence is not symmetric: D_KL[p‖q] ≠ D_KL[q‖p].
  - Divergence is always nonnegative: D_KL[p‖q] ≥ 0 (Gibbs' inequality).
  - Divergence is a convex function on the domain of probability distributions.
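
A minimal discrete KL implementation (my sketch, assuming numpy, and assuming q > 0 wherever p > 0) that also exhibits the asymmetry and nonnegativity noted above:

    import numpy as np

    def kl(p, q):
        """D_KL[p || q] for discrete distributions with q > 0 wherever p > 0."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = [0.5, 0.3, 0.2]
    q = [0.1, 0.6, 0.3]

    print(kl(p, q), kl(q, p))   # ~0.52 vs ~0.38 -> not symmetric
    print(kl(p, p))             # 0.0: the divergence vanishes iff p = q
    print(kl(p, q) >= 0)        # True (Gibbs' inequality)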


Theorem (Convexity of divergence)
Let p_1, q_1 and p_2, q_2 be probability distributions over a random variable X, and for λ ∈ (0, 1) define

    p = λp_1 + (1−λ)p_2,
    q = λq_1 + (1−λ)q_2.

Then,

    D_KL[p‖q] ≤ λ D_KL[p_1‖q_1] + (1−λ) D_KL[p_2‖q_2].

Proof. It is deferred to the end of this lecture.
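
Even before the proof, the inequality is easy to probe numerically; the sketch below (my addition, assuming numpy) draws random pairs of distributions and a random λ and checks the convexity bound in each trial.

    import numpy as np

    rng = np.random.default_rng(1)

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    def rand_dist(k):
        p = rng.random(k)
        return p / p.sum()

    for _ in range(5):
        p1, p2, q1, q2 = (rand_dist(4) for _ in range(4))
        lam = rng.uniform(0.0, 1.0)
        p = lam * p1 + (1 - lam) * p2
        q = lam * q1 + (1 - lam) * q2
        lhs = kl(p, q)
        rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
        print(lhs <= rhs + 1e-12)   # True in every trial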


Entropy and Divergence

The entropy of a random variable X with probability distribution p(x) is related to how much p(x) diverges from the uniform distribution on the support of X:

    H(X) = ∑_{x∈𝒳} p(x) log (1/p(x))
         = ∑_{x∈𝒳} p(x) log (|𝒳|/(p(x)|𝒳|))
         = log |𝒳| − ∑_{x∈𝒳} p(x) log (p(x)/(1/|𝒳|))
         = log |𝒳| − D_KL[p‖unif].

The more p(x) diverges from the uniform distribution, the lower its entropy, and vice versa.
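
The identity H(X) = log |𝒳| − D_KL[p‖unif] can be checked directly; the short sketch below (my addition, assuming numpy) does so for a random distribution, using natural logarithms on both sides.

    import numpy as np

    rng = np.random.default_rng(2)
    p = rng.random(6)
    p /= p.sum()                                     # a random distribution on 6 outcomes

    H = -np.sum(p * np.log(p))                       # entropy in nats
    kl_to_uniform = np.sum(p * np.log(p * len(p)))   # D_KL[p || unif]

    print(H, np.log(len(p)) - kl_to_uniform)         # the two agree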


Recall

    D_KL[p‖q] = ∫ p(x) log (p(x)/q(x)) dx.

Characterizing KL divergence:

- Where p and q are both high, we are happy.
- Where p is high but q is not, we pay a price.
- Where p is low, we do not care.
- If D_KL = 0, then the two distributions are equal.


KL Divergence of Two Gaussians

- Two univariate Gaussians (x ∈ R)
  - p(x) = N(µ_1, σ_1^2) and q(x) = N(µ_2, σ_2^2)
  - Calculated as

        D_KL[p‖q] = ∫ p(x) log (p(x)/q(x)) dx
                  = ∫ p(x) log p(x) dx − ∫ p(x) log q(x) dx
                  = σ_1^2/(2σ_2^2) + (µ_2−µ_1)^2/(2σ_2^2) + log(σ_2/σ_1) − 1/2.

- Two multivariate Gaussians (x ∈ R^D)
  - p(x) = N(µ_1, Σ_1) and q(x) = N(µ_2, Σ_2)
  - Calculated as

        D_KL[p‖q] = (1/2) [ tr(Σ_2^{−1}Σ_1) + (µ_2−µ_1)^⊤ Σ_2^{−1}(µ_2−µ_1) − D + log(|Σ_2|/|Σ_1|) ].
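
As a sanity check on the univariate closed form (my addition, assuming numpy and scipy), the sketch below compares it with a Monte Carlo estimate of E_p[log p(x) − log q(x)].

    import numpy as np
    from scipy.stats import norm

    mu1, s1 = 0.0, 1.0     # p = N(mu1, s1^2)
    mu2, s2 = 1.0, 2.0     # q = N(mu2, s2^2)

    closed_form = (s1**2 / (2 * s2**2)
                   + (mu2 - mu1)**2 / (2 * s2**2)
                   + np.log(s2 / s1) - 0.5)

    rng = np.random.default_rng(3)
    x = rng.normal(mu1, s1, size=200_000)                      # samples from p
    monte_carlo = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))

    print(closed_form, monte_carlo)   # ~0.443 nats, equal up to Monte Carlo error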


Mutual Information


Mutual Information

- Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions,

      I(x, y) = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x,y) log [p(x,y)/(p(x)p(y))]
              = D_KL[p(x,y)‖p(x)p(y)]
              = E_{p(x,y)}{ log [p(x,y)/(p(x)p(y))] }.

- Mutual information can be interpreted as the reduction in the uncertainty of x due to the knowledge of y, i.e.,

      I(x, y) = H(x) − H(x|y),

  where H(x|y) = −E_{p(x,y)}[log p(x|y)] is the conditional entropy.
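
The two characterizations can be verified on a small joint table; the sketch below (my addition, assuming numpy) computes I(x, y) both as D_KL[p(x,y)‖p(x)p(y)] and as H(x) − H(x|y), in nats.

    import numpy as np

    pxy = np.array([[0.20, 0.05, 0.05],
                    [0.05, 0.30, 0.05],
                    [0.05, 0.05, 0.20]])   # joint p(x, y)

    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y)

    # I(x, y) as a KL divergence between the joint and the product of marginals.
    mi_kl = np.sum(pxy * np.log(pxy / (px * py)))

    # I(x, y) = H(x) - H(x|y)
    H_x = -np.sum(px * np.log(px))
    H_x_given_y = -np.sum(pxy * np.log(pxy / py))
    mi_diff = H_x - H_x_given_y

    print(mi_kl, mi_diff)                  # ~0.27 nats, printed twice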


Convexity, Jensen's inequality, and Gibbs' inequality


Convex Sets and Functions

Definition (Convex Sets)
Let C be a subset of R^m. C is called a convex set if

    αx + (1−α)y ∈ C, ∀ x, y ∈ C, ∀ α ∈ [0, 1].

Definition (Convex Function)
Let C be a convex subset of R^m. A function f : C → R is called a convex function if

    f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y), ∀ x, y ∈ C, ∀ α ∈ [0, 1].


Jensen’s Inequality

Theorem (Jensen's Inequality)
If f(x) is a convex function and x is a random vector, then

    E[f(x)] ≥ f(E[x]).

Note: Jensen's inequality can also be rewritten for a concave function, with the direction of the inequality reversed.
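
A quick empirical illustration (my addition, assuming numpy): for the convex function f(x) = x^2, a sample average of f(x) is never smaller than f applied to the sample average.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(loc=1.0, scale=2.0, size=100_000)

    f = lambda t: t**2                     # a convex function

    print(np.mean(f(x)))                   # E[f(x)] ~ 1 + 4 = 5
    print(f(np.mean(x)))                   # f(E[x]) ~ 1
    print(np.mean(f(x)) >= f(np.mean(x)))  # True, as Jensen's inequality requires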


Proof of Jensen’s Inequality

We need to show that ∑_{i=1}^{N} p_i f(x_i) ≥ f(∑_{i=1}^{N} p_i x_i). The proof is based on recursion, working from the right-hand side of this inequality:

    f(∑_{i=1}^{N} p_i x_i) = f(p_1 x_1 + ∑_{i=2}^{N} p_i x_i)
        ≤ p_1 f(x_1) + [∑_{i=2}^{N} p_i] f((∑_{i=2}^{N} p_i x_i)/(∑_{i=2}^{N} p_i))                          (choose α = p_1/∑_{i=1}^{N} p_i)
        ≤ p_1 f(x_1) + [∑_{i=2}^{N} p_i] { α f(x_2) + (1−α) f((∑_{i=3}^{N} p_i x_i)/(∑_{i=3}^{N} p_i)) }      (choose α = p_2/∑_{i=2}^{N} p_i)
        = p_1 f(x_1) + p_2 f(x_2) + ∑_{i=3}^{N} p_i f((∑_{i=3}^{N} p_i x_i)/(∑_{i=3}^{N} p_i)),

and so forth.


Gibbs' Inequality

Theorem
D_KL[p‖q] ≥ 0, with equality iff p = q.

Proof: Consider the Kullback-Leibler divergence for discrete distributions:

    D_KL[p‖q] = ∑_i p_i log (p_i/q_i)
              = −∑_i p_i log (q_i/p_i)
              ≥ −log [∑_i p_i (q_i/p_i)]    (by Jensen's inequality, since −log is convex)
              = −log [∑_i q_i] = 0.


More on Gibbs' Inequality

In order to find the distribution p which minimizes D_KL[p‖q], we consider the Lagrangian

    E = D_KL[p‖q] + λ(1 − ∑_i p_i) = ∑_i p_i log (p_i/q_i) + λ(1 − ∑_i p_i).

Compute the partial derivative ∂E/∂p_k and set it to zero:

    ∂E/∂p_k = log p_k − log q_k + 1 − λ = 0,

which leads to p_k = q_k e^{λ−1}. It follows from ∑_i p_i = 1 that ∑_i q_i e^{λ−1} = 1, which leads to λ = 1. Therefore p_i = q_i.

The Hessian, with ∂²E/∂p_i² = 1/p_i and ∂²E/∂p_i∂p_j = 0 for i ≠ j, is positive definite, which shows that p_i = q_i is a genuine minimum.
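
The same conclusion can be reached numerically; the sketch below (my addition, assuming numpy and scipy) minimizes D_KL[p‖q] over the probability simplex and recovers p ≈ q.

    import numpy as np
    from scipy.optimize import minimize

    q = np.array([0.1, 0.2, 0.3, 0.4])

    def kl_to_q(p):
        p = np.clip(p, 1e-12, None)        # keep the log well defined near the boundary
        return np.sum(p * np.log(p / q))

    constraints = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
    bounds = [(0.0, 1.0)] * len(q)
    p0 = np.full(len(q), 1.0 / len(q))     # start from the uniform distribution

    res = minimize(kl_to_q, p0, bounds=bounds, constraints=constraints)
    print(res.x)                           # ~ [0.1, 0.2, 0.3, 0.4], i.e. p = q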
