Page 1: Entropy & Information - Universität des Saarlandes

Entropy & Information Jilles Vreeken

29 May 2015

Page 2:

Question of the day

What is information?

(and what do talking drums have to do with it?)

Page 3:

Bits and Pieces

What are…
information
a bit
entropy
mutual information
divergence
information theory
…

Page 4:

Information Theory

Field founded by Claude Shannon in 1948, ‘A Mathematical Theory of Communication’

a branch of statistics that is essentially about

uncertainty in communication

not what you say, but what you could say

Page 5:

The Big Insight

Communication is a series of discrete messages

each message reduces the uncertainty of the recipient about a) the series and b) that message

by how much is the amount of information

Page 6:

Uncertainty

Shannon showed that uncertainty can be quantified, linking physical entropy to messages

and defined the entropy of

a discrete random variable 𝑋 as

H(X) = −Σᵢ P(xᵢ) log P(xᵢ)
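This quantity is easy to compute directly; a minimal sketch in Python (function name and example distributions are our own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: exactly 1 bit per toss.
print(entropy([0.5, 0.5]))  # 1.0
# A biased coin is less uncertain, so each toss carries less information.
print(entropy([0.9, 0.1]))  # ≈ 0.469
```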

Page 7:

Optimal prefix-codes

Shannon showed that uncertainty can be quantified, linking physical entropy to messages

A (key) result of Shannon entropy is that

−log₂ P(xᵢ)

gives the length in bits of the optimal prefix code

for a message 𝑥𝑖

Page 8:

Codes and Lengths

A code 𝐶 maps a set of messages 𝑋 to a set of code words 𝑌

L_C(⋅) is a code length function for C,

with L_C(x) = |C(x)| the length in bits of the code word y ∈ Y that C assigns to symbol x ∈ X.

Page 9:

Efficiency

Not all codes are created equal. Let C₁ and C₂ be two codes for a set of messages X.

1. We call C₁ more efficient than C₂ if for all x ∈ X, L₁(x) ≤ L₂(x), while for at least one x ∈ X, L₁(x) < L₂(x).
2. We call a code C for set X complete if there does not exist a code C′ that is more efficient than C.

A code is complete when it does not waste any bits

Page 10:

The Most Important Slide

We only care about code lengths

Page 11:

The Most Important Slide

Actual code words are of no interest to us whatsoever.

Page 12:

The Most Important Slide

Our goal is to measure complexity,

not to instantiate an actual compressor

Page 13:

My First Code

Let us consider a sequence S over a discrete alphabet X = {x₁, x₂, …, x_m}.

As code C for S we can instantiate a block code, identifying the value of sᵢ ∈ S by an index over X, which requires a constant number of log₂|X| bits per message in S, i.e., L(xᵢ) = log₂|X|

We can always instantiate a prefix-free code with code words of length ⌈log₂|X|⌉
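The cost of a block code is a one-liner; a small illustration (the alphabet is chosen arbitrarily):

```python
import math

# A block code ignores frequencies: every message costs the same
# ceil(log2 |X|) bits, however skewed the distribution over X is.
X = ['a', 'b', 'c', 'd', 'e']
bits_per_message = math.ceil(math.log2(len(X)))
print(bits_per_message)  # 3 bits per message for |X| = 5
```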

Page 14:

Codes in a Tree

[figure: binary code tree; root with branches 0 and 1, leaves 00, 01, 10, 11]

Page 15:

Beyond Uniform

What if we know the distribution P(xᵢ ∈ X) over S and it is not uniform?

We do not want to waste any bits, so using block codes is a bad idea.

We do not want to introduce any undue bias, so

we want an efficient code that is uniquely decodable without having to use arbitrary length stop-words.

We want an optimal prefix-code.

Page 16:

Prefix Codes

A code C is a prefix code iff no code word C(x) is an extension of another code word C(x′). In other words, C defines a binary tree with the code words as its leaves. How do we find the optimal tree?

[figure: binary code tree; root with leaf 1 under branch 1, and leaves 00, 01 under branch 0]

Page 17:

Shannon Entropy

Let P(xᵢ) be the probability of xᵢ ∈ X in S, then

H(S) = −Σ_{xᵢ∈X} P(xᵢ) log P(xᵢ)

is the Shannon entropy of 𝑆 (wrt 𝑋)

(see Shannon 1948)

P(xᵢ) is the ‘weight’: how often we see xᵢ

−log P(xᵢ) is the number of bits needed to identify xᵢ under P

H(S) is the average number of bits needed per message sᵢ ∈ S

Page 18:

Optimal Prefix Code Lengths What if the distribution of 𝑋 in 𝑆 is not uniform?

Let 𝑃(𝑥𝑖) be the probability of 𝑥𝑖 in 𝑆, then

𝐿(𝑥𝑖) = − log𝑃(𝑥𝑖)

is the length of the optimal prefix code for message 𝑥𝑖 knowing distribution 𝑃

(see Shannon 1948)

Page 19:

Kraft’s Inequality

For any code C for finite alphabet X = {x₁, …, x_m}, the code word lengths L_C(⋅) must satisfy the inequality

Σ_{xᵢ∈X} 2^(−L(xᵢ)) ≤ 1.

a) when a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths
b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space
c) when it does not hold, the code is not uniquely decodable

Page 20:

What’s a bit?

Binary digit
the smallest and most fundamental piece of information: yes or no
invented by Claude Shannon in 1948, named by John Tukey

Bits have been in use for a long, long time, though:
punch cards (1725, 1804)
Morse code (1844)
African ‘talking drums’

Page 21:

Morse code

Page 22:

Natural language

Punishes ‘bad’ redundancy: often-used words are shorter

Rewards useful redundancy:

cotxent alolws mishaireng/raeding

African Talking Drums have used this for efficient, fast, long-distance communication

they mimic the vocalized sounds of a tonal language: a very reliable means of communication

Page 23:

Measuring bits

How much information does a given string carry? How many bits?

Say we have a binary string of 10000 ‘messages’

1) 00010001000100010001…000100010001000100010001000100010001 2) 01110100110100100110…101011101011101100010110001011011100 3) 00011000001010100000…001000100001000000100011000000100110 4) 0000000000000000000000000000100000000000000000000…0000000

obviously, all four are 10000 bits long. But, are they worth those 10000 bits?

Page 24:

So, how many bits?

Depends on the encoding!

What is the best encoding? One that takes the entropy of the data into account:
things that occur often should get short codes
things that occur seldom should get long codes

An encoding matching Shannon Entropy is optimal

Page 25:

Tell us! How many bits? Please?

In our simplest example we have

𝑃(1) = 1/100000 𝑃(0) = 99999/100000

|code(1)| = −log₂(1/100000) ≈ 16.61 bits

|code(0)| = −log₂(99999/100000) ≈ 0.0000144 bits

So, knowing 𝑃 our string contains

1 ∗ 16.61 + 99999 ∗ 0.0000144 = 18.049 bits
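This arithmetic can be reproduced without the intermediate rounding; a quick check in Python (same setup: a single ‘1’ among 100000 messages):

```python
import math

n = 100_000
p1 = 1 / n             # a single '1'
p0 = (n - 1) / n       # 99999 '0's

len1 = -math.log2(p1)  # ≈ 16.61 bits
len0 = -math.log2(p0)  # ≈ 0.0000144 bits

total = 1 * len1 + (n - 1) * len0
print(round(total, 2))  # ≈ 18.05 bits of information
```

The slide’s 18.049 comes from the rounded per-symbol code lengths; the unrounded sum lands at roughly the same ≈ 18.05 bits.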

of information

Page 26:

Optimal….

Shannon lets us calculate optimal code lengths. What about actual codes? 0.0000144 bits?

Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimal, but not of lowest expected length.

Fano gave his students an option: take the regular exam, or invent a better encoding.

David Huffman didn’t like exams; he invented Huffman codes (1952), optimal for symbol-by-symbol encoding with fixed probabilities.

(arithmetic coding is overall optimal, Rissanen 1976)
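Huffman’s bottom-up construction is short enough to sketch with a heap. This toy version (our own, and tracking only code lengths, in the spirit of the Most Important Slide) repeatedly merges the two least frequent subtrees:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Code word length per symbol for a Huffman code."""
    # Heap entries: (weight, tiebreak, {symbol: depth in subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols sink one level.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths(Counter("abracadabra"))
print(lengths['a'])  # 1: the most frequent symbol gets the shortest code
```

The resulting lengths always satisfy Kraft’s inequality with equality, i.e. the code is complete.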

Page 27:

Optimality

To encode optimally, we need optimal probabilities

What happens if we don’t?

Page 28:

Measuring Divergence

Kullback-Leibler divergence from 𝑄 to 𝑃, denoted by 𝐷(𝑃 ‖ 𝑄), measures the number of bits

we ‘waste’ when we use 𝑄 while 𝑃 is the ‘true’ distribution

D(P ‖ Q) = Σᵢ P(i) log ( P(i) / Q(i) )
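The asymmetry of the divergence is worth seeing numerically; a minimal sketch (the two distributions are chosen arbitrarily):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in bits: expected extra bits paid when coding
    P-distributed data with a code that is optimal for Q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(round(kl_divergence(p, q), 3))  # 0.737
print(round(kl_divergence(q, p), 3))  # 0.531 -- note: not symmetric
print(kl_divergence(p, p))            # 0.0   -- no waste when Q = P
```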

Page 29:

Multivariate Entropy

So far we’ve been thinking about a single sequence of messages

How does entropy work for

multivariate data?

Simple!

Page 30:

Towards Mutual Information

Conditional Entropy is defined as

H(Y|X) = Σ_{x∈X} P(x) H(Y|X = x)

‘the average number of bits needed for a message y ∈ Y when we already know X’

Not symmetric: in general H(X|Y) ≠ H(Y|X)

Page 31:

Mutual Information

the amount of information shared between two variables X and Y

I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

= Σ_{x∈X} Σ_{y∈Y} P(x, y) log ( P(x, y) / (P(x) P(y)) )

high I(X, Y) implies strong dependence, low I(X, Y) implies near-independence

Information is symmetric!
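Both extremes are easy to verify from a joint distribution; a small sketch (our own helper, with the joint given as a nested list):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution P(x, y)."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# X determines Y completely: one full bit is shared.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# X and Y independent: nothing is shared.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```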

Page 32:

Information Gain (small aside)

Entropy and KL are used in decision trees

What is the best split in a tree?

one that results in as homogeneous label distributions in the sub-nodes as possible: minimal entropy

How do we compare over multiple options?

Information Gain: IG(T, a) = H(T) − H(T|a)
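As a concrete instance of IG(T, a) = H(T) − H(T|a), here is a toy computation (the data and names are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute):
    """IG(T, a): how much splitting on attribute a reduces label entropy."""
    n = len(labels)
    h_cond = sum(
        (cnt / n) * entropy([l for l, a in zip(labels, attribute) if a == v])
        for v, cnt in Counter(attribute).items()
    )
    return entropy(labels) - h_cond

labels    = ['yes', 'yes', 'no', 'no']
attribute = ['sun', 'sun', 'rain', 'rain']
print(information_gain(labels, attribute))  # 1.0: a perfect split
```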

Page 33:

Low-Entropy Sets

Goal: find sets of attributes that interact strongly

Task: mine all sets of attributes such that the entropy over their value instantiations ≤ σ

Theory of Computation | Probability Theory 1 | count
No  | No  | 1887
Yes | No  | 156
No  | Yes | 143
Yes | Yes | 219

entropy: 1.087 bits

(Heikinheimo et al. 2007)

Page 34:

Low-Entropy Sets

Maturity Test | Software Engineering | Theory of Computation | count
No  | No  | No  | 1570
Yes | No  | No  | 79
No  | Yes | No  | 99
Yes | Yes | No  | 282
No  | No  | Yes | 28
Yes | No  | Yes | 164
No  | Yes | Yes | 13
Yes | Yes | Yes | 170

(Heikinheimo et al. 2007)

Page 35:

Low-Entropy Trees

[figure: low-entropy tree over the courses Scientific Writing, Maturity Test, Software Engineering, Project, Theory of Computation, Probability Theory 1]

(Heikinheimo et al. 2007)

Define the entropy of a tree T = (A, T₁, …, T_k) as

H_U(T) = H(A | A₁, …, A_k) + Σⱼ H_U(Tⱼ)

The tree 𝑇 for an itemset 𝐴 minimizing 𝐻𝑈 𝑇 identifies directional explanations!

H(A) ≤ H(SW | MT, SE, TC, PT) + H(MT | SE, TC, PT) + H(SE) + H(TC | PT) + H(PT)

Page 36:

Entropy for Continuous Values

So far we only considered discrete-valued data

Lots of data is continuous-valued

(or is it?)

What does this mean for entropy?

Page 37:

Differential Entropy

h(X) = −∫_X f(x) log f(x) dx

(Shannon, 1948)

Page 38:

Differential Entropy

How about… the entropy of Uniform(0, 1/2)?

h(X) = −∫₀^(1/2) 2 log 2 dx = −log 2

Hm, negative?
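The negative value is not an arithmetic slip; a numeric Riemann-sum check (our own helper, working in bits) reproduces it:

```python
import math

def differential_entropy(f, a, b, n=100_000):
    """Riemann-sum approximation of h(X) = -integral of f(x) log2 f(x) dx."""
    dx = (b - a) / n
    mids = (a + (i + 0.5) * dx for i in range(n))
    return -sum(f(x) * math.log2(f(x)) * dx for x in mids)

# Uniform(0, 1/2) has density f(x) = 2 on [0, 1/2]:
print(round(differential_entropy(lambda x: 2.0, 0.0, 0.5), 6))  # -1.0 bit
```

With log base 2 the slide’s −log 2 evaluates to exactly −1 bit.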

Page 39:

Differential Entropy

In discrete data step size ‘dx’ is trivial.

What is its effect here?

h(X) = −∫_X f(x) log f(x) dx

(Shannon, 1948)

Page 40:

Impossibru?

No.

But you’ll have to wait

till next week for the answer.

Page 41:

Conclusions

Information is related to the reduction in uncertainty of what you could say

Entropy is a core aspect of information theory: lots of nice properties (optimal prefix-code lengths, mutual information, etc.)

Entropy for continuous data is… more tricky: differential entropy is a bit problematic

Page 42:

Thank you!

Information is related to the reduction in uncertainty of what you could say

Entropy is a core aspect of information theory: lots of nice properties (optimal prefix-code lengths, mutual information, etc.)

Entropy for continuous data is… more tricky: differential entropy is a bit problematic

