
Entropy & Information

Jilles Vreeken, Universität des Saarlandes

29 May 2015

Question of the day

What is information?

(and what do talking drums have to do with it?)

Bits and Pieces

What are: information, a bit, entropy, mutual information, divergence, information theory, …

Information Theory

Field founded by Claude Shannon in 1948, ‘A Mathematical Theory of Communication’

a branch of statistics that is essentially about

uncertainty in communication

not what you say, but what you could say

The Big Insight

Communication is a series of discrete messages

Each message reduces the uncertainty of the recipient about a) the series and b) that message.

By how much it does so is the amount of information.

Uncertainty

Shannon showed that uncertainty can be quantified, linking physical entropy to messages

and defined the entropy of a discrete random variable X as

H(X) = -\sum_i P(x_i) \log P(x_i)

Optimal prefix-codes


A (key) result of Shannon entropy is that

-\log_2 P(x_i)

gives the length in bits of the optimal prefix code for a message x_i

Codes and Lengths

A code 𝐶 maps a set of messages 𝑋 to a set of code words 𝑌

L_C(\cdot) is the code length function for C,

with L_C(x) = |C(x)| the length in bits of the code word y ∈ Y that C assigns to symbol x ∈ X.

Efficiency

Not all codes are created equal. Let C_1 and C_2 be two codes for a set of messages X.

1. We call C_1 more efficient than C_2 if for all x ∈ X, L_1(x) ≤ L_2(x), while for at least one x ∈ X, L_1(x) < L_2(x).

2. We call a code C for set X complete if there does not exist a code C' that is more efficient than C.

A code is complete when it does not waste any bits

The Most Important Slide

We only care about code lengths.

Actual code words are of no interest to us whatsoever.

Our goal is measuring complexity, not to instantiate an actual compressor.

My First Code

Let us consider a sequence S over a discrete alphabet X = {x_1, x_2, …, x_m}.

As code C for S we can instantiate a block code, identifying the value of s_i ∈ S by an index over X, which requires a constant number of log_2 |X| bits per message in S, i.e., L(x_i) = log_2 |X|.

We can always instantiate a prefix-free code with code words of length L(x_i) = log_2 |X|.
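As a small illustration of the block-code idea, here is a minimal Python sketch (the alphabet X and sequence S are made-up examples, not from the slides): every symbol is encoded as a fixed-length binary index, which takes ⌈log_2 |X|⌉ bits per message.

```python
import math

def block_code(alphabet):
    """Assign each symbol a fixed-length binary index of ceil(log2 |X|) bits."""
    width = math.ceil(math.log2(len(alphabet)))
    return {x: format(i, f'0{width}b') for i, x in enumerate(alphabet)}

# Hypothetical example: a 4-symbol alphabet needs log2(4) = 2 bits per message.
X = ['a', 'b', 'c', 'd']
C = block_code(X)
print(C)                              # {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
S = 'abacabad'
print(sum(len(C[s]) for s in S))      # 8 messages * 2 bits = 16 bits
```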

Codes in a Tree

[Figure: a binary code tree; from the root, branch 0 leads to leaves 00 and 01, branch 1 leads to leaves 10 and 11]

Beyond Uniform

What if we know the distribution P(x_i) of x_i ∈ X over S, and it is not uniform?

We do not want to waste any bits, so using block codes is a bad idea.

We do not want to introduce any undue bias, so

we want an efficient code that is uniquely decodable without having to use arbitrary length stop-words.

We want an optimal prefix-code.

Prefix Codes

A code C is a prefix code iff there is no code word C(x) that is an extension of another code word C(x′).

In other words, C defines a binary tree with the code words as its leaves.

How do we find the optimal tree?


Shannon Entropy

Let P(x_i) be the probability of x_i ∈ X in S; then

H(S) = -\sum_{x_i \in X} P(x_i) \log P(x_i)

is the Shannon entropy of S (wrt X)

(see Shannon 1948)

Here P(x_i) is the ‘weight’, how often we see x_i; -\log P(x_i) is the number of bits needed to identify x_i under P; and H(S) is the average number of bits needed per message s_i ∈ S.
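A minimal Python sketch of this definition (the example sequences are made up for illustration): estimate P from the symbol counts in S and average -\log_2 P(x_i) over the messages.

```python
from collections import Counter
from math import log2

def shannon_entropy(S):
    """H(S) = -sum_x P(x) log2 P(x), with P estimated from the symbol counts in S."""
    counts = Counter(S)
    n = len(S)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Hypothetical examples: a fair coin sequence costs 1 bit per message on average,
# a heavily skewed one costs much less.
print(shannon_entropy('01' * 500))          # 1.0
print(shannon_entropy('0' * 999 + '1'))     # ~0.0114
```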

Optimal Prefix Code Lengths What if the distribution of 𝑋 in 𝑆 is not uniform?

Let 𝑃(𝑥𝑖) be the probability of 𝑥𝑖 in 𝑆, then

L(x_i) = -\log P(x_i)

is the length of the optimal prefix code for message 𝑥𝑖 knowing distribution 𝑃

(see Shannon 1948)

Kraft’s Inequality

For any uniquely decodable code C for a finite alphabet X = {x_1, …, x_m}, the code word lengths L_C(\cdot) must satisfy the inequality

\sum_{x_i \in X} 2^{-L(x_i)} \le 1

a) when a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths

b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space

c) when it does not hold, the code is not uniquely decodable
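A tiny Python sketch of the Kraft sum (the example length sets are made up): compute \sum 2^{-L(x_i)} and compare it to 1.

```python
def kraft_sum(lengths):
    """Sum of 2^(-L(x)) over all code word lengths; <= 1 for a uniquely decodable code."""
    return sum(2 ** -l for l in lengths)

# Hypothetical examples:
print(kraft_sum([2, 2, 2, 2]))   # 1.0  -> complete code (e.g. the block code 00, 01, 10, 11)
print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> complete code with unequal lengths (0, 10, 110, 111)
print(kraft_sum([1, 2, 2, 2]))   # 1.25 -> no uniquely decodable code has these lengths
```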

What’s a bit?

Binary digit: the smallest, most fundamental piece of information, a yes or no.

Introduced by Claude Shannon in 1948; the name was coined by John Tukey.

Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African ‘talking drums’.

Morse code

Natural language

Punishes ‘bad’ redundancy: often-used words are shorter

Rewards useful redundancy:

cotxent alolws mishaireng/raeding

African talking drums have used this for efficient, fast, long-distance communication:

they mimic the vocalized sounds of a tonal language, making them a very reliable means of communication.

Measuring bits

How much information does a given string carry? How many bits?

Say we have a binary string of 100000 ‘messages’

1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000

Obviously, all four are 100000 bits long. But are they worth those 100000 bits?

So, how many bits?

Depends on the encoding!

What is the best encoding? One that takes the entropy of the data into account:

things that occur often should get a short code, things that occur seldom should get a long code.

An encoding matching Shannon Entropy is optimal

Tell us! How many bits? Please?

In our simplest example we have

P(1) = 1/100000
P(0) = 99999/100000

|code(1)| = -\log_2(1/100000) ≈ 16.61 bits
|code(0)| = -\log_2(99999/100000) ≈ 0.0000144 bits

So, knowing P, our string contains

1 × 16.61 + 99999 × 0.0000144 ≈ 18.05 bits

of information.
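The same numbers can be checked with a few lines of Python (a sketch of the slide’s calculation, nothing more):

```python
from math import log2

# The skewed string from the slides: a single 1 among 99,999 zeros.
p1, p0 = 1 / 100_000, 99_999 / 100_000

len1 = -log2(p1)   # ~16.61 bits for the rare symbol
len0 = -log2(p0)   # ~0.0000144 bits for the common symbol

total = 1 * len1 + 99_999 * len0
print(len1, len0, total)   # ~16.61, ~1.44e-05, ~18.05 bits in total
```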

Optimal….

Shannon lets us calculate optimal code lengths, but what about actual codes? 0.0000144 bits?

Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the entropy, but not of lowest expected length.

Fano gave his students an option: take the regular exam, or invent a better encoding.

David Huffman didn’t like exams and invented Huffman codes (1952), optimal for symbol-by-symbol encoding with fixed probabilities.

(arithmetic coding is overall optimal, Rissanen 1976)
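For completeness, a compact Python sketch of Huffman’s construction (a generic textbook version, not code from the lecture; the example distribution is made up): repeatedly merge the two least probable subtrees, prepending a bit to each.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code (symbol -> bit string) for a dict of symbol probabilities."""
    tiebreak = count()
    heap = [(p, next(tiebreak), {sym: ''}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Prepend a 0 to one subtree and a 1 to the other, then merge them.
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical example: frequent symbols get short codes, rare ones long codes.
P = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
code = huffman_code(P)
print(code)                                          # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(sum(P[s] * len(w) for s, w in code.items()))   # 1.75 bits = entropy of P (dyadic case)
```

For this dyadic example the expected code length equals the entropy exactly; for general probabilities Huffman stays within one bit of it.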

Optimality

To encode optimally, we need the true probabilities.

What happens if we don’t have them?

Measuring Divergence

Kullback-Leibler divergence from 𝑄 to 𝑃, denoted by 𝐷(𝑃 ‖ 𝑄), measures the number of bits

we ‘waste’ when we use 𝑄 while 𝑃 is the ‘true’ distribution

D(P ‖ Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
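A minimal Python sketch of D(P ‖ Q) (the two example distributions are made up): the extra bits per message we pay for coding with Q instead of the true P.

```python
from math import log2

def kl_divergence(P, Q):
    """D(P || Q) = sum_i P(i) log2(P(i) / Q(i)), in bits; assumes Q(i) > 0 wherever P(i) > 0."""
    return sum(p * log2(p / Q[i]) for i, p in P.items() if p > 0)

# Hypothetical example: coding a biased coin with the uniform code wastes bits.
P = {'0': 0.9, '1': 0.1}    # 'true' distribution
Q = {'0': 0.5, '1': 0.5}    # distribution we (wrongly) use for coding
print(kl_divergence(P, Q))  # ~0.531 bits wasted per message
print(kl_divergence(P, P))  # 0.0 -- no waste when we use the true distribution
```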

Multivariate Entropy

So far we’ve been thinking about a single sequence of messages

How does entropy work for

multivariate data?

Simple!

Towards Mutual Information

Conditional Entropy is defined as

H(X \mid Y) = \sum_{y \in Y} P(y) \, H(X \mid Y = y)

the ‘average number of bits needed for message x ∈ X knowing Y’


Mutual Information

the amount of information shared between two variables X and Y

I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

        = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}

high I(X, Y) implies strong dependence; I(X, Y) = 0 holds exactly when X and Y are independent

Information is symmetric!
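A short Python sketch computing I(X, Y) from a joint distribution (the joint tables are made-up examples): marginalize to get P(x) and P(y), then apply the double sum above.

```python
from math import log2

def mutual_information(joint):
    """I(X,Y) = sum_{x,y} P(x,y) log2(P(x,y) / (P(x) P(y))) from a joint distribution."""
    Px, Py = {}, {}
    for (x, y), p in joint.items():
        Px[x] = Px.get(x, 0) + p
        Py[y] = Py.get(y, 0) + p
    return sum(p * log2(p / (Px[x] * Py[y])) for (x, y), p in joint.items() if p > 0)

# Hypothetical examples:
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
copy        = {(0, 0): 0.5, (1, 1): 0.5}      # Y is an exact copy of X
print(mutual_information(independent))        # 0.0 bits
print(mutual_information(copy))               # 1.0 bit = H(X)
```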

Information Gain (small aside)

Entropy and KL are used in decision trees

What is the best split in a tree?

One that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy.

How do we compare over multiple options?

IG(T, a) = H(T) - H(T \mid a)
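A small Python sketch of this information-gain computation (labels and split values are made-up examples): H(T) minus the weighted entropy of the labels in each branch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split):
    """IG(T, a) = H(T) - H(T | a): label entropy minus the weighted
    entropy of the labels within each branch of the split."""
    n = len(labels)
    branches = {}
    for lab, value in zip(labels, split):
        branches.setdefault(value, []).append(lab)
    h_cond = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - h_cond

# Hypothetical example: a split that separates the labels perfectly gains H(T) = 1 bit.
labels = ['+', '+', '-', '-']
print(information_gain(labels, ['a', 'a', 'b', 'b']))   # 1.0
print(information_gain(labels, ['a', 'b', 'a', 'b']))   # 0.0
```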

Low-Entropy Sets

Goal: find sets of attributes that interact strongly

Task: mine all sets of attributes such that the entropy over their value instantiations is ≤ σ

Low-Entropy Sets

Theory of Computation   Probability Theory 1   count
No                      No                     1887
Yes                     No                      156
No                      Yes                     143
Yes                     Yes                     219

entropy over the value instantiations: 1.087 bits

(Heikinheimo et al. 2007)
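As a sanity check, the entropy over the four value instantiations in this table can be recomputed directly from the counts shown above; a quick Python sketch:

```python
from math import log2

# Counts from the table: Theory of Computation x Probability Theory 1.
counts = [1887, 156, 143, 219]
n = sum(counts)
H = -sum(c / n * log2(c / n) for c in counts)
print(H)   # ~1.087 bits, the value reported on the slide
```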

Low-Entropy Sets

Maturity Test   Software Engineering   Theory of Computation   count
No              No                     No                      1570
Yes             No                     No                        79
No              Yes                    No                        99
Yes             Yes                    No                       282
No              No                     Yes                       28
Yes             No                     Yes                      164
No              Yes                    Yes                       13
Yes             Yes                    Yes                      170

(Heikinheimo et al. 2007)

Low-Entropy Trees

[Figure: an example low-entropy tree over the courses Scientific Writing, Maturity Test, Software Engineering, Project, Theory of Computation, and Probability Theory 1]

(Heikinheimo et al. 2007)

Define the entropy of a tree T = (A, T_1, …, T_k) as

H_U(T) = H(A \mid A_1, \dots, A_k) + \sum_j H_U(T_j)

where A_j is the attribute at the root of subtree T_j.

The tree T for an itemset A minimizing H_U(T) identifies directional explanations!

H(A) ≤ H(SW \mid MT, SE, TC, PT) + H(MT \mid SE, TC, PT) + H(SE) + H(TC \mid PT) + H(PT)

(SW = Scientific Writing, MT = Maturity Test, SE = Software Engineering, TC = Theory of Computation, PT = Probability Theory 1)

Entropy for Continuous Values

So far we only considered discrete-valued data

Lots of data is continuous-valued

(or is it?)

What does this mean for entropy?

Differential Entropy

h(X) = -\int_{X} f(x) \log f(x) \, dx

(Shannon, 1948)

Differential Entropy

How about… the entropy of Uniform(0, 1/2)?

h(X) = -\int_0^{1/2} 2 \log 2 \, dx = -\log 2

Hm, negative?
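A quick numerical check in Python (a sketch, using log base 2 so the answer comes out in bits; the step size dx is an arbitrary choice): approximating the integral with a Riemann sum indeed gives about -1 bit.

```python
from math import log2

# Riemann-sum approximation of h(X) = -integral of f(x) log2 f(x) dx for Uniform(0, 1/2),
# where the density is f(x) = 2 on (0, 1/2) and 0 elsewhere.
dx = 1e-5                      # step size: an arbitrary small choice
steps = int(0.5 / dx)
h = -sum(2 * log2(2) * dx for _ in range(steps))
print(h)                       # ~ -1.0: differential entropy can indeed be negative
```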

Differential Entropy

In discrete data the step size ‘dx’ is trivial. What is its effect here?

h(X) = -\int_{X} f(x) \log f(x) \, dx

(Shannon, 1948)

Impossibru?

No.

But you’ll have to wait

till next week for the answer.

Conclusions

Information is related to the reduction in uncertainty of what you could say.

Entropy is a core aspect of information theory, with lots of nice properties: optimal prefix-code lengths, mutual information, etc.

Entropy for continuous data is… trickier: differential entropy is a bit problematic.

Thank you!