Information TheoryDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
What is Information?
The fundamental paradigm of information theory is represented here: A sender conveys information to a receiver across a communication channel.
Informationis any content to be conveyed. It is sent in the form of one or more messages.
Information theory attempts to quantify the content of messages and the capacity of channels on which information may flow
Sender
Receiver
Channel
Some Questions
Information theory asks questions such as the following:
1. How much information is encoded in a particular message?2. How efficiently can a given alphabet/language be
transmitted?3. What is the maximum capacity of a given transmission
medium?4. How is that capacity or efficiency reduced by
interference/noise?
These questions all have very precise and deep answers, that are beyond our scope here. We’ll only scratch the surface.
Why Do We Care?
Information theory is very important in computer science. It affects all communication, hardware design, protocol design, cryptography, fault-tolerance, etc.
For example, in our current context it is useful to know how much information can be transmitted over a specific covert channel. This is the “bandwidth” of the channel.
A useful way of measuring bandwidth is bits per second.
Quantifying Information: Thought ExperimentHow do you quantify the information content of a response to a “yes” or “no” question?
What does that even mean? Maybe it means the following:
1. Sender has either a “yes” or a “no.”2. Receiver knows that Sender has one of those two possibilities, but
not which.3. Sender wants to transmit enough data, but no more, to resolve
entirely Receiver’s uncertainty.4. How much data must Sender transmit?
In those terms the answer is pretty clear: One bit of data. But what must be true for Receiver to interpret the answer?
Thought Experiment Lessons
This suggests:
1. In some cases, it is possible to quantify the information content of a message.
2. Maybe an appropriate unit of information content is bits.
3. Sender and receiver must have some shared knowledge, included an agreed encoding scheme
Lessons
• Information is any content that can be conveyed from a sender to a receiver across a communication channel.
• Information theory attempts to quantify the amount of information in a message and the capacity of the channel.
• Communication requires some shared knowledge.
Information ContentDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Quantifying Information
How much information is contained in each of the following messages? How would you decide?
• An n-bit binary number.• A single decimal digit.• A two digit decimal number.• “The attack is at dawn.”
Sender
Receiver
Channel
Quantifying Information
Suppose Sender and Receiver have agreed that Sender will transmit one of exactly 16 messages. How many bits of information must be transmitted to convey the message? Why?
Sender
Receiver
Channel
Binary Tree Representation
• Suppose we represent all possible choices in any decision as the leaves of a “complete” binary tree. The longest path is , if there are k leaves on the tree.
• Each choice (bit) reduces the search space by half.
Encoding
That suggests that the information content of any message from a space of k messages should be something like:
But to get a very efficient transmission:• Sender and Receiver must know in advance the space of
possible transmissions.• They must have agreed on an encoding.
How Much Information?
How much information is contained in the message: “The attack is at dawn”?
Answer: It depends on the Receiver’s level of uncertainty.
• If the only uncertainty were whether at dawn or dusk: one bit.• If the attack could have come anytime during the day: ? bits.• If the day was uncertain...: ? bits.
“The word ‘information’ relates not so much to what you do say, as to what you could say. ... That is, information is the measure of your freedom of choice when you select a message.” –Warren Weaver
Lessons
• The information content of a message is the amount of uncertainty it resolves.
• In an ideal situation, each bit transmitted can reduce the uncertainty by half.
• Very few circumstances are ideal in the required sense
Exploring EncodingsDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Quantifying Information
Information content is commonly measured in terms of bits. “Bit” has two connotations, which are not the same:
bit1: a binary digit (discrete);bit2: a quantity of information (continuous).
The information content of a message is measured in bit2s and the capacity of a channel in bit2s per second (bps).
In general, the best way of transmitting (encoding) a series of messages is that way that minimizes the number of bit2s required, on average.
Finding an Encoding
Four bits are adequate to encode 16 possible messages: M0, . . ., M15:
Call this the naïve encoding.• Can we do better for one message? What would that mean?• How about transmitting n messages, each of which is one of 16
possible values?
Probability and Encoding
Suppose you need to send 1000 messages, each of which can be one of 16 possibilities. But on average 99.5% will be message 10.
Question: Does it still require 4 × 1000 = 4000 bits to send your 1000 messages?
Answer: It is possible to come up with an encoding that will do better on average than the naïve encoding.
Note, when we talk about sending 1000 messages, we’ve gone from talking about the information content of a message to talking about that of a language.
A Better Encoding
• Use the following encoding:
• Given 1000 messages, on average 995 of them will be message 10, and 5 will be other messages. This encoding takes 995 + (5 · 5) = 1020 bits or 1.02 bits per message.
Some Observations
• Our encoding is pretty good, but can we do even better? Is there a limit to how well we can do?
• Computing the number of bits per message depends on knowing the prior probabilities—how often each message appears in an arbitrarily long sequence of messages.
• The “on average” part is important; some sequences would be less efficient under our encoding.
• We used the “naïve encoding” as our benchmark, but there are much worse encodings.
• Is it possible to find an optimal encoding? What would that mean?
Lessons
• “Bit” has two distinct meanings that are easily confused.
• For any language, one can find a naïve encoding that will work, but it’s often possible to do better.
• “Doing better” means using fewer bits, on average, to transmit messages in the language.
Languages & EncodingsDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Languages
A language may refer to a natural language such as English or Japanese. But it may also refer to any set of symbols used to form “strings.”
Often, we’ll use “language” to refer to the results of a series of experiments.
Example 1: Assume a two-sided coin. Symbols in the language are “H” (heads) and “T” (tails). A string describes the result of repeatedly flipping the coin. E.g., “HHTHTTTHTHTTHTT...”
Example 2: Assume a six-sided die. Symbols in the language are “1”, “2”, “3”, “4”, “5”, “6”. A string describes the result of repeatedly rolling the die. E.g., “425146261435261...”
Encoding Properties
For any language, our goal is a binary encoding, with the following properties:
lossless: it must be possible to recover the entire original sequence of symbols from the transmission;
uniquely decodable: for any encoded string, there must be only one possible decoding;
streaming: there should be no breaks in the encoding.
Unique Decodability
Suppose you roll a 6-sided die repeatedly. The following are some possible codes. Are they lossless, uniquely decodable, streaming? Which is “best”?
Sufficient (but not necessary) for unique decodability is the property of being prefix-free: That is, the string representing any symbol cannot be an initial prefix of the string representing any other symbol.
Unique DecodabilityImagine a language containing the symbols A, B, C. What’s wrong with the following encoding? A clue is that it is not prefix-free.
An encoding can fail to have the prefix-free property, but still be uniquely decodable.
Parsing such a language may require arbitrary “look-ahead.”
Sym Code
A 1
B 0
C 10
Sym Code
A 1
C 1111111110
Finding a Coding
How do you come up with an efficient encoding? Use fewer bits for the symbols that occur more frequently.
Samuel Morse knew this instinctively. But Morse code doesn’t satisfy our criteria for encodings. Why not?
Huffman coding is guaranteed to find an efficient code for a given language if you know the probabilities of symbols in the language.
E . S …
T - R . - .
M -- Q --.-
Lessons
• We use “language” for any scheme to generate a series of symbols.
• For any language, we want an efficient binary encoding that is lossless, uniquely decodable and streaming.
• Huffman encoding will provide such an encoding assuming we know the probabilities of symbols in the language.
EntropyDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
An Encoding Example
Recall the following language:
Example 1: Assume a two-sided coin. Symbols in the language are “H” (heads) and “T” (tails). A string describes the result of repeatedly flipping the coin. E.g., “HHTHTTTHTHTTHTT...”
Suppose you know further that it’s a fair coin.
The following is the naïve encoding for this language. Is there a better encoding? How would you know?
Result Prob Code
H 1/2 0
T 1/2 1
Entropy
Claim: If you know the probabilities of symbols in the language, you can compute the average information content of a symbol in the language.
For our fair coin example, the computed answer is 1 bit per symbol.
Definition: The entropy of a language is a measure of the information content of an average symbol in the language.
Computing EntropyEntropy H is computed as follows. If pi is the probability of the ith symbol in the language, then
From now on, all log’s are base 2.Example: Consider our fair coin example. Find the entropy.
Solution: There are two symbols, each with probability 1/2, so:h = −(1/2 × log 1/2 + 1/2 × log 1/2) = 1
Result Prob Code
H 1/2 0
T 1/2 1
What Does It Mean?
What does it mean to say that the entropy of this language is 1?
1. On average, there is one bit of information in each symbol of the language.
2. It is impossible to find an encoding that uses less than one bit per symbol, on average.
3. Any encoding that uses one bit per symbol, on average, is optimal.
Therefore, for this example, the naïve encoding is the optimal encoding.
Entropy Special Case
Whenever you have n symbols, all equally probable, the probability of any of them is 1/n. The language has entropy:
h = −(log 1/n) = log n
For example, a fair die with six sides has entropy:
h = −(log 1/6) = log 6 ≈ 2.58
Hence, it requires 2.58 bits, on average, to transmit the result of a roll.
Another Example
Suppose we have an unbalanced coin that is three times more likely to yield a head than a tail. What is the entropy of this language?
Solution: Again there are two possible outcomes:
The entropy is computed as follows:
h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811
This says that it’s theoretically impossible to encode this language using less than 0.811 bits (on average) to transmit the result of each toss of the coin.
Result Prob
H 3/4
T 1/4
Lessons
• Given the probabilities of symbols in a language, you can compute its entropy.
• Entropy measures the average information content of symbols in the language.
• Entropy sets a lower limit on encoding efficiency.
Example Revisited
Given: an unbalanced coin that is three times more likely to yield a head than a tail.
Solution: There are two possible outcomes:
The entropy is computed as follows:
h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811
Upshot: It’s theoretically impossible to encode this language using less than 0.811 bits per symbol (on average).
But how would you ever do better than 1 bit / symbol?
Result Prob
H 3/4
T 1/4
Example Revisited
Instead of taking single flips as our “experiments,” let’s take pairs (call them “2flips”) and code as follows:
Suppose we flip the coin 32 times; that’s 16 2flips. In an average run, we’d expect: HH to appear 9 times, HT and TH to each appear 3 times, and TT to appear once. Why?
Result Prob Code
HH 9/16 0
HT 3/16 10
TH 3/16 110
TT 1/16 111
Example Revisited
Given 32 flips (16 2flips), we could expect:
For the naïve encoding, using 1 bit / flip, we’d expect to use 32 bits. Our efficiency is 27/32 ≈ 0.844, which is not a bad approximation of the entropy (0.811).
Could we do better? Sure, just use 3flips, 4flips, etc. The entropy is the limit of this process.
Result Count Code Bits
HH 9 0 9
HT 3 10 6
TH 3 110 9
TT 1 111 3
Total: 27bits
Test Your Understanding
Suppose you have a six-sided die that is unbalanced such that 1 and 2 are equally likely; 3 and 4 are equally likely; and 5 and 6 are equally likely. However, the die rolls 1 twice as often as 3, and rolls 3 three times as often as 5.
1. What is the “naïve ” encoding for this language?2. What is the entropy of this language?3. Find an encoding that is more efficient than the naïve
encoding.4. Give a convincing argument that your encoding is more
efficient than the naive encoding.
Hint: There’s no need to encode sequences of rolls
Lessons
• Computing the entropy of a language provides a bound on the efficiency of any encoding.
• But, finding an efficient encoding requires ingenuity.
Fundamental TheoremsDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Source Entropy
The entropy of a language is a measure of the most efficient possible encoding of the language.
The entropy of a message source is the amount of information content that the source can produce in a given period. This is measured in bits per second.
Any channel can transmit an arbitrary amount of information, given enough time and unbounded buffering capacity. But can a given channel transmit the information in real time?
Channel Capacity
The capacity of a channel is the number of bits that can be sent per second over the channel. This is a property of the communication medium.
Fundamental Theorem of the Noiseless Channel. (Shannon): If a language has entropy h (bits per symbol) and a channel can transmit C bits per second, then it is possible to encode the signal is such a way as to transmit at an average rate of (C/h) − ε symbols per second, where ε can be made arbitrarily small. It is impossible to transmit at an average rate greater than C/h.
What the Theorem Means
Suppose a channel can handle 100 bits / second and your language has entropy 5 (bit per symbol).
Given a perfect encoding and a noiseless channel, you’d expect to be able to transmit 20 symbols / second though the channel, on average. Right?
But you may not have a perfect encoding. Doesn’t matter. You can always find a better encoding to get within ε of that limit.
Noisy Channels
If the channel is noisy, the capacity is reduced by the noise. But the following is true:
Fundamental Theorem of a Noisy Channel (Shannon): Let a discrete channel have a capacity C and a discrete source an entropy h (bits per second). If h < C there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors.
If the channel (with noise factored in) can physically handle the message traffic, then it is possible to transmit with arbitrarily small error rate
What this Theorem Means
The upshot of this is that a message can be transmitted reliably over even a very noisy channel by increasing the redundancy of the coding scheme.
For example, covert channels in the system cannot be dismissed with the argument that they are noisy and hence useless. You can always get the message through by finding a more redundant encoding.
Lessons
• Entropy provides a bound on coding efficiency.
• But Shannon’s theorems show that it is always possible to approach that limit arbitrarily closely.
Entropy of EnglishDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Entropy vs. Redundancy
Intuitively, the difference between the efficiency of the encoding and the entropy is a measure of the redundancy in the encoding.
If you find an encoding with efficiency matching the entropy and there is no redundancy.
The standard encoding for English contains a lot of redundancy.Fr xmpl, y cn prbbly gss wht ths sntnc sys, vn wth ll f th vwls mssng. Tht ndcts tht th nfrmtn cntnt cn b xtrctd frm th rmnng smbls.
The E ect of Redundancyff
The following also illustrates the redundancy of English text:
Aoccdrnig to rscheearch at Cmabirgde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh?
Spammers count on the ability of humans to decipher such text, and the inability of computers to do so to defeat anti-spam filters:
Care to order some Vi@gra or Vigara?
Entropy of English: Zero-Order Model
Suppose we want to transmit English text (26 letters and a space). If we assume that all characters are equally likely, the entropy is:
h = −(log 1/27) = 4.75
This is the zero-order model of English.
This gives an approximation to the entropy. But the underlying assumption is clearly false.
Entropy of English: First-Order
In written or spoken English, some symbols occur much more frequently than others.
Assuming that all symbols are independent of one another, but follow the probabilities above, the entropy is 4.219 bits per symbol. This is the “first-order” model of English.
Entropy of English: Higher-OrderThe assumption of independence (zero memory) is also incorrect. Some letters follow other letters frequently; others not at all.
The following shows the most common digrams and trigrams in English.
Entropy of English: Higher-Order
We can compute tables of the likelihood of digrams (two-letter combinations), trigrams, etc. Adding digrams to the computation gives a second-order model; adding trigrams gives a third-order model; etc.
A third-order model yields 2.77 bits per symbol. The actual entropy is the “limit” of this process of taking higher and higher order models.
Estimates by Shannon based on human experiments have yielded values as low as 0.6 to 1.3 bits per symbol.
Lessons
• All natural languages contain significant redundancy.
• Computing the entropy of a natural language is difficult and requires complex models.
Entropy odds and endsDan FleckCS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Entropy Aside
Recall that information content of a message depends on the state of knowledge of the receiver. Hence, entropy is relative to a particular observer.
Consider the entropy of the contents of an envelope marked “Best Picture” at the Academy Awards (assuming 5 nominees):
• If all were equally likely to win, the entropy would be log 5 ≈ 2.322.• For everyone who knows that the odds aren’t even, it’s less, though
hard to compute.• For the auditors who stuffed the envelope, it’s 0 since they have no
uncertainty.
Often, prior probabilities are impossible to compute.
Entropy and Randomness
Note that entropy can be used to measure the amount of “redundancy” in the encoding. If the information content of a message is equal to the length of the encoded message, there is no redundancy.
Some sources define a random string as one that cannot be represented any more efficiently. (I.e., no compression is possible.)
Finding a Coding
Huffman coding is guaranteed to find an efficient code for a given language assuming you know the probabilities of language units.
In fact, it always uses less than one bit per symbol more than the entropy, which is extremely efficient.
Lempel-Ziv is an “adaptive coding” algorithm used in many commercial text compression utilities. It builds an encoding on the fly according to the strings it encounters.
Lempel-Ziv is asymptotically optimal. That is, as the text length tends to infinity, the compression approaches optimal.
Lessons
• The information content of a message is relative to the state of knowledge of an observer.
• If an encoding’s efficiency matches the entropy, there is no redundancy to compress out.
• Huffman coding and the Lempel Ziv algorithms both give highly efficient codes.