Date posted: 12-Apr-2017
Category: Science
Uploaded by: nidhal-el-abbadi
October 12, 2015 1 [email protected]
Contents
Introduction
What is information
Define data compression
History of compression technologies
Why do we still need compression
Why is it possible to compress data
Lossless vs. lossy compression
Compression performance
Intuitive compression
How many bits per symbol
Information theory
Entropy
Helpful Knowledge
• Algorithm Design and Analysis
• Probability
Resources
Text Book
• Khalid Sayood, Introduction to Data Compression, Fourth Edition,
Morgan Kaufmann Publishers, 2012.
• David Salomon, Data Compression: The Complete Reference,
Fourth Edition, Springer-Verlag London Limited, 2007.
Papers and Sections from Books
Introduction
Data can be characters in a text file, numbers that are samples of speech
or image waveforms, or sequences of numbers that are generated by
other processes.
Typical examples of the kinds of data you may want to compress include:
• text
• source code
• arbitrary files
• images
• video
• audio data
• speech
Obviously, these data differ considerably in terms of data volume, data
structure, intended usage, etc.
Introduction
A representation of data is a combination of information and redundancy.
Information is the portion of data that must be preserved permanently in
its original form in order to correctly interpret the meaning or purpose of the
data.
Redundancy is the portion of data that can be removed when it is not
needed, or reinserted to interpret the data when needed. Most often, the
redundancy is reinserted in order to regenerate the original data in its
original form.
What is Information
o Analog data
Also called continuous data
Represented by real numbers (or complex numbers)
o Digital data
Finite set of symbols {a1, a2, ..., am}
All data represented as sequences (strings) over the symbol set.
Example: {a,b,c,d,r} → abracadabra
Digital data can be an approximation to analog data.
Symbols
o Roman alphabet plus punctuation
o ASCII – 128 symbols (256 in extended ASCII)
o Binary – {0, 1}
0 and 1 are called bits
All digital information can be represented efficiently in binary.
Example: {a,b,c,d} with a fixed-length representation needs
2 bits per symbol (e.g. a=00, b=01, c=10, d=11).
Define Data Compression
Data compression is essentially a redundancy-reduction technique.
Data compression is the art of reducing the number of bits needed to
store or transmit data.
Data compression is the process of converting an input data stream
(the source stream, or the original raw data) into another data stream (the
output, the bit-stream, or the compressed stream) that has a smaller
size.
History of compression technologies
• 1st century B.C.: steganography.
• 19th century: Morse and Braille alphabets.
• 1950s: compression technologies exploiting statistical redundancy are
developed – bit patterns of varying length are used to represent
individual symbols according to their relative frequency.
• 1970s: dictionary algorithms are developed – symbol sequences are
mapped to shorter indices using dictionaries.
• 1970s: with the ongoing digitization of telephone lines, telecommunication
companies became interested in procedures for fitting more channels onto a
single wire.
• Early 1980s: fax transmission over analog telephone lines.
• 1980s: the first applications involving digital images appear on the market;
the "digital revolution" starts with compressing audio data.
• 1990s: video broadcasting, video on demand, etc.
Why do we still need compression?
The reason we need data compression is that more and more of the
information that we generate and use is in digital form – consisting of
numbers represented by bytes of data.
Compression technology is employed:
• to use storage space efficiently,
• to save on transmission capacity,
• to save on transmission time,
• in some cases, to reduce computation.
Basically, it's all about saving resources and money.
Why is it possible to compress data?
Compression is enabled by statistical and other properties of most data types;
however, some data types cannot be compressed, e.g. various kinds of noise
or encrypted data. Compression-enabling properties are:
• Statistical redundancy: in non-compressed data, all symbols are represented
with the same number of bits regardless of their relative frequency (fixed-
length representation).
• Correlation: adjacent data samples tend to be equal or similar (e.g. think of
images or video data).
There are different types of correlation:
– positive correlation
– negative correlation
– perfect correlation
In addition, many data types contain a significant amount of irrelevancy, since
the human brain is not able to process and/or perceive the entire amount of data.
As a consequence, such data can be omitted without degrading perception.
Furthermore, some data contain more abstract properties that are independent
of time, location, and resolution and can be described very efficiently (e.g. fractal
properties).
• A digital compression system requires two algorithms: compression of the data
at the source (encoding), and decompression at the destination (decoding).
• For stored multimedia data, compression is usually done once, at storage
time at the server, and decoding is done in real time upon viewing.
Data Compression Methods
Data compression is about storing and sending a smaller number of
bits.
There are two major categories of data compression methods:
lossless and lossy.
Lossless vs. Lossy Compression
Lossy methods are used for compressing images and video files (our eyes
cannot distinguish subtle changes, so losing some data is acceptable).
These methods are cheaper and require less time and space.
Lossless compression techniques, as their name implies, involve no loss of
information. If data have been losslessly compressed, the original data can be
recovered exactly from the compressed data. Lossless compression is generally
used for applications that cannot tolerate any difference between the original and
reconstructed data.
In lossless methods, the original data and the data after compression and
decompression are exactly the same.
Redundant data is removed in compression and added during
decompression.
Lossless methods are used when we can’t afford to lose any data: legal
and medical documents, computer programs.
• Lossless compression: X̂ = X
– Also called entropy coding, or reversible coding.
• Lossy compression: X̂ ≠ X
– Also called irreversible coding.
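The lossless condition X̂ = X can be checked directly with any reversible codec. A minimal sketch in Python, using the standard-library zlib codec as an illustrative stand-in for a lossless compressor:

```python
import zlib

# Lossless compression: decompressing the compressed stream must
# recover the original data exactly (X_hat == X).
original = b"abracadabra" * 100          # highly redundant input
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)   # X_hat

assert restored == original              # X_hat == X: no information lost
print(len(original), "->", len(compressed), "bytes")
```

No such check is possible for a lossy codec: by definition its output differs from the input, and quality must be judged by other measures.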
Compression performance
A very logical way of measuring how well a compression algorithm
compresses a given set of data is to look at the ratio of the number of bits
required to represent the data before compression to the number of bits
required to represent the data after compression.
Another way of reporting compression performance is to provide the
average number of bits required to represent a single sample. This is
generally referred to as the rate.
Compression ratio = (size of the output stream) / (size of the input stream)
Compression factor = (size of the input stream) / (size of the output stream)
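The two measures are reciprocals of each other; a small sketch makes the distinction concrete (the helper names and the stream sizes are illustration values of mine, not from the slides):

```python
def compression_ratio(input_size, output_size):
    # output / input: 0.6 means the output needs 60% of the original bits.
    return output_size / input_size

def compression_factor(input_size, output_size):
    # input / output: the reciprocal of the ratio; larger is better.
    return input_size / output_size

print(compression_ratio(1000, 600))    # 0.6
print(compression_factor(1000, 600))   # 1.666...
```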
Speed: when evaluating data compression algorithms, speed is always stated in
terms of uncompressed data handled per second. For streaming audio and
video, the codec must keep up with the real-time data rate.
Energy:
There has been little research on the amount of energy used by compression
algorithms.
In some sensor networks, the purpose of compression is to save energy: by
spending a little energy in the CPU compressing the data, so that there are fewer
bytes to transmit, we save energy in the radio – it can be turned on less often,
for shorter periods of time, or both.
Latency
Latency refers to a short period of delay (usually measured in milliseconds)
between when an audio signal enters a system and when it emerges.
Compression adds two kinds of latency, compression latency and decompression
latency, both of which add to end-to-end latency.
Space: sometimes programmers need to know how much RAM the
algorithm needs to run.
Intuitive Compression
Braille code
The Braille code consists of groups (or cells) of 3 × 2 dots each, embossed on
thick paper. Each of the 6 dots in a group may be flat or raised, implying that
the information content of a group is equivalent to 6 bits, resulting in
2^6 = 64 possible groups.
Irreversible Text Compression
Sometimes it is acceptable to “compress” text by simply throwing away
some information.
This is called irreversible text compression or compaction. The
decompressed text will not be identical to the original, so such methods
are not general purpose; they can only be used in special cases.
Ad Hoc Text Compression
Here are some simple, intuitive ideas for cases where the compression
must be reversible (lossless).
If the text contains many spaces but they are not clustered, they may be
removed and their positions indicated by a bit-string that contains a 0 for
each text character that is not a space and a 1 for each space. Thus, the
text
Here are some ideas
is encoded as the bit-string
"0000100010000100000"
followed by the text
Herearesomeideas
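The scheme above, one flag bit per character plus the space-stripped text, can be sketched in a few lines of Python (the function names are mine):

```python
def encode_spaces(text):
    # One flag bit per character: '1' for a space, '0' otherwise,
    # plus the text with all spaces removed.
    flags = "".join("1" if ch == " " else "0" for ch in text)
    return flags, text.replace(" ", "")

def decode_spaces(flags, packed):
    # Re-insert a space wherever the flag bit is '1'.
    chars = iter(packed)
    return "".join(" " if bit == "1" else next(chars) for bit in flags)

flags, packed = encode_spaces("Here are some ideas")
print(flags)    # 0000100010000100000
print(packed)   # Herearesomeideas
assert decode_spaces(flags, packed) == "Here are some ideas"
```

Note that this only pays off when spaces are frequent: the flag string itself costs one bit per character of the original text.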
Packing
Since ASCII codes are essentially 7 bits long, the text may be compressed by
writing 7 bits per character instead of 8 on the output stream. This may be called
packing. The compression ratio is, of course, 7/8 = 0.875.
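Packing can be sketched directly: concatenate the low 7 bits of each character, then emit the stream 8 bits at a time, so 8 characters fit in 7 bytes. This is a simplified illustration (the function names are mine; real packers work on bit buffers rather than strings):

```python
def pack7(text):
    # Concatenate the 7-bit codes of the ASCII characters into one
    # bitstream, pad to a byte boundary, and emit it as bytes.
    bits = "".join(format(ord(ch), "07b") for ch in text)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack7(data, n_chars):
    # Read the bitstream back 7 bits at a time.
    bits = "".join(format(b, "08b") for b in data)
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, 7 * n_chars, 7))

msg = "packing!"                      # 8 characters
packed = pack7(msg)
assert len(packed) == 7               # 7/8 of the original 8 bytes
assert unpack7(packed, len(msg)) == msg
```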
Dictionary data
(or any list sorted lexicographically) can be compressed using the concept of front
compression. This is based on the observation that adjacent words in such a list
tend to share some of their initial characters. A word can therefore be compressed
by dropping the n characters it shares with its predecessor in the list and replacing
them with n.
a           a
aardvark    1ardvark
aback       1back
abaft       3ft
abandon     3ndon
abandoning  7ing
abasement   3sement
abandonment 3ndonment
abash       3sh
abated      3ted
abate       5
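The front-compression rule above is easy to sketch in Python (a minimal illustration; the helper name is mine):

```python
def front_compress(words):
    # Replace the prefix each word shares with its predecessor by the
    # prefix length, keeping only the differing tail.
    out, prev = [], ""
    for w in words:
        n = 0
        while n < min(len(w), len(prev)) and w[n] == prev[n]:
            n += 1
        out.append((str(n) if n else "") + w[n:])
        prev = w
    return out

words = ["a", "aardvark", "aback", "abaft", "abandon", "abandoning"]
print(front_compress(words))
# ['a', '1ardvark', '1back', '3ft', '3ndon', '7ing']
```

The output reproduces the first rows of the table above; decompression simply reverses the rule by copying the stated number of characters from the previous word.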
How many Bits Per Symbol?
• Suppose we have n symbols. How many bits (as a function of n) are
needed to represent a symbol in binary?
– First, try n a power of 2: we need exactly log2(n) bits.
Discussion: Non-Powers of Two
• Can we do better than a fixed-length representation for non-powers of
two?
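For a fixed-length code the answer is ceil(log2 n) bits; a quick Python check (the helper name is mine):

```python
import math

def bits_per_symbol(n):
    # A fixed-length binary code for an n-symbol alphabet needs
    # ceil(log2 n) bits; even a 1-symbol alphabet gets at least 1 bit here.
    return max(1, math.ceil(math.log2(n)))

for n in (2, 4, 5, 8, 26):
    print(n, "symbols ->", bits_per_symbol(n), "bits")
```

For n = 5 a fixed-length code wastes part of the 3-bit code space (only 5 of 8 codewords are used); variable-length codes, whose limits entropy describes, can do better on average.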
Information Theory
• Developed by Shannon in the 1940s and 1950s.
• Attempts to explain the limits of communication using probability theory.
Information theory uses the term entropy as a measure of how much
information is encoded in a message. The word entropy was borrowed from
thermodynamics, and it has a similar meaning: the higher the entropy of a
message, the more information it contains. The entropy of a symbol is
defined as the negative logarithm of its probability. To determine the
information content of a message in bits, we express the entropy using the
base-2 logarithm (base e or base 10 may also be used, giving different units).
Entropy
The entropy of a message (flow of information) is its amount of
uncertainty; it increases when the message is closer to random, and
decreases when it is less random.
We define the entropy of a random variable X, taking values in the
alphabet X, as
H(X) = − Σx∈X P(x) log2 P(x)
• The base-2 logarithm measures the entropy in bits. The intuition is
that entropy describes the "compressibility" of the source.
• H is the average number of bits required to code a symbol,
given that all we know is the probability distribution of the symbols.
• H is the Shannon lower bound on the average number of bits to
code a symbol in this "source model".
The entropy H(Fi) of any particular letter Fi in a file F is
H(Fi) = − log2 P(Fi)
(this is the number of bits required to represent that letter using an entropy coder).
The entropy H(F) of the entire file is the sum of the entropies of the letters in the file,
H(F) = Σi H(Fi) = − Σi log2 P(Fi)
(the number of bits required to represent the entire file is the sum of the number of bits
required to represent each letter in that file).
Entropy is a measure of unpredictability. Understanding entropy not only helps you
understand data compression, but can also help you choose good passwords and
avoid easily guessed passwords.
In terms of the number n of unique possible letters (any particular letter in
the file is one of a list of n possible letters x1, ..., xn, any of which may occur
0, 1, or up to N times in a file of length N), the same file entropy can be written as
H(F) = − Σi=1..n ci log2(ci / N)
where ci is the number of occurrences of letter xi.
Example 1. Let X be uniformly distributed on {1, ..., 16}. Note that we
intuitively need 4 bits to represent the values of X. The entropy is
H(X) = − Σx (1/16) log2(1/16) = log2(16) = 4 bits.
Example 2. 8 horses in a race with winning probabilities.
Example 3. {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
– inf(a) = log2(8) = 3
– inf(b) = log2(4) = 2
– inf(c) = log2(8/5) ≈ 0.678
• Receiving an "a" carries more information than receiving a "b" or "c".
Theorem (Source Coding Theorem): Roughly speaking, H(X) is the
minimum rate at which we can compress a random variable X and
recover it fully.
Data with low entropy permit a larger compression ratio than data with
high entropy.
• Consider the message: HELLO WORLD!
– The letter L has a probability of 3/12 = 1/4 of appearing in this message. The
number of bits required to encode this symbol is −log2(1/4) = 2.
• Using our formula, H = −Σ P(xi) log2 P(xi), the average entropy of the entire
message is 3.022 bits per character.
– This means that the theoretical minimum number of bits per character is 3.022.
• Theoretically, the message could be sent using only 37 bits (3.022 × 12 ≈ 36.26).
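The HELLO WORLD! figures can be reproduced with a few lines of Python (the helper name is mine):

```python
import math
from collections import Counter

def entropy_bits(message):
    # H = -sum(P(x) * log2(P(x))), with P estimated from symbol counts.
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h = entropy_bits("HELLO WORLD!")
print(round(h, 3))        # 3.022 bits per character
print(round(h * 12, 2))   # 36.26 bits for the whole message
```

A message of all-identical symbols has entropy 0 (perfectly predictable), while a message whose 12 symbols are all distinct would reach log2(12) ≈ 3.585 bits per character.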