Data Compression


Data Compression: With the increased emphasis on full-text databases, the problem of handling large quantities of data becomes significant. Since the time required to search a database depends heavily on the amount of data, efficient operation of an information system requires both organizing the data well and finding as efficient a representation for the data as possible. Thus there is growing interest in data compression.

Why is it needed?

1. Application data keeps growing in size: MP3, MPEG, TIFF, etc.

2. A fax page has about 4 million dots; sending it uncompressed over 56 Kbps takes more than a minute. If the data is compressed by a factor of 10, the transmission time drops to roughly 7 seconds per page.

3. TV / motion pictures use 30 pictures (frames) per second with about 200,000 pixels per frame, and colour pictures require 3 bytes (24 bits) per pixel (RGB). Each frame therefore has 200,000 × 24 = 4.8 Mbits, and a 2-hour movie requires 216,000 frames, so the total for such a movie is 216,000 × 4.8 Mbits ≈ 1.0368 × 10^12 bits (about 130 GB). This is much higher than the capacity of DVDs (a single-layer DVD holds about 4.7 GB).

Without compression, these applications would not be feasible.

A codec is called lossy if data is lost during compression, and lossless if no data is lost during compression. There are two broad approaches:

1. Redundancy reduction (usually lossless): remove redundancy from the message.

2. Information reduction (usually lossy): reduce the total amount of information in the message, which leads to a sacrifice of quality.


Two classes of text compression methods:

Symbol-wise (or statistical) methods estimate the probabilities of symbols (the modelling step) and are usually based on either arithmetic or Huffman coding.

Dictionary methods replace fragments of text with a single code word (typically an index to an entry in the dictionary), e.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string. No probability estimates are needed.

Text Compression

[Figure: text → model + encoder → compressed text → model + decoder → text]


Information Theory

Entropy: Shannon borrowed the definition of entropy from statistical physics to capture the notion of how much information is contained in the whole alphabet. For a set of possible messages S, Shannon defined the entropy as

H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}

where p(s) is the probability of message s. The self-information i(s) = \log_2 \frac{1}{p(s)} represents the number of bits of information contained in the message and, roughly speaking, the number of bits we should use to send that message.

Worked example: for the distribution P(s) = {0.25, 0.25, 0.25, 0.125, 0.125},

H(S) = 3 \times 0.25 \log_2 \frac{1}{0.25} + 2 \times 0.125 \log_2 \frac{1}{0.125} = 1.5 + 0.75 = 2.25 \text{ bits/symbol}

Redundancy is the average codeword length minus the entropy.

Compression ratio is the ratio between the average number of bits per symbol in the original message and the same quantity for the coded message:

C = \frac{\text{average original symbol length}}{\text{average compressed symbol length}}
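As a quick check on the worked entropy example above, here is a minimal sketch (the function name and the hard-coded distribution are illustrative, not from the slides):

```python
import math

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1 / p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Distribution from the worked example above.
print(entropy([0.25, 0.25, 0.25, 0.125, 0.125]))  # 2.25
```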


1: Run-Length Encoding (RLE)

RLE is based on the assumption that a file has a great deal of redundancy; the data is treated as just a string of symbols. RLE is good for fax and voice. In the example, a long run is replaced by the symbol, a marker and a count (a coding sketch appears after the disadvantages below):

ABBCCDDDDDDDDDEEFGGGGG => ABBCCD#9EEFG#5
22 characters => 14 characters, i.e. (22 - 14)/22 ≈ 36% reduction

Disadvantages:
1. We are unable to distinguish compressed text in the file from uncompressed text.
2. Any numeric value will be interpreted as the beginning of a compressed sequence.
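A minimal sketch of a run-length encoder in this style; the '#' marker follows the slide's example, while the minimum run length of 4 (so that encoding a run never expands it) is an assumption:

```python
from itertools import groupby

def rle_encode(text, min_run=4):
    """Run-length encode: a run of at least `min_run` identical symbols
    becomes <symbol>#<count>; shorter runs are copied literally.
    (The '#' marker and the threshold of 4 are assumptions.)"""
    out = []
    for symbol, run in groupby(text):
        n = len(list(run))
        out.append(f"{symbol}#{n}" if n >= min_run else symbol * n)
    return "".join(out)

print(rle_encode("ABBCCDDDDDDDDDEEFGGGGG"))  # ABBCCD#9EEFG#5 (22 -> 14 characters)
```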


2: Huffman coding

Another algorithm whose performance is somewhat better than run-length encoding is the famous Huffman coding. The input to Huffman coding is the frequency distribution of the symbols to be encoded; a binary tree is then constructed as follows (a sketch of the construction appears after the steps):

1. Initially each symbol is considered as a separate binary tree.

2. The two trees with the lowest frequencies are chosen and combined into a single tree whose assigned frequency is the sum of the two given frequencies. The chosen trees form the two branches of the new tree.

3. The process is repeated until only a single tree remains. Then the two branches of every node are labelled 0 and 1 (0 on the left branch, but the order is not important).

4. The code for each symbol can be read by following the branches from the root to the symbol.
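A minimal sketch of this construction, assuming a Python dict of symbol probabilities as input. The function name and the tie-breaking order are illustrative: any valid Huffman code for a distribution has the same average codeword length, but individual bit patterns may differ from the table on the next slide depending on how ties are broken.

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code from a {symbol: probability} map by repeatedly
    merging the two lowest-frequency trees (steps 1-3 above), then reading
    each code off the branches from the root (step 4)."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}          # left branch: 0
        merged.update({s: "1" + c for s, c in right.items()})   # right branch: 1
        heapq.heappush(heap, (p1 + p2, next_id, merged))
        next_id += 1
    return heap[0][2]

print(huffman_codes({"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
                     "e": 0.3, "f": 0.2, "g": 0.1}))
```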


Huffman coding - Example

[Figure: Huffman tree for symbols a-g with probabilities 0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1; the lowest-frequency pairs are merged into nodes of weight 0.1, 0.2, 0.3, 0.4, 0.6 and finally 1.0, with the branches labelled 0 and 1.]

Symbol  Prob.  Codeword
a       0.05   0000
b       0.05   0001
c       0.1    001
d       0.2    01
e       0.3    10
f       0.2    110
g       0.1    111


Huffman coding - Exercise

Code the sequence (aeebcddegfced) using the codeword table above, and evaluate the entropy and the compression ratio.

Solution: 0000 10 10 0001 001 01 01 10 111 110 001 10 01

Average original symbol length = 3 bits (7 symbols need 3 bits each)
Average compressed symbol length = 34/13 ≈ 2.615 bits
Compression ratio C = 3 / (34/13) ≈ 1.15
H(X) = 2.5464 bits
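These numbers can be checked with a short script; the codeword table and probabilities are copied from the example slide, and the printed values are approximate:

```python
import math

codewords = {"a": "0000", "b": "0001", "c": "001", "d": "01",
             "e": "10", "f": "110", "g": "111"}
probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}

msg = "aeebcddegfced"
code = " ".join(codewords[s] for s in msg)
bits = sum(len(codewords[s]) for s in msg)

print(code)                   # 0000 10 10 0001 001 01 01 10 111 110 001 10 01
print(bits, "/", len(msg))    # 34 / 13, i.e. about 2.615 bits/symbol compressed
print(3 * len(msg) / bits)    # compression ratio C ≈ 1.147
print(sum(p * math.log2(1 / p) for p in probs.values()))  # H(X) ≈ 2.5464 bits
```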

Huffman coding - Notes

1. In Huffman coding, if at any time there is more than one way to choose a smallest pair of probabilities, any such pair may be chosen.

2. A Huffman code is a variable-length code, with the more frequent symbols being assigned shorter codewords.

3. Huffman codes are good for data messages.


Lempel-Ziv Compression (LZ77):

LZ77 keeps track of the last n bytes of data seen; when a phrase that has already been seen is encountered, it outputs a pair of values corresponding to the position of the phrase in the previously seen buffer of data and the length of the phrase. The code consists of a set of triples <a, b, c>, where:

a = relative position of the longest match in the dictionary
b = length of the longest match
c = next character in the buffer beyond the longest match

Triples beginning with 0 identify new characters not previously seen.

Example: encoding the string Peter_Piper_pick

No. of triple  Output code  Decoded text so far
1              (0,0,P)      P
2              (0,0,e)      Pe
3              (0,0,t)      Pet
4              (2,1,r)      Peter
5              (0,0,_)      Peter_
6              (6,1,i)      Peter_Pi
7              (8,2,r)      Peter_Piper
8              (6,3,c)      Peter_Piper_pic
9              (0,0,k)      Peter_Piper_pick
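A minimal sketch of a greedy LZ77-style encoder with an unbounded search window; run on the lower-case string peter_piper_pick (the sketch matches characters case-sensitively) it emits the nine triples listed above:

```python
def lz77_encode(text):
    """Greedy LZ77-style encoder with an unbounded search window.
    Emits triples (offset, length, next_char); offset 0 marks a character
    that has not been seen before."""
    triples, pos = [], 0
    while pos < len(text):
        best_off, best_len = 0, 0
        for start in range(pos):  # search every earlier position
            length = 0
            while (pos + length < len(text) - 1
                   and text[start + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - start, length
        triples.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1
    return triples

print(lz77_encode("peter_piper_pick"))
# [(0,0,'p'), (0,0,'e'), (0,0,'t'), (2,1,'r'), (0,0,'_'),
#  (6,1,'i'), (8,2,'r'), (6,3,'c'), (0,0,'k')]
```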


Arithmetic Coding:

Arithmetic coding is based on the concept of interval subdividing. In arithmetic coding a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval: the coder uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble.

Arithmetic Coding: Description

In the following discussion we will use M as the size of the alphabet of the data source, N[x] as symbol x's probability, and Q[x] as symbol x's cumulative probability (Q[x] = N[1] + N[2] + … + N[x], with Q[0] = 0).

Assuming we know the probability of each symbol, we can allocate to each symbol an interval whose width is proportional to its probability, such that the intervals do not overlap. This can be done by using the cumulative probabilities as the two ends of each interval: the two ends of the interval for symbol x are Q[x-1] and Q[x]. Symbol x is said to own the range [Q[x-1], Q[x]).


Arithmetic Coding: Encoder example

Symbol x  Probability N[x]  [Q[x-1], Q[x])
A         0.4               [0.0, 0.4)
B         0.3               [0.4, 0.7)
C         0.2               [0.7, 0.9)
D         0.1               [0.9, 1.0)

String: BCAB. Starting from [0, 1), the interval narrows as each symbol is coded:

After B: [0.4, 0.7)
After C: [0.61, 0.67)
After A: [0.61, 0.634)
After B: [0.6196, 0.6268)

Code sent: 0.6196
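A minimal sketch of the interval narrowing shown above; the intervals dictionary encodes the [Q[x-1], Q[x]) table, and the function name is illustrative:

```python
# [Q[x-1], Q[x]) intervals from the table above.
intervals = {"A": (0.0, 0.4), "B": (0.4, 0.7), "C": (0.7, 0.9), "D": (0.9, 1.0)}

def arithmetic_encode(message):
    """Successively narrow [low, high); any value inside the final
    interval (together with the message length) identifies the message."""
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        q_lo, q_hi = intervals[sym]
        low, high = low + width * q_lo, low + width * q_hi
        print(sym, round(low, 6), round(high, 6))
    return low

print(round(arithmetic_encode("BCAB"), 4))
# B 0.4 0.7
# C 0.61 0.67
# A 0.61 0.634
# B 0.6196 0.6268
# 0.6196  <- code sent
```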

