
Huffman analysis

Date post: 07-Jan-2017
Upload: abubakar-sultan
Data Structure and Algorithms CS-707
Transcript
Page 1: Huffman analysis

Data Structure and Algorithms

CS-707

Page 2: Huffman analysis

An optimization problem is one in which you want to find, not just a solution, but the best solution

A “greedy algorithm” sometimes works well for optimization problems

A greedy algorithm works in phases. At each phase:
◦ You take the best you can get right now, without regard for future consequences
◦ You hope that by choosing a local optimum at each step, you will end up at a global optimum

Optimization problems

Page 3: Huffman analysis

Suppose you want to count out a certain amount of money, using the fewest possible bills and coins

A greedy algorithm would work as follows: at each step, take the largest possible bill or coin that does not overshoot
◦ Example: To make $6.39, you can choose:
  a $5 bill
  a $1 bill, to make $6
  a 25¢ coin, to make $6.25
  a 10¢ coin, to make $6.35
  four 1¢ coins, to make $6.39

For US money, the greedy algorithm always gives the optimum solution
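The step-by-step choice above can be sketched in Python. This is a sketch of the slide's procedure, not part of the original slides; the function name `greedy_change` and the choice to work in cents (to avoid floating point) are assumptions made here.

```python
def greedy_change(amount_cents, denominations):
    """Greedy change-making: repeatedly take the largest denomination
    that does not overshoot the remaining amount."""
    result = []
    for d in sorted(denominations, reverse=True):
        while amount_cents >= d:
            result.append(d)
            amount_cents -= d
    return result

# US denominations in cents: bills ($100..$1) and coins (25c, 10c, 5c, 1c)
us = [10000, 5000, 2000, 1000, 500, 100, 25, 10, 5, 1]
print(greedy_change(639, us))  # [500, 100, 25, 10, 1, 1, 1, 1] -> $6.39
```

The output matches the slide's hand trace: a $5 bill, a $1 bill, one quarter, one dime, and four pennies.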


Example: Counting money

Page 4: Huffman analysis

In some (fictional) monetary system, “krons” come in 1 kron, 7 kron, and 10 kron coins

Using a greedy algorithm to count out 15 krons, you would get:
◦ A 10 kron piece
◦ Five 1 kron pieces, for a total of 15 krons
◦ This requires six coins

A better solution would be to use two 7 kron pieces and one 1 kron piece
◦ This only requires three coins

The greedy algorithm results in a solution, but not in an optimal solution
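The failure can be demonstrated by comparing the greedy coin count against a true optimum. This is a sketch added here, not part of the slides; the dynamic-programming helper `optimal_coins` is one standard way to compute the optimum.

```python
def greedy_coins(amount, denoms):
    """Number of coins used by the greedy largest-first strategy."""
    count = 0
    for d in sorted(denoms, reverse=True):
        count += amount // d
        amount %= d
    return count

def optimal_coins(amount, denoms):
    """Fewest coins via dynamic programming: best[i] = fewest coins for i krons."""
    INF = float('inf')
    best = [0] + [INF] * amount
    for i in range(1, amount + 1):
        for d in denoms:
            if d <= i and best[i - d] + 1 < best[i]:
                best[i] = best[i - d] + 1
    return best[amount]

krons = [1, 7, 10]
print(greedy_coins(15, krons))   # 6  (one 10 kron + five 1 kron)
print(optimal_coins(15, krons))  # 3  (two 7 kron + one 1 kron)
```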


A failure of the greedy algorithm

Page 5: Huffman analysis

In general, greedy algorithms have five components:

◦ A candidate set, from which a solution is created
◦ A selection function, which chooses the best candidate to be added to the solution
◦ A feasibility function, which is used to determine whether a candidate can contribute to a solution
◦ An objective function, which assigns a value to a solution or a partial solution, and
◦ A solution function, which indicates when we have discovered a complete solution
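The five components can be wired into a generic skeleton. This is an illustrative sketch, not from the slides; the parameter names `select`, `feasible`, and `is_solution` are choices made here, and the objective function is folded into the selection and solution checks.

```python
def greedy(candidates, select, feasible, is_solution):
    """Generic greedy loop over a candidate set."""
    solution = []
    pool = list(candidates)
    while pool and not is_solution(solution):
        best = select(pool)            # selection function
        pool.remove(best)
        if feasible(solution, best):   # feasibility function
            solution.append(best)      # otherwise the candidate is discarded
    return solution

# Demo: make $6.39 (639 cents) greedily from a pool of available coins/bills.
target = 639
pool = [500] * 2 + [100] * 10 + [25] * 10 + [10] * 10 + [5] * 10 + [1] * 10
result = greedy(
    pool,
    select=max,                                  # take the largest piece
    feasible=lambda s, c: sum(s) + c <= target,  # must not overshoot
    is_solution=lambda s: sum(s) == target,      # objective reached
)
print(sorted(result, reverse=True))  # [500, 100, 25, 10, 1, 1, 1, 1]
```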


Components of Greedy Algo

Page 6: Huffman analysis

Huffman coding is a technique for compressing data. Huffman's greedy algorithm looks at the frequency of occurrence of each character and represents each character as a binary string in an optimal way.

Huffman Codes

Page 7: Huffman analysis

Suppose we have a file of 100,000 characters that we want to compress. The characters in the file occur with the following frequencies (in thousands):

Character:              a    b    c    d    e    f
Frequency (thousands):  45   13   12   16   9    5

Example

Page 8: Huffman analysis

Consider the problem of designing a "binary character code" in which each character is represented by a unique binary string.

This method requires 300,000 bits to code the entire file. How do we get 300,000?

Cont….

Page 9: Huffman analysis

A fixed-length code needs 3 bits to represent each of the six (6) characters.

The total number of characters is 45,000 + 13,000 + 12,000 + 16,000 + 9,000 + 5,000 = 100,000.

Since each character is assigned a 3-bit codeword, the file requires 3 * 100,000 = 300,000 bits.
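The arithmetic can be checked directly (a small sketch using the frequencies from the example):

```python
# Frequencies from the example, as actual character counts.
freq = {'a': 45_000, 'b': 13_000, 'c': 12_000, 'd': 16_000, 'e': 9_000, 'f': 5_000}

total_chars = sum(freq.values())
print(total_chars)      # 100000 characters in the file
print(3 * total_chars)  # 300000 bits with a 3-bit fixed-length code
```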

Fixed Length Code

Page 10: Huffman analysis

The fixed-length code requires 300,000 bits while the variable-length code requires 224,000 bits.

=> Saving of approximately 25%.

Conclusion

Page 11: Huffman analysis

A prefix code is one in which no codeword is a prefix of any other codeword. The reason prefix codes are desirable is that they simplify encoding (compression) and decoding.

Better….??? A variable-length code can do better than a fixed-length code by giving frequent characters short codewords and infrequent characters long codewords.

Prefix Codes

Page 12: Huffman analysis

Character 'a' occurs 45,000 times, and each 'a' is assigned a 1-bit codeword: 1 * 45,000 = 45,000 bits.

Prefix Codes Table

Page 13: Huffman analysis

Characters b, c, d occur 13,000 + 12,000 + 16,000 = 41,000 times, and each is assigned a 3-bit codeword: 3 * 41,000 = 123,000 bits.

Characters e, f occur 9,000 + 5,000 = 14,000 times, and each is assigned a 4-bit codeword: 4 * 14,000 = 56,000 bits.

This implies that the total is 45,000 + 123,000 + 56,000 = 224,000 bits.
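The per-character totals above can be recomputed from the frequency table and the codeword lengths (a sketch; the lengths 1/3/3/3/4/4 are the ones used on these slides):

```python
# Frequencies (actual counts) and variable-length codeword lengths from the slides.
freq = {'a': 45_000, 'b': 13_000, 'c': 12_000, 'd': 16_000, 'e': 9_000, 'f': 5_000}
length = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}

total_bits = sum(freq[ch] * length[ch] for ch in freq)
print(total_bits)  # 224000
```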

Cont……

Page 14: Huffman analysis

To encode, concatenate the codewords representing each character of the file. Example: from the variable-length code table, we code the 3-character file abc as:
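Encoding is plain concatenation of codewords. In the sketch below, the codewords a=0, b=101, e=1101 are confirmed by the decoding example on a later slide; c=100, d=111, f=1100 are assumptions filled in from the standard CLRS table, since the transcript omits the codeword column.

```python
# Codeword table: a/b/e from the slides' decoding example; c/d/f assumed (CLRS).
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def encode(text):
    """Concatenate the codeword of each character."""
    return ''.join(code[ch] for ch in text)

print(encode('abc'))  # '0101100'
```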

Encoding

Page 15: Huffman analysis

Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous.

To decode (translate back to the original characters), identify the initial codeword, translate it back to its character, remove it from the encoded file, and repeat.

For example, with the variable-length codeword table, the string 001011101 parses uniquely as 0.0.101.1101, which decodes to aabe.
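The repeated-parsing step can be sketched as follows. As with the encoder, a=0, b=101, e=1101 come from the slides' example, while c=100, d=111, f=1100 are assumed CLRS values; the prefix property guarantees the first codeword match is the only possible one.

```python
# Codeword table: a/b/e from the slides' decoding example; c/d/f assumed (CLRS).
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {bits: ch for ch, bits in code.items()}

def decode(bits):
    """Repeatedly strip the unique leading codeword off the bit string."""
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in decode_table:   # prefix property: first hit is unambiguous
            out.append(decode_table[buf])
            buf = ''
    return ''.join(out)

print(decode('001011101'))  # 'aabe', parsed as 0.0.101.1101
```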

Decoding

Page 16: Huffman analysis

The decoding process can be represented by a binary tree whose leaves are characters.

We interpret the binary codeword for a character as the path from the root to that character, where 0 means "go to the left child" and 1 means "go to the right child".

Note that an optimal code for a file is always represented by a full binary tree, one in which every non-leaf node has two children.

Decoding

Page 17: Huffman analysis

A binary tree that is not full cannot correspond to an optimal prefix code.

Proof: Let T be a binary tree corresponding to a prefix code such that T is not full. Then there must exist an internal node, say x, that has only one child, y. Construct another binary tree, T`, which has the same leaves as T and in which every leaf has the same depth as in T, except for the leaves in the subtree rooted at y; these leaves have strictly smaller depth in T`, which implies T cannot correspond to an optimal prefix code.

To obtain T`, simply merge x and y into a single node z, where z is a child of the parent of x (if a parent exists) and z is the parent of any children of y. Then T` has the desired properties: it corresponds to a code on the same alphabet, and the leaves that were in the subtree rooted at y in T have depth in T` strictly less (by one) than their depth in T.

Theorem

Page 18: Huffman analysis

The fixed-length code is not optimal, since its binary tree is not full. The variable-length code is an optimal prefix code, because its tree is a full binary tree.

Cont….

Page 19: Huffman analysis

If C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves (one for each letter) and exactly |C| - 1 internal nodes.

Given a tree T corresponding to a prefix code, we can compute the number of bits required to encode a file.

For each character c in C, let f(c) be the frequency of c and let dT(c) denote the depth of c's leaf in T. Note that dT(c) is also the length of c's codeword. The number of bits required to encode the file is B(T) = Σ_{c in C} f(c) dT(c).

From now on, consider only full binary trees.

Page 20: Huffman analysis

B(T) = Σ_{c in C} f(c) dT(c), which we define as the cost of the tree T. For example, the cost of the above tree is

B(T) = 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4 = 224

Therefore, the cost of the tree corresponding to the optimal prefix code is 224 (in thousands of bits: 224 * 1,000 = 224,000 bits).

Cont…..

Page 21: Huffman analysis

A greedy algorithm that constructs an optimal prefix code is called a Huffman code. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs |C| - 1 "merging" operations to create the final tree.

Data structure used: min-priority queue Q, keyed on frequency f

HUFFMAN(C)
1  n = |C|
2  Q = C
3  for i = 1 to n - 1
4      do z = ALLOCATE-NODE()
5         x = left[z] = EXTRACT-MIN(Q)
6         y = right[z] = EXTRACT-MIN(Q)
7         f[z] = f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)
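The construction can be sketched with Python's `heapq` as the priority queue. This is an illustrative implementation, not from the slides; the tie-breaker counter is an addition needed so that equal frequencies never force a comparison of tree nodes.

```python
import heapq

def huffman(freq):
    """Build a Huffman code from a {char: frequency} map:
    n - 1 merges of the two lowest-frequency subtrees."""
    # Heap entries: (frequency, tie_breaker, tree); a tree is a char or (left, right).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)                # x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(heap)                # y = EXTRACT-MIN(Q)
        heapq.heappush(heap, (fx + fy, tie, (x, y)))  # f[z] = f[x] + f[y]; INSERT(Q, z)
        tie += 1
    root = heap[0][2]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + '0')  # 0 = go to the left child
            walk(node[1], prefix + '1')  # 1 = go to the right child
        else:
            codes[node] = prefix
    walk(root, '')
    return codes

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
codes = huffman(freq)
print(sum(freq[ch] * len(codes[ch]) for ch in freq))  # cost B(T) = 224
```

On the example frequencies this reproduces the optimal cost of 224 (thousand bits), with 'a' getting a 1-bit codeword.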

Constructing a Huffman code

Page 22: Huffman analysis

Q is implemented as a binary min-heap. Line 2 can be performed using BUILD-HEAP in O(n) time. The for loop executes n - 1 times, and each heap operation requires O(lg n) time, so the for loop contributes (n - 1) · O(lg n) = O(n lg n).

Thus the total running time of HUFFMAN on a set of n characters is O(n lg n).

Analysis

Page 23: Huffman analysis

Proof idea:

Step 1: Show that this problem satisfies the greedy-choice property; that is, if a greedy choice is made by Huffman's algorithm, an optimal solution remains possible.

Step 2: Show that this problem has the optimal-substructure property; that is, an optimal solution produced by Huffman's algorithm contains optimal solutions to subproblems.

Step 3: Conclude the correctness of Huffman's algorithm using steps 1 and 2.

Correctness of Huffman Code Algorithm

Page 24: Huffman analysis

Let C be an alphabet in which each character c has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.

Proof: Let T be an optimal tree, and let b and c be sibling leaves of maximum depth in T.

Without loss of generality, assume that f[x] ≤ f[y] and f[b] ≤ f[c]. Since f[x] and f[y] are the two lowest leaf frequencies, and f[b] and f[c] are arbitrary frequencies, we have f[x] ≤ f[b] and f[y] ≤ f[c]. Exchange the positions of x and b to get T`, and then exchange y and c to get T``. By the formula B(T) = Σ_{c in C} f(c) dT(c), the difference in cost between T and T` is

Greedy Choice Property

Page 25: Huffman analysis

B(T) - B(T`) = f[x]dT(x) + f[b]dT(b) - [f[x]dT`(x) + f[b]dT`(b)]
             = f[x]dT(x) + f[b]dT(b) - [f[x]dT(b) + f[b]dT(x)]   (x and b swap depths)
             = (f[b] - f[x]) (dT(b) - dT(x))
             = (non-negative)(non-negative)
             ≥ 0

Code

Page 26: Huffman analysis

◦ Scheduling problem using a greedy algorithm
◦ Minimum spanning tree
◦ Collecting coins

Tasks to do…

