
16.3 Huffman codes

• Huffman codes compress data very effectively
  – savings of 20% to 90% are typical, depending on the characteristics of the data being compressed
• We consider the data to be a sequence of characters
• Huffman's greedy algorithm uses a table giving how often each character occurs (i.e., its frequency) to build up an optimal way of representing each character as a binary string

26-Oct-16 MAT-72006 AADS, Fall 2016 383

• We have a 100,000-character data file that we wish to store compactly
• We observe that the characters in the file occur with the following frequencies (in thousands):

  a: 45   b: 13   c: 12   d: 16   e: 9   f: 5

• That is, only 6 different characters appear, and the character a occurs 45,000 times
• Here, we consider the problem of designing a binary character code (or code for short) in which each character is represented by a unique binary string, which we call a codeword


• Using a fixed-length code, we need 3 bits to represent 6 characters: a = 000, b = 001, …, f = 101
• We then need 300,000 bits to code the entire file
• A variable-length code gives frequent characters short codewords and infrequent characters long codewords
• Here the 1-bit string 0 represents a, and the 4-bit string 1100 represents f
• This code requires (45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4) · 1,000 = 224,000 bits (a savings of about 25%)

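The two totals above can be checked with a short script. Only a = 0, b = 101, and f = 1100 (and a = 000, b = 001, f = 101 for the fixed-length code) are spelled out on these slides, so the remaining codewords below are assumptions taken from the standard CLRS example:

```python
# Character frequencies (in thousands) from the running example.
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}

# Fixed-length code (3 bits per character); middle entries are assumed.
fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}

# Variable-length prefix code; c, d, e are assumed CLRS values.
variable = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def total_bits(code):
    # Sum over the alphabet of (occurrences) x (codeword length).
    return sum(freq[ch] * 1000 * len(code[ch]) for ch in freq)

print(total_bits(fixed))     # 300000
print(total_bits(variable))  # 224000
```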

Prefix codes

• We consider only codes in which no codeword is also a prefix of some other codeword
• A prefix code can always achieve the optimal data compression among any character code, and so we can restrict our attention to prefix codes
• Encoding is always simple for any binary character code; we just concatenate the codewords representing each character of the file
• E.g., with the variable-length prefix code, we code the 3-character file abc as 0·101·100 = 0101100


• Prefix codes simplify decoding
  – Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous
• We can simply identify the initial codeword, translate it back to the original character, and repeat the decoding process on the remainder of the encoded file
• In our example, the string 001011101 parses uniquely as 0·0·101·1101, which decodes to aabe

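Peeling off the initial codeword repeatedly translates into a few lines of Python; the codewords for c, d, and e are again assumptions from the standard CLRS example:

```python
def decode(bits, code):
    """Decode a bit string under a prefix code given as {char: codeword}."""
    inverse = {w: ch for ch, w in code.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        # Prefix property: the first codeword match is unambiguous.
        if cur in inverse:
            out.append(inverse[cur])
            cur = ''
    return ''.join(out)

code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}
print(decode('001011101', code))  # aabe
print(decode('0101100', code))    # abc
```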

• The decoding process needs a convenient representation for the prefix code so that we can easily pick off the initial codeword
• A binary tree whose leaves are the given characters provides one such representation
• We interpret the binary codeword for a character as the simple path from the root to that character, where 0 means "go to the left child" and 1 means "go to the right child"
• Note that the trees are not BSTs: the leaves need not appear in sorted order, and internal nodes do not contain character keys



The trees corresponding to the fixed-length code a = 000, …, f = 101 and the optimal prefix code a = 0, b = 101, …, f = 1100

• An optimal code for a file is always represented by a full binary tree, in which every nonleaf node has two children
• The fixed-length code in our example is not optimal, since its tree is not a full binary tree: it contains codewords beginning 10…, but none beginning 11…
• Since we can now restrict our attention to full binary trees, we can say that if C is the alphabet from which the characters are drawn and all character frequencies are positive, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| − 1 internal nodes


• Given a tree T corresponding to a prefix code, we can easily compute the number of bits required to encode a file
• For each character c in the alphabet C, let the attribute c.freq denote the frequency of c and let d_T(c) denote the depth of c's leaf in T
• d_T(c) is also the length of the codeword for character c
• The number of bits required to encode a file is thus

  B(T) = Σ_{c ∈ C} c.freq · d_T(c)

  which we define as the cost of the tree T


Constructing a Huffman code

• Let C be a set of n characters, where each character c ∈ C is an object with an attribute c.freq
• The algorithm builds the tree T corresponding to the optimal code bottom-up
• It begins with |C| leaves and performs |C| − 1 "merging" operations to create the final tree
• We use a min-priority queue Q, keyed on the freq attribute, to identify the two least-frequent objects to merge
• The result of a merge is a new object whose frequency is the sum of the frequencies of the two objects merged


HUFFMAN(C)
1. n = |C|
2. Q = C
3. for i = 1 to n − 1
4.     allocate a new node z
5.     z.left = x = EXTRACT-MIN(Q)
6.     z.right = y = EXTRACT-MIN(Q)
7.     z.freq = x.freq + y.freq
8.     INSERT(Q, z)
9. return EXTRACT-MIN(Q)  // return the root of the tree

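A minimal Python sketch of the procedure, with heapq standing in for the min-priority queue Q and nested tuples for tree nodes (the running counter only breaks frequency ties so heap entries stay comparable):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Return {char: codeword} for an optimal prefix code (assumes len(freqs) >= 2)."""
    tie = count()
    Q = [(f, next(tie), ch) for ch, f in freqs.items()]
    heapq.heapify(Q)                       # Q = C
    for _ in range(len(freqs) - 1):        # n - 1 merging operations
        fx, _, x = heapq.heappop(Q)        # x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(Q)        # y = EXTRACT-MIN(Q)
        heapq.heappush(Q, (fx + fy, next(tie), (x, y)))  # z.freq = x.freq + y.freq
    root = Q[0][2]

    code = {}
    def walk(node, word):
        if isinstance(node, tuple):        # internal node: 0 = left, 1 = right
            walk(node[0], word + '0')
            walk(node[1], word + '1')
        else:
            code[node] = word
    walk(root, '')
    return code

freqs = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
code = huffman(freqs)
print(sum(freqs[ch] * len(w) for ch, w in code.items()))  # 224 (thousands of bits)
```

Tie-breaking may produce codewords that differ from the slides' example, but every optimal tree has the same cost B(T), here 224,000 bits.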


• To analyze the running time of HUFFMAN, let Q be implemented as a binary min-heap
• For a set C of n characters, we can initialize Q (line 2) in O(n) time using the BUILD-MIN-HEAP procedure
• The for loop executes exactly n − 1 times, and since each heap operation requires time O(lg n), the loop contributes O(n lg n) to the running time
• Thus, the total running time of HUFFMAN on a set of n characters is O(n lg n)
• We can reduce the running time to O(n lg lg n) by replacing the binary min-heap with a van Emde Boas tree


Correctness of Huffman’s algorithm

• We show that the problem of determining an optimal prefix code exhibits the greedy-choice and optimal-substructure properties

Lemma 16.2 Let C be an alphabet in which each character c has frequency c.freq. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.


Lemma 16.3 Let C, c.freq, x, and y be as in Lemma 16.2. Let C′ = (C − {x, y}) ∪ {z}. Define freq for C′ as for C, except that z.freq = x.freq + y.freq. Let T′ be any tree representing an optimal prefix code for the alphabet C′. Then the tree T, obtained from T′ by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix code for the alphabet C.

Theorem 16.4 Procedure HUFFMAN produces an optimal prefix code.


17 Amortized Analysis

• We average the time required to perform a sequence of data-structure operations over all the operations performed
• Thus, we can show that the average cost of an operation is small, even if a single operation within the sequence might be expensive
• Amortized analysis differs from average-case analysis in that probability is not involved
  – Amortized analysis guarantees the average performance of each operation in the worst case


17.1 Aggregate analysis

• We show that for all n, a sequence of n operations takes worst-case time T(n) in total
• In the worst case, the average cost, or amortized cost, per operation is therefore T(n)/n
• This amortized cost applies to each operation, even when there are several types of operations in the sequence
• The other two methods we shall study may assign different amortized costs to different types of operations


Stack operations

• The fundamental stack operations PUSH(S, x) and POP(S) each take O(1) time
• Let us consider the cost of each to be 1
• The total cost of a sequence of n PUSH and POP operations is therefore n, and the actual running time for n operations is therefore Θ(n)
• Now we add the stack operation MULTIPOP(S, k), which removes the k top objects of stack S, popping the entire stack if it contains fewer than k objects


• The operation STACK-EMPTY(S) returns TRUE if there are no objects currently on the stack, and FALSE otherwise

MULTIPOP(S, k)
1. while not STACK-EMPTY(S) and k > 0
2.     POP(S)
3.     k = k − 1

• The total cost of MULTIPOP is min(s, k), where s is the stack size, and the actual running time is a linear function of this cost


• Let us analyze a sequence of n PUSH, POP, and MULTIPOP operations on an initially empty stack
• The worst-case cost of a MULTIPOP operation is O(n), since the stack size is at most n
• The worst-case time of any stack operation is therefore O(n), and hence a sequence of n operations costs O(n²)
• This analysis is correct, but the O(n²) result is not tight
• Using aggregate analysis, we can obtain a better upper bound that considers the entire sequence of n operations


• We can pop each object from the stack at most once for each time we have pushed it onto the stack
• The number of times that POP can be called on a nonempty stack, including calls within MULTIPOP, is at most the number of PUSH operations, which is at most n
• Any sequence of n PUSH, POP, and MULTIPOP operations takes a total of O(n) time
• The average cost of an operation is O(n)/n = O(1)

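The aggregate bound can be checked empirically with a stack that charges one unit of actual cost per individual push or pop (an illustrative harness, not from the slides):

```python
class CountingStack:
    """Stack that tracks the actual cost: 1 unit per PUSH or POP."""
    def __init__(self):
        self.items = []
        self.cost = 0

    def push(self, x):
        self.items.append(x)
        self.cost += 1

    def pop(self):
        self.cost += 1
        return self.items.pop()

    def multipop(self, k):
        # Pops min(k, s) objects; each pop is charged individually.
        while self.items and k > 0:
            self.pop()
            k -= 1

s = CountingStack()
for i in range(100):
    s.push(i)
s.multipop(150)   # only the 100 objects present get popped
print(s.cost)     # 200: at most 2 units per pushed object over the whole sequence
```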

Incrementing a binary counter

• Consider the problem of implementing a k-bit binary counter that counts upward from 0
• We use an array A[0..k − 1] of bits, where A.length = k, as the counter
• A binary number x that is stored in the counter has its lowest-order bit in A[0] and its highest-order bit in A[k − 1], so that

  x = Σ_{i=0}^{k−1} A[i] · 2^i


• Initially, x = 0: A[i] = 0 for i = 0, 1, …, k − 1
• To add 1 (modulo 2^k) to the value in the counter, we use the following procedure

INCREMENT(A)
1. i = 0
2. while i < A.length and A[i] == 1
3.     A[i] = 0
4.     i = i + 1
5. if i < A.length
6.     A[i] = 1

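INCREMENT translates directly into Python; returning the number of bits flipped makes the cost of each call explicit:

```python
def increment(A):
    """Add 1 (mod 2^k) to counter A, A[0] = lowest-order bit; return bits flipped."""
    i = 0
    while i < len(A) and A[i] == 1:
        A[i] = 0              # lines 3-4: clear 1s, carrying into the next position
        i += 1
    flips = i
    if i < len(A):
        A[i] = 1              # line 6: set the lowest 0 bit
        flips += 1
    return flips

A = [0] * 8
total = sum(increment(A) for _ in range(16))
print(total)  # 31 = 16 + 8 + 4 + 2 + 1 flips, well under 2 * 16
```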


• At the start of each iteration of the while loop (lines 2–4), we wish to add a 1 into position i
• If A[i] = 1, then adding 1 flips the bit to 0 in position i and yields a carry of 1, to be added into position i + 1 on the next iteration of the loop
• Otherwise, the loop ends, and then, if i < k, we know that A[i] = 0, so that line 6 adds a 1 into position i, flipping the 0 to a 1
• The cost of each INCREMENT operation is linear in the number of bits flipped


• A cursory analysis yields a bound that is correct but not tight
• A single execution of INCREMENT takes time Θ(k) in the worst case, when array A contains all 1s
• Thus, a sequence of n INCREMENT operations on an initially zero counter takes time O(nk)
• We can tighten the analysis to a worst-case cost of O(n) by observing that not all bits flip each time INCREMENT is called
• A[0] does flip each time INCREMENT is called


• The next bit up, A[1], flips only every other time
  – a sequence of n INCREMENT operations on an initially zero counter causes A[1] to flip ⌊n/2⌋ times
• Similarly, bit A[2] flips only every fourth time, or ⌊n/4⌋ times in a sequence of n INCREMENT operations
• In general, bit A[i] flips ⌊n/2^i⌋ times in a sequence of n INCREMENT operations on an initially zero counter
• For i ≥ k, bit A[i] does not exist, and so it cannot flip


• The total number of flips in the sequence is thus

  Σ_{i=0}^{k−1} ⌊n/2^i⌋ < n · Σ_{i=0}^{∞} 1/2^i = 2n

  by the infinite decreasing geometric series Σ_{i=0}^{∞} x^i = 1/(1 − x), where |x| < 1

• The worst-case time for a sequence of n INCREMENT operations on an initially zero counter is therefore O(n)
• The average cost of each operation, and therefore the amortized cost per operation, is O(n)/n = O(1)


17.2 The accounting method

• We assign differing charges to different operations, some charged more or less than they actually cost
• When an operation's amortized cost exceeds its actual cost, we assign the difference to specific objects in the data structure as credit
• Credit can help pay for later operations
• We can view the amortized cost of an operation as being split between its actual cost and credit that is either deposited or used up
• The amortized costs of different operations may differ

• We want to show that in the worst case the average cost per operation is small by analyzing with amortized costs
  – We must ensure that the total amortized cost of a sequence of operations provides an upper bound on the total actual cost of the sequence
  – Moreover, this relationship must hold for all sequences of operations
• Let c_i be the actual cost of the ith operation and ĉ_i its amortized cost; we require

  Σ_{i=1}^{n} ĉ_i ≥ Σ_{i=1}^{n} c_i

  for all sequences of n operations


• The total credit stored in the data structure is the difference between the total amortized cost and the total actual cost, Σ ĉ_i − Σ c_i
• The total credit associated with the data structure must be nonnegative at all times
• If we ever were to allow the total credit to become negative, then the total amortized cost would not be an upper bound on the total actual cost
• We must take care that the total credit in the data structure never becomes negative


Stack operations

• Recall the actual costs of the operations: PUSH 1, POP 1, and MULTIPOP min(s, k), where k is the argument of MULTIPOP and s is the stack size
• Let us assign the following amortized costs: PUSH 2, POP 0, and MULTIPOP 0
• The amortized cost of MULTIPOP is a constant, whereas the actual cost is variable
• In general, the amortized costs of the operations under consideration may differ from each other, and they may even differ asymptotically


• We can pay for any sequence of stack operations by charging the amortized costs
• We start with an empty stack
• When we push an object on the stack, we use 1 unit of the 2-unit amortized cost to pay the actual cost of the push and are left with a credit of 1 unit
• The stored unit serves as prepayment for the cost of popping the object from the stack
• When we execute a POP operation, we charge the operation nothing and pay its actual cost using the credit stored in the stack


• We can also charge MULTIPOP operations nothing
• To pop the 1st element, we take 1 unit of credit from the stack and use it to pay the actual cost of a POP operation
• To pop a 2nd element, we again have a unit of credit to pay for the POP operation, and so on
• Thus, we have always charged enough up front to pay for MULTIPOP operations
• For any sequence of n operations, the total amortized cost is an upper bound on the total actual cost
• Since the total amortized cost is O(n), so is the total actual cost


Incrementing a binary counter

• The running time is proportional to the number of bits flipped, which we use as our cost
• Let us charge an amortized cost of 2 units to set a bit to 1
• We use 1 unit to pay for the setting of the bit, and we place the other unit on the bit as credit to be used later when we flip the bit back to 0
• At any point in time, every 1 in the counter has a unit of credit on it, and thus we can charge nothing to reset a bit to 0


The amortized cost of INCREMENT:
• The cost of resetting the bits within the while loop is paid for by the units of credit on the bits that are reset
• The INCREMENT procedure sets at most one bit (line 6), and therefore the amortized cost of an INCREMENT operation is at most 2 units
• The number of 1s in the counter never becomes negative, and thus the amount of credit stays nonnegative at all times
• For n INCREMENT operations, the total amortized cost is O(n), which bounds the total actual cost

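The accounting argument can be simulated: charging 2 units per INCREMENT, the stored credit (total charged minus total actual cost) should equal the number of 1-bits at every step. A sketch, reusing the INCREMENT procedure from above:

```python
def increment(A):
    """INCREMENT on a bit list; returns the actual cost (bits flipped)."""
    i = 0
    while i < len(A) and A[i] == 1:
        A[i] = 0
        i += 1
    cost = i
    if i < len(A):
        A[i] = 1
        cost += 1
    return cost

A = [0] * 10
actual = amortized = 0
for _ in range(300):
    actual += increment(A)
    amortized += 2                        # accounting charge: 2 units per call
    # Invariant: credit = charged - actual = number of 1s, never negative.
    assert amortized - actual == sum(A)
print(actual, amortized)  # 596 600
```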

17.3 The potential method

• We represent the prepaid work as "potential energy," or just "potential," which can be released to pay for future operations
• We associate the potential with the data structure as a whole rather than with specific objects
• We will perform n operations, starting with an initial data structure D_0
• Let c_i be the actual cost of the ith operation and D_i the data structure that results after applying the ith operation to data structure D_{i−1}

• A potential function Φ maps each data structure D_i to a real number Φ(D_i), the potential of D_i
• The amortized cost ĉ_i of the ith operation with respect to the potential function Φ is defined by

  ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1})

• The amortized cost is thus the actual cost plus the change in potential due to the operation
• The total amortized cost of the n operations is

  Σ_{i=1}^{n} ĉ_i = Σ_{i=1}^{n} (c_i + Φ(D_i) − Φ(D_{i−1})) = Σ_{i=1}^{n} c_i + Φ(D_n) − Φ(D_0)

• The 2nd equality follows because the Φ terms telescope


• If we can define a potential function Φ so that Φ(D_n) ≥ Φ(D_0), then the total amortized cost Σ ĉ_i gives an upper bound on the total actual cost Σ c_i
• In practice, we do not always know how many operations might be performed
• Therefore, if we require that Φ(D_i) ≥ Φ(D_0) for all i, then we guarantee, as in the accounting method, that we pay in advance
• We usually just define Φ(D_0) to be 0 and then show that Φ(D_i) ≥ 0 for all i


• If the potential difference Φ(D_i) − Φ(D_{i−1}) of the ith operation is …
  – positive, then the amortized cost ĉ_i represents an overcharge to the ith operation, and the potential of the data structure increases
  – negative, then the amortized cost represents an undercharge to the ith operation, and the decrease in the potential pays for the actual cost of the operation
• Different potential functions may yield different amortized costs yet still be upper bounds on the actual costs


Stack operations

• We define the potential function Φ on a stack to be the number of objects in the stack
• For the empty stack D_0, we have Φ(D_0) = 0
• Since the number of objects in the stack is never negative, the stack D_i that results after the ith operation has nonnegative potential, and thus

  Φ(D_i) ≥ 0 = Φ(D_0)

• The total amortized cost of n operations with respect to Φ therefore represents an upper bound on the actual cost


• If the ith operation on a stack containing s objects is a PUSH operation, then the potential difference is

  Φ(D_i) − Φ(D_{i−1}) = (s + 1) − s = 1

• The amortized cost of this PUSH operation is

  ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1}) = 1 + 1 = 2

• Suppose that the ith operation on the stack is MULTIPOP(S, k), which causes k′ = min(k, s) objects to be popped off the stack
• The actual cost of the operation is k′, and the potential difference is

  Φ(D_i) − Φ(D_{i−1}) = −k′


• Thus, the amortized cost of the MULTIPOP operation is

  ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1}) = k′ − k′ = 0

• Similarly, the amortized cost of an ordinary POP operation is 0
• The amortized cost of each of the three operations is O(1), and thus the total amortized cost of a sequence of n operations is O(n)
• Since Φ(D_i) ≥ Φ(D_0), the total amortized cost of n operations is an upper bound on the total actual cost
• The worst-case cost of n operations is therefore O(n)


Incrementing a binary counter

• We define the potential of the counter after the ith INCREMENT operation to be b_i, the number of 1s in the counter after the ith operation
• Suppose that the ith INCREMENT operation resets t_i bits
• The actual cost of the operation is therefore at most t_i + 1, since in addition to resetting t_i bits, it sets at most one bit to 1
• If b_i = 0, then the ith operation reset all k bits


• In this situation, b_{i−1} = t_i = k
• If b_i > 0, then b_i = b_{i−1} − t_i + 1
• In either case, b_i ≤ b_{i−1} − t_i + 1, and the potential difference is

  Φ(D_i) − Φ(D_{i−1}) ≤ (b_{i−1} − t_i + 1) − b_{i−1} = 1 − t_i

• The amortized cost is therefore

  ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1}) ≤ (t_i + 1) + (1 − t_i) = 2

• If the counter starts at zero, then Φ(D_0) = 0
• Since Φ(D_i) ≥ 0 for all i, the total amortized cost of a sequence of n INCREMENT operations is an upper bound on the total actual cost, and so the worst-case cost is O(n)

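The per-operation bound ĉ_i ≤ 2 can likewise be verified with Φ = number of 1s in the counter (another simulation sketch, reusing INCREMENT):

```python
def increment(A):
    """INCREMENT on a bit list; returns the actual cost c_i (bits flipped)."""
    i = 0
    while i < len(A) and A[i] == 1:
        A[i] = 0
        i += 1
    cost = i
    if i < len(A):
        A[i] = 1
        cost += 1
    return cost

A = [0] * 8
phi_prev = 0                     # Phi(D_0) = 0: the counter starts at zero
for _ in range(200):
    c = increment(A)             # actual cost c_i
    phi = sum(A)                 # Phi(D_i) = number of 1s in the counter
    assert c + phi - phi_prev <= 2   # amortized cost: c_i + change in potential
    phi_prev = phi
print(phi_prev)  # 3: the number of 1s in 200 = 11001000 in binary
```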

• The potential method gives us a way to analyze the counter even when it does not start at zero
• The counter starts with b_0 1s, and after n INCREMENT operations it has b_n 1s, where 0 ≤ b_0, b_n ≤ k
• We can rewrite the total actual cost in terms of the amortized cost as

  Σ_{i=1}^{n} c_i = Σ_{i=1}^{n} ĉ_i − Φ(D_n) + Φ(D_0)

• We have ĉ_i ≤ 2 for all 1 ≤ i ≤ n


• Since Φ(D_0) = b_0 and Φ(D_n) = b_n, the total actual cost of n INCREMENT operations is at most 2n − b_n + b_0
• Note that since b_0 ≤ k, as long as k = O(n) the total actual cost is O(n)
• I.e., if we execute at least n = Ω(k) INCREMENT operations, the total actual cost is O(n), no matter what initial value the counter contains


