Trade-offs in fixed-point multiplication algorithms for microprocessors


A.C. Davies

Indexing term: Digital arithmetic

Abstract: Multiplication algorithms which are appropriate for 8-bit microprocessors are discussed, particularly within the context of signal-processing needs, where minimisation of execution time is important. The trade-offs which exist between minimum memory, minimum execution time and minimum cost are discussed in detail, and specific examples based on the Z80 processor are included to provide a quantitative comparison of the alternatives.

1 Introduction

The following Sections present a detailed survey of alternative methods of implementing fixed-point multiplication for a microprocessor, and emphasise the trade-offs which exist between memory requirement and execution time. To be specific and to provide a quantitative indication of execution times and memory requirements, all examples are based on the Z80 processor, which is widely used and is one of the faster 8-bit processors. General familiarity with the Z80 registers and instruction set is assumed. The instruction set mnemonics are rather more self-explanatory than those of many other microprocessors, which should make the examples easier to follow.

Multiplication is considered, particularly within the context of linear signal-processing operations (such as convolution, correlation and digital filtering) which typically involve many multiplications of signal variables (data) by constants. The requirement is thus to calculate y = ax, where y and x are variables and a is a constant coefficient. 8-bit precision is assumed for a and x, so that y is a 16-bit quantity (which may or may not be subsequently truncated to 8 bits, depending upon the application).

Presently available 8-bit n-m.o.s. microprocessors are too slow for most signal-processing applications, but their low cost and the opportunity to use them in 'intelligent' signal-processing instruments which the user can program on site to carry out a variety of different processing functions make them attractive for those applications for which they are fast enough. The multiplication time is usually the critical factor, and the results which are given in this paper should make it easier to determine whether a microprocessor is fast enough for a specified application of this kind.

The machine cycle of the Z80 is of variable length (according to the instruction being executed). Execution times are therefore quoted in clock periods (T-states), using the symbol T for these units. Thus, if the processor has a 4 MHz clock, one T-unit is 250 ns, and division by 4 gives the time in microseconds. Memory requirements are simply stated in bytes, denoted by B.

To facilitate comparison of the various algorithms presented, it is assumed that the multiplicand x is initially in register D or E, that the multiplier a is (if applicable) in register C, and that the product y is left in register pair HL (or, where this is inconvenient, in DE).

Paper T370C, first received 20th November 1978 and in revised form 22nd March 1979
Dr. Davies is with the Department of Electrical & Electronic Engineering, The City University, Northampton Square, London, EC1V 0HB, England

COMPUTERS AND DIGITAL TECHNIQUES, JUNE 1979, Vol. 2, No. 3

The contents of any register R will be denoted by <R>, except where no ambiguity could arise from omitting the brackets, and register transfers denoted by an equals sign preceded by a colon.

Section 2 presents methods for multiplying positive integers, Section 3 describes the modifications required for signed (two's complement) integers, and Section 4 describes the modifications required for signed fractions.

In addition to multiplication by software, multiplication with a hardware multiplier is also considered. The hardware multiplier is assumed to be interfaced as four successive memory locations (see, for example, Reference 1 for details of how this may be done) and sufficiently fast that the processor does not have to wait for the multiplier to be ready.

2 Multiplication of positive integers

2.1 Minimum-memory multiplication

Conventional multiplication involves adding a number of partial products formed by multiplying shifted versions of the multiplicand by successive digits of the multiplier. In the binary case, these digits are 0 and 1, so all that is needed is to add those partial products corresponding to the '1' digits of the multiplier. This algorithm can be expressed as follows:

    ; multiplicand in D
    E := 0
    HL := 0
    B := 8
    while (B .ne. 0) do
        shift DE right
        shift multiplier left into CARRY
        if (CARRY .ne. 0) HL := <HL> + <DE>
        B := <B> - 1
    end do
    ; product in HL

It would, of course, be equally valid to start with the multiplicand in E, and shift the multiplier right and DE left each time. Converting this into Z80 assembly language gives a multiplication routine essentially the same as one given in Reference 2, p. 285, although slightly faster:


    ; MULTIPLIER IN C, MULTIPLICAND IN D, INITIALLY.
    SETUP   XOR  A            ; CLEAR A         1B, 4T
            LD   E,A          ; CLEAR E, H, L   1B, 4T
            LD   H,A                            1B, 4T
            LD   L,A                            1B, 4T
            LD   B,8          ; COUNTER         2B, 7T
            LD   A,C          ; MULT. TO A      1B, 4T
    LOOP    SRL  D                              2B, 8T
            RR   E                              2B, 8T
            RLCA                                1B, 4T
            JR   NC,NEXT-$                      2B, 12(7)T
            ADD  HL,DE                          1B, 11T
    NEXT    DJNZ LOOP-$                         2B, 13(8)T
    ; PRODUCT IN HL, MULTIPLIER IN C, A, MULTIPLICAND IN E.

Shifting DE right is costly in both execution time and memory, since there is no 16-bit shift instruction for DE. There is a left-shift instruction for HL (i.e. ADD HL,HL) which can be used by making use of the equivalence between shifting the multiplicand through DE in one direction and shifting HL in the opposite direction. The loop then becomes

    LOOP    ADD  HL,HL
            RLCA
            JR   NC,NEXT-$
            ADD  HL,DE
    NEXT    DJNZ LOOP-$

After completion of 8 shifts left, the initial contents of H are lost; H can therefore be used instead for the multiplier, so that the shifting left of HL simultaneously shifts multiplier digits into the carry flag, making the RLCA instruction unnecessary. This method is given in Reference 3, p. 7-13.
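The register trick described above can be checked with a small behavioural model. The following is a minimal sketch in Python, not part of the original paper: the function name `mul8_shift_add` is hypothetical, and the 16-bit register pair HL is modelled as a plain integer masked to 16 bits.

```python
def mul8_shift_add(multiplier: int, multiplicand: int) -> int:
    """Model of the loop with the multiplier held in H and L cleared."""
    hl = (multiplier & 0xFF) << 8      # H = multiplier, L = 0
    de = multiplicand & 0xFF           # D = 0, E = multiplicand
    for _ in range(8):                 # B counts the eight iterations
        hl <<= 1                       # ADD HL,HL
        carry = hl >> 16               # multiplier bit shifted out of H
        hl &= 0xFFFF
        if carry:                      # JR NC skips the add when carry is clear
            hl = (hl + de) & 0xFFFF    # ADD HL,DE
    return hl                          # product in HL
```

The partial product never reaches the bits of H that still hold unused multiplier digits, which is why one shift instruction can serve both purposes for unsigned operands.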

The minimum-memory algorithm is then as follows:

    ; MULTIPLIER IN C, MULTIPLICAND IN E, INITIALLY.
    SETUP   LD   D,0                            2B, 7T
            LD   L,D          ; CLEAR D, L      1B, 4T
            LD   B,8          ; COUNTER         2B, 7T
            LD   H,C          ; MULT. TO H      1B, 4T
    LOOP    ADD  HL,HL                          1B, 11T
            JR   NC,NEXT-$                      2B, 12(7)T
            ADD  HL,DE                          1B, 11T
    NEXT    DJNZ LOOP-$                         2B, 13(8)T
    ; PRODUCT IN HL, MULTIPLIER IN C, MULTIPLICAND IN E.

Execution times for the conditional jumps depend on whether the condition is or is not met. The times quoted in parentheses are for the case where it is not met. The execution time depends upon the result of the JR NC conditional branch, and hence depends on the Hamming weight w of the multiplier. (The Hamming weight of a binary number is defined simply as the number of ones in it.)

An execution flow diagram is shown in Fig. 1a, with a simplified execution-time equivalent in Fig. 1b; branch weights are in T-units, and the integers in square brackets show the number of times a particular section is executed. It can be seen that the execution time e is given by

e = (305 + 6w)

and the memory requirement m is given by

m = 12

Fig. 1 Minimum-memory algorithm
a Execution-flow diagram
b Simplified equivalent of (a) (Unlabelled branches have zero execution time)

2.2 Minimum-time multiplication

By opening up the outer loop and eliminating the need for the counter (B), the execution time can be significantly reduced, at the expense of increased memory:

    SETUP   LD   D,0          }
            LD   L,D          }  4B, 15T
            LD   H,C          }

    Repeat step-i code for i = 0 to 7:
    Step-i begin:   ADD  HL,HL      }
                    JP   NC,$+4     }  5B, 21(32)T
    Step-i end:     ADD  HL,DE      }

Note that 'repeat' does not mean an iterative software loop, but the inclusion of multiple copies of the code.

The execution time e and memory requirement m are thus given by

e = (15 + 8 x 21 + 11w) = (183 + 11w)

m = 4 + 8 x 5 = 44

The 3-byte absolute jump instruction (JP) used in place of the 2-byte relative jump (JR) reduces the execution time from 199 + 6w.

2.3 Multiplication by known multiplier

If the multiplier is known in advance, the conditional jump can be eliminated, further reducing the execution time:

    SETUP   LD   D,0          }
            LD   L,D          }  3B, 11T

    Repeat step-i code for i = 0 to 7:
    Step-i begin:   ADD  HL,HL                          1B, 11T
    Step-i end:     ADD  HL,DE   (omit if bit i = 0)    1B, 11T

Hence,

e = 11 + 8 x 11 + 11w = 99 + 11w

m = 3 + 8 + w = 11 + w
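The execution-time and memory expressions of Sections 2.1 to 2.3 all depend only on the Hamming weight w of the multiplier, so they are easy to tabulate for a given coefficient. The following is a minimal sketch in Python (the function names and dictionary keys are hypothetical labels, not the paper's), using the expressions exactly as quoted in those Sections.

```python
def hamming_weight(a: int) -> int:
    """Number of ones in the 8-bit multiplier."""
    return bin(a & 0xFF).count("1")

def cost(a: int) -> dict:
    """(T-states, bytes) for each routine, from the quoted expressions."""
    w = hamming_weight(a)
    return {
        "minimum memory (2.1)": (305 + 6 * w, 12),
        "minimum time (2.2)": (183 + 11 * w, 44),
        "known multiplier (2.3)": (99 + 11 * w, 11 + w),
    }

# Example: multiplier 00111110 (w = 5); with a 4 MHz clock, T/4 gives microseconds.
for name, (t, b) in cost(0b00111110).items():
    print(f"{name}: {t} T = {t / 4:.2f} us, {b} bytes")
```

Tabulating the three routines side by side in this way shows directly how much execution time each additional block of program memory buys for a particular coefficient.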


2.4 Multiplication using a hardware multiplier

If a hardware multiplier is interfaced as memory, such that locations MUL, MUL + 1 are its inputs and MUL + 2, MUL + 3 are the upper and lower bytes, respectively, of the double-length product, the following code is required:

    ; MULTIPLIER IN C, MULTIPLICAND IN E
            LD   HL,MUL                                 3B, 10T
            LD   (HL),C       ; MULTIPLIER ENTERED      1B, 7T
            INC  L                                      1B, 4T
            LD   (HL),E       ; MULTIPLICAND ENTERED    1B, 7T
            INC  L                                      1B, 4T
    ; WAIT HERE IF MULTIPLIER IS SLOW
            LD   D,(HL)                                 1B, 7T
            INC  L                                      1B, 4T
            LD   E,(HL)                                 1B, 7T
    ; PRODUCT IN DE

This differs slightly from the software routines because the product is left in DE rather than HL, since HL is needed as a memory pointer. (The instruction EX DE,HL could be added if exact equivalence were required.) It is assumed that MUL is so chosen that there is no carry from L to H as HL is incremented (so that INC L may be used rather than the slower INC HL). To avoid overwriting of the multiplicand in E, it could initially be in B instead. If the processor does not have to wait for the multiplier, the execution time and memory requirement are thus given by

e = 50    m = 10

If the multiplier value is known in advance, it is possible to latch it permanently into the hardware multiplier (in effect, the hardware equivalent of the software of Section 2.3). This is unlikely to be practicable since a separate hardware multiplier would be needed for each different multiplier value, but would enable LD (HL),C and the following INC L instruction to be eliminated, reducing time and memory to 39T and 8B, respectively.

2.5 Table-look-up multiplication for known multiplier

For multiplication by a known constant, it is possible to store in advance all the 256 possible products (so that for each multiplier value 512 bytes are needed to give a 16-bit product). The multiplicand value is used as address (relative to a base pointer) and directly recovers the stored product. Suppose that the upper (most significant) byte of the product is stored in successive memory locations from BASE, and the lower (least significant) byte stored from BASE + 256. Multiplication then requires the following instructions:

    ; MULTIPLICAND IN E
            LD   H,BASE       ; LOAD UPPER BYTE OF BASE             2B, 7T
            LD   L,E          ; ADD MULTIPLICAND TO OFFSET BASE     1B, 4T
            LD   D,(HL)       ; GET UPPER BYTE OF PRODUCT           1B, 7T
            INC  H                                                  1B, 4T
            LD   E,(HL)       ; GET LOWER BYTE OF PRODUCT           1B, 7T
    ; PRODUCT IN DE
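The table layout described above (upper bytes from BASE, lower bytes from BASE + 256) can be modelled as follows. This is a minimal sketch in Python, not the paper's code: `build_table` and `lookup_multiply` are hypothetical names, and a 512-byte `bytearray` stands in for the page-aligned block of memory.

```python
def build_table(a: int) -> bytearray:
    """512-byte product table for a fixed multiplier a (0..255)."""
    table = bytearray(512)
    for x in range(256):
        product = (a * x) & 0xFFFF
        table[x] = product >> 8            # upper byte, stored from BASE
        table[256 + x] = product & 0xFF    # lower byte, stored from BASE + 256
    return table

def lookup_multiply(table: bytearray, x: int) -> int:
    """Two indexed reads recover the 16-bit product for multiplicand x."""
    d = table[x]          # LD D,(HL) with L = multiplicand
    e = table[256 + x]    # INC H, then LD E,(HL)
    return (d << 8) | e   # product in DE
```

Any routine, however slow, can fill the table, because only the two reads take place during the speed-critical phase.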

BASE is assumed to be chosen as a multiple of 256 (i.e. the lower byte is zero), so that no carry from L to H can arise from adding the multiplicand to BASE.

e = 29    m = 6

This is faster and uses less program memory than the use of a hardware multiplier for multiplication by a known multiplier (and would almost certainly be cheaper).

2.6 Automatic generation of minimum-time code

Of the software algorithms, the one which offers the shortest execution time is restricted to cases where the multiplier is known in advance (i.e. at assembly time). There are situations where the multiplier may be calculated during an earlier phase of program execution (where speed is not critical) for subsequent use in a phase where the multiplication must be as fast as possible. An example is a programmable digital filter, in which the user enters a filter specification, from which the filter coefficients are calculated, after which execution of the speed-critical filtering program starts. Under such circumstances, the program code for a minimum-time multiplication can be generated automatically, once the coefficient is known, and copied into a section of read-write memory for subsequent execution. This is easier than it may seem, and, to illustrate how it may be done, suppose that the multiplication code is to be placed (in r.a.m.) in a storage area called MULT, and that the required code segments are stored (in r.o.m.) in three blocks:

    BLOCK 1:    LD   D,0
                LD   L,D
    BLOCK 2:    ADD  HL,HL
    BLOCK 3:    ADD  HL,DE

An automatic-generation algorithm is then as follows:

    copy BLOCK 1 to MULT
    i := 0
    while (i < 8) do
        copy BLOCK 2 to MULT
        if (multiplier bit i = 1) do copy BLOCK 3 to MULT
        i := i + 1
    end do

If BLOCK 1, BLOCK 2 and BLOCK 3 are stored in sequential locations, a possible implementation in Z80 assembly-language is:

    ; SEQUENCE FOR GENERATING SPEED-OPTIMISED MULTIPLY CODE
    ; MULTIPLIER IN ACCUMULATOR.
    START   LD   B,3
            LD   IX,BLOCK 1
            LD   IY,MULT
    FIRST   LD   H,(IX+0)     ; COPY ONE BYTE
            LD   (IY+0),H     ; VIA H REGISTER
            INC  IX
            INC  IY
            DJNZ FIRST-$
    ; BLOCK 1 COPIED NOW
            LD   B,8
    SECOND  LD   IX,BLOCK 2
            LD   H,(IX+0)
            LD   (IY+0),H
            INC  IX
            INC  IY
            RLCA              ; MOVE MULTIPLIER BIT I INTO CARRY
            JR   NC,TEST-$
            LD   H,(IX+0)     ; COPY BLOCK 3
            LD   (IY+0),H
            INC  IY
    TEST    DJNZ SECOND-$


The table-look-up method (Section 2.5) can also be adapted for this situation: after calculating the multiplier, any (slow) multiplication routine can be used to calculate all possible products and store them in 512 bytes of r.a.m., ready for use in the subsequent fast-execution phase.
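The automatic-generation idea of Section 2.6 can be illustrated with a short model. The following is a minimal sketch in Python rather than Z80 code, and it makes several assumptions of its own: the generated routine is represented as a list of mnemonic strings instead of opcode bytes, the multiplier bits are taken most significant first (as the RLCA instruction delivers them), and `run` is a hypothetical interpreter that understands only the three instruction forms emitted.

```python
def generate_multiply_code(multiplier: int) -> list:
    """Emit the straight-line multiply sequence for a known 8-bit multiplier."""
    code = ["LD D,0", "LD L,D"]                  # BLOCK 1
    for i in range(7, -1, -1):                   # most significant bit first
        code.append("ADD HL,HL")                 # BLOCK 2
        if (multiplier >> i) & 1:
            code.append("ADD HL,DE")             # BLOCK 3
    return code

def run(code: list, multiplicand: int) -> int:
    """Execute the generated sequence; H, L and D start with arbitrary contents."""
    h, l, d, e = 0xAB, 0xCD, 0x55, multiplicand & 0xFF
    for op in code:
        hl, de = (h << 8) | l, (d << 8) | e
        if op == "LD D,0":
            d = 0
        elif op == "LD L,D":
            l = d
        elif op == "ADD HL,HL":
            hl = (hl + hl) & 0xFFFF
            h, l = hl >> 8, hl & 0xFF
        elif op == "ADD HL,DE":
            hl = (hl + de) & 0xFFFF
            h, l = hl >> 8, hl & 0xFF
    return (h << 8) | l                          # product in HL
```

For the multiplier 00111110 the generated list holds 2 + 8 + 5 = 15 instructions, and H need not be cleared because its initial contents are shifted out by the eight doublings.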

2.7 Ternary coding to minimise execution time

For all the software-multiplication routines described, execution time increases with increase in Hamming weight of the multiplier. If the multiplier can be decomposed into the difference of two integers, each of much lower Hamming weight, execution time can be reduced. For example, suppose that the multiplier is 00111110 (= 62). This is the difference between 01000000 and 00000010 (i.e. 64 - 2). Adding one partial product (for 64) and subtracting one partial product (for 2) is then required, which is faster than adding five partial products (as required by the original form of the multiplier). This may be regarded as a ternary encoding of the multiplier to the form (0 + 0 0 0 0 - 0) (References 5 and 6). For the Z80, adding a partial product is achieved by the instruction ADD HL,DE (= 11T). Subtracting a partial product requires two instructions:

    AND  A                      1B, 4T
    SBC  HL,DE                  2B, 15T     (total = 19T)

The AND instruction is used simply as a means of clearing the carry flag before the subtract instruction, since a 16-bit subtract without carry is not available.

Since w = 5 for 00111110, the execution time using the Section 2.3 code is 154T. The ternary coding method involves the following sequence:

    LD   D,0
    LD   L,D
    ADD  HL,HL                  (2 times)
    ADD  HL,DE
    ADD  HL,HL                  (5 times)
    AND  A
    SBC  HL,DE
    ADD  HL,HL

which has an execution time of 129T.

The requirement for the decomposition may be expressed more formally as follows: given a multiplier a, decompose it into the difference

a = b - c

such that f = αw(b) + βw(c) is minimised, where w(b) and w(c) denote the Hamming weights of b and c, respectively, and α, β are constants proportional to the execution times for adding and subtracting a partial product (e.g. for the Z80 instructions discussed above, α = 11 and β = 19).

For the decomposition to be advantageous, f must be less than αw(a). The multiplier must, of course, be known before execution since the required decomposition has to be worked out in advance. The method for automatic code generation described in Section 2.6 can be adapted to take advantage of this ternary coding technique by incorporating an algorithm such as that given in Appendix 8.1, which determines f, b and c for a given value of a.
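The decomposition requirement can be made concrete with a small exhaustive search. The following is a minimal sketch in Python following the description above; the function names are hypothetical, α = 11 and β = 19 are the Z80 figures quoted in the text, and the early-exit condition (stop once f has reached α + β) is an assumption based on the observation that no decomposition with nonzero b and c can cost less.

```python
def hamming_weight(v: int) -> int:
    return bin(v).count("1")

def ternary_decompose(a: int, n: int = 8, alpha: int = 11, beta: int = 19):
    """Return (f, b, c) with a = b - c and f = alpha*w(b) + beta*w(c) minimal."""
    fmin, cmin = alpha * hamming_weight(a), 0   # start from the undecomposed form
    b, c = a, 0
    while b < 2**n - 1 and fmin > alpha + beta:
        b += 1
        c += 1
        if b & c == 0:                          # b and c must not share a '1' bit
            f = alpha * hamming_weight(b) + beta * hamming_weight(c)
            if f < fmin:
                fmin, cmin = f, c
    return fmin, a + cmin, cmin

print(ternary_decompose(0b00111110))            # multiplier 62 from the text
```

For the multiplier 62 the search settles on b = 64 and c = 2, the decomposition used in the example above, at a cost of 30T for the add and subtract operations against 55T for five additions.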

3 Multiplication of signed (2's complement) integers

3.1 Use of a positive-integer multiplication routine

Given an algorithm for multiplying positive integers, there are two straightforward ways of modifying it for signed integers. If a7 and x7 denote the sign bits of multiplier a and multiplicand x, respectively, these are as follows:

Method 1: Change the sign of multiplier and/or multiplicand if they are negative, multiply and if one but not both was negative change the sign of the product. Expressed more formally:

    atest := 0
    xtest := 0
    if (a7 = 1) do a := -a
                   atest := 1
    if (x7 = 1) do x := -x
                   xtest := 1
    y := positive-integer multiplication of a, x
    if ((atest .xor. xtest) = 1) y := -y

Method 2: Multiply as if the integers were positive, and correct the result (see for example Reference 4, p. 163)

$$\text{corrected } ax = ax - 2^8 x_7 \times a - 2^8 a_7 \times x$$

Expressed formally:

    y := positive-integer multiplication of a, x
    if (a7 = 1) subtract x from upper byte of y
    if (x7 = 1) subtract a from upper byte of y

When a or x are not readily available after the positive-integer multiplication, the correction can be done in advance.

The first method appears to offer no advantages, and so the second is used in the example in Section 3.2. A third alternative is Booth's algorithm (Reference 7), described in Section 3.3.
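Method 2 can be checked numerically. The following is a minimal sketch in Python (the function names are hypothetical) which applies the correction after an unsigned multiplication of the two bit patterns and interprets the 16-bit result as a 2's complement number; it models the arithmetic only, not the register usage of any particular routine.

```python
def to_signed(v: int, bits: int) -> int:
    """Interpret the low 'bits' bits of v as a 2's complement number."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v

def signed_mul_by_correction(a: int, x: int) -> int:
    a_u, x_u = a & 0xFF, x & 0xFF      # 8-bit patterns of the signed operands
    y = a_u * x_u                      # positive-integer multiplication
    if a_u & 0x80:                     # a7 = 1
        y -= x_u << 8                  # subtract x from upper byte of y
    if x_u & 0x80:                     # x7 = 1
        y -= a_u << 8                  # subtract a from upper byte of y
    return to_signed(y, 16)
```

The correction works because each negative operand's bit pattern exceeds its true value by 256, so the unsigned product carries an excess of 256 times the other operand, while the remaining cross term is a multiple of 65536 and vanishes in 16-bit arithmetic.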

3.2 Signed multiplication by correction method

The Z80 instructions to implement this are

    ; MULTIPLIER IN C, MULTIPLICAND IN E, INITIALLY.
            XOR  A
            LD   D,A
    ALPHA   BIT  7,C          ; omit BIT and JR if multiplier known to be negative;
            JR   Z,ONE-$      ; omit BIT, JR and SUB E if multiplier
            SUB  E            ; known to be positive
    ONE     BIT  7,E
            JR   Z,TWO-$
    OMEGA   SUB  C
    TWO     LD   L,A          ; CORRECTION TO L (WILL MOVE TO H)
            LD   H,C          ; MULT. TO H
    ; POSITIVE INTEGER MULTIPLICATION ROUTINE TO BE
    ; INSERTED AFTER THIS.

This involves adding 10 bytes to routines such as those given in Sections 2.1 and 2.2 and does not require any additional registers except for the accumulator. Fig. 2 shows an execution flow diagram for the added instructions (from ALPHA to OMEGA). The added time ea is given by*

ea = 38 + (1 if multiplier positive) + (1 if multiplicand positive)

Fig. 2 Execution flow diagram for added code for signed multiplication

*The definition of the sgn function has been altered slightly to suit the application: sgn(r) = 1 for r ≥ 0, sgn(r) = -1 for r < 0 (conventionally, sgn(0) should be 0 and not 1)


Therefore

ea = 39 + 0.5[sgn(C) + sgn(E)]

The total time is therefore given by

e = 222 + 11w + 0.5[sgn(C) + sgn(E)]

for the minimum-time algorithm and by

e = 344 + 6w + 0.5[sgn(C) + sgn(E)]

for the minimum-memory algorithm.

If the multiplier is known in advance, the execution time and memory requirement depend on whether it is known to be positive or known to be negative (since SUB E is only required for the latter case).

For multiplication with a hardware multiplier, there is no change in memory or speed if a 2's complement multiplier is used.† If a positive-integer multiplier is used,‡ the correction cannot be completed before the multiplication since the multiplier output would overwrite the correction. However, the correction can be left in the accumulator and the hardware multiplication performed, and then the following instructions executed:

    ADD  A,D                    1B, 4T
    LD   D,A                    1B, 4T

The added time and memory are then given by

ea = 47 + 0.5[sgn(C) + sgn(E)]

ma = 12

giving totals of

e = 97 + 0.5[sgn(C) + sgn(E)]

m = 22

3.3 Booth's algorithm

Booth's algorithm is similar to conventional binary multiplication, but involves subtraction as well as addition of partial products. The 2's complement product is generated directly.

Denoting bit i of the multiplier by m(i) and defining m(-1) as zero, the algorithm may be expressed as follows:

    ; multiplicand in D
    E := 0
    HL := 0
    i := 8
    while (i .ne. 0) do
        if (m(i - 1) = 0 .and. m(i - 2) = 1) HL := <HL> + <DE>
        if (m(i - 1) = 1 .and. m(i - 2) = 0) HL := <HL> - <DE>
        shift DE right
        i := i - 1
    end do
    ; product in HL

†For example, the TRW MPY-8 bipolar multiplier, which multiplies in 160 ns.
‡For example, the GEC 8807A, an m.o.s. multiplier, which multiplies in 5 μs. This particular component is too slow for the Section 2.4 code unless some delay is added. However, it costs much less than the TRW component.

Conversion of this algorithm into assembly language requires three conditional branch instructions, and the resulting code executes more slowly and requires more memory than the correction method described in Section 3.2. For completeness, however, an assembly-language version is given below. Shifting DE right has been replaced by shifting HL left, for the reasons described in Section 2.1. However, when the multiplicand in E is negative, D must be filled with 'ones' so that DE is a 16-bit negative integer. This has the added unwelcome complication that it is no longer possible to put the multiplier in H (so that shifting the multiplier and shifting HL can be done in one instruction), because adding D to H would overwrite the multiplier. In the code which follows, the multiplier is shifted in the accumulator:

    ; SIGNED MULTIPLICATION BY BOOTH'S ALGORITHM
    ; MULTIPLICAND IN E, MULTIPLIER IN C.
    SETUP   LD   D,0
            LD   L,D          ; CLEAR L, D
            LD   B,8          ; COUNTER
            LD   A,C
            BIT  7,E
            JR   Z,LOOP-$
            LD   D,0FFH       ; FILL D WITH FF FOR -VE E
    LOOP    ADD  HL,HL
            SLA  A            ; AFFECTS CY AND SGN FLAGS
            JR   NC,TRYADD-$
    CARRY   JP   M,NEXT
    ; CARRY MUST BE SET TO GET HERE
            CCF
            SBC  HL,DE
            JP   NEXT
    TRYADD  JP   P,NEXT
            ADD  HL,DE
    NEXT    DJNZ LOOP-$
    ; PRODUCT IN HL, MULTIPLICAND IN DE
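The bit-pair rule of Booth's algorithm can be modelled independently of the register details. The following is a minimal sketch in Python (the function name is hypothetical): it scans the multiplier from the most significant bit, doubling the accumulated product at each step as the left shift of HL does, and adds or subtracts the multiplicand according to the pair m(i - 1), m(i - 2).

```python
def booth_multiply(multiplier: int, multiplicand: int) -> int:
    """8-bit Booth multiplication; both arguments are integers in -128..127."""
    m = multiplier & 0xFF

    def bit(k: int) -> int:
        return (m >> k) & 1 if k >= 0 else 0     # m(-1) is defined as zero

    hl = 0
    for i in range(8, 0, -1):
        hl *= 2                                   # shift the partial product left
        if bit(i - 1) == 0 and bit(i - 2) == 1:
            hl += multiplicand                    # add a partial product
        elif bit(i - 1) == 1 and bit(i - 2) == 0:
            hl -= multiplicand                    # subtract a partial product
    hl &= 0xFFFF                                  # keep 16 bits, as HL would
    return hl - 0x10000 if hl & 0x8000 else hl
```

The rule amounts to using the digit m(i - 2) - m(i - 1) at each position, and these digits sum to the 2's complement value of the multiplier, so no separate sign correction is needed.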

3.4 Table-look-up multiplication for known signed multiplier

The table-look-up method for multiplication by a known multiplier, described in Section 2.5, could be linked with a correction method as described in Section 3.2. However, this is unnecessary, since the table may be used to store the products of both negative and positive numbers, in which case exactly the same instructions are required.

3.5 Quarter-squares table-look-up multiplication

When the multiplier is not known in advance, a direct table-look-up method is not practicable, because the 65 536 possible products would have to be stored. However, it is possible to decompose any product into a difference of two squares:

$$ax = \frac{(a + x)^2}{4} - \frac{(a - x)^2}{4}$$

By storing all the squares divided by four of the integers in a table, it is possible to calculate the product ax by two table-look-up operations together with some addition and subtraction operations. This method was proposed for a hardware-based multiplier in Reference 9.

Initially, it will be assumed that the multiplier a and multiplicand x are nonnegative 8-bit integers less than 128 (so that the most significant bit is zero). The consequences of removing this restriction will be considered later.

The sum (a + x) will therefore be a nonnegative integer less than 256, so that the table needs 256 locations for the most significant byte and 256 locations for the least significant byte of each stored quarter square in this range.

The difference (a - x) may be positive or negative, depending on the relative magnitudes of a and x, but, for the same assumptions, must be less than 128 in magnitude. Provided that, when negative, its sign is changed, the same table of quarter squares may therefore be used for this term. Assuming that the lower bytes are stored from an address BASE which is a multiple of 256, and the upper bytes stored from BASE + 256, the whole table occupies 512 bytes, and multiplication involves the following steps:

    H := Upper byte of BASE
    L := a + x
    E := <<HL>>
    H := <H> + 1
    D := <<HL>>
    ; (a + x)^2/4 now in DE
    L := a - x
    if (<L> < 0) L := -L
    B := <<HL>>
    H := <H> - 1
    C := <<HL>>
    ; (a - x)^2/4 now in BC
    HL := <DE> - <BC>
    ; product in HL

The notation <<HL>> denotes the contents of the location addressed by HL.

Since the squares of the integers are divided by four before storing in the table, some of the values will not be integers. Truncating them to integers before storage might be expected to lead to errors in the computed products for some values of multiplier and multiplicand. However, this turns out not to be the case, for the following reason.

For an even integer, its square is obviously exactly divisible by four, and no truncation is required.

For an odd integer, of the form 2n + 1, its square is of the form 4n^2 + 4n + 1, and so the remainder after division by 4 is 1.

The maximum truncation error that can occur in evaluating either (a + x)^2/4 or (a - x)^2/4 is therefore one quarter, which is never large enough to make the value of

$$\mathrm{int}|(a + x)^2/4| - \mathrm{int}|(a - x)^2/4|$$

different from the product ax. The notation int| | denotes the 'integer part of'.
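The claim that truncation does no harm can be confirmed by direct enumeration. The following is a minimal sketch in Python (the names `QUARTER_SQUARES` and `quarter_square_multiply` are hypothetical) for the restricted case of nonnegative operands less than 128.

```python
# Table of truncated quarter squares, indexed by an integer 0..255.
QUARTER_SQUARES = [(n * n) // 4 for n in range(256)]

def quarter_square_multiply(a: int, x: int) -> int:
    """Product of a and x for 0 <= a, x < 128 using two table look-ups."""
    plus = QUARTER_SQUARES[a + x]          # int|(a + x)^2/4|
    minus = QUARTER_SQUARES[abs(a - x)]    # int|(a - x)^2/4|, sign removed first
    return plus - minus
```

The result is exact because (a + x) and (a - x) differ by 2x and are therefore either both even or both odd, so the two truncation errors are equal and cancel in the subtraction.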

Although assumed to be valid only for nonnegative integers, the program sequence above also gives the correct product for those signed (2's complement) integers for which (a + x) is positive.

To see how to overcome this restriction, so that the method can be used for all 8-bit 2's complement numbers, it is necessary to look carefully at the alternative ways in which the most significant bit (bit 7) of (a + x) becomes either 1 or 0.

If a and x are both positive, bit 7 will be 1 whenever the sum (a + x) is greater than 127.

If a or x (or both) are negative, bit 7 will be 1 whenever the sum (a + x) is negative.

In the first case, no change is required. In the second case, the sign of (a + x) needs to be changed before addressing the table.

Similarly, bit 7 can become 0 in two ways, one of which requires a sign change.

The various alternatives can be distinguished by testing the overflow flag as well as the sign flag.

The alteration needed to make the method valid for all 2's complement integers is then

    L := a + x
    if (overflow) then if (L positive) L := -L
    else if (L negative) L := -L
    endif

A similar modification is needed for (a - x).
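The overflow and sign test can be modelled as follows. This is a minimal sketch in Python with hypothetical names: `table_index` receives the true sum or difference, derives the 8-bit result together with the overflow and sign conditions that the processor flags would report, and applies the rule above to obtain the table address.

```python
QUARTER_SQUARES = [(n * n) // 4 for n in range(256)]

def table_index(true_value: int) -> int:
    """Table address for |true_value|, derived from the 8-bit result and flags."""
    l = true_value & 0xFF                         # 8-bit result left in L
    overflow = not -128 <= true_value <= 127      # overflow flag
    negative = bool(l & 0x80)                     # sign flag (bit 7 of result)
    if overflow:
        if not negative:
            l = -l & 0xFF                         # L := -L
    elif negative:
        l = -l & 0xFF                             # L := -L
    return l

def signed_quarter_square_multiply(a: int, x: int) -> int:
    """Product of two 2's complement integers in -128..127."""
    return QUARTER_SQUARES[table_index(a + x)] - QUARTER_SQUARES[table_index(a - x)]
```

In this sketch the single pair a = x = -128 is not handled, because the magnitude of the sum is then 256 and falls outside a 256-entry table; every other pair of 8-bit operands yields an index in the range 0 to 255.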

An assembly-language routine for the quarter-squaresmethod for 2's complement numbers is as follows:

    ; QUARTER-SQUARES MULTIPLY ROUTINE
    ; MULTIPLIER (A) IN B INITIALLY
    ; MULTIPLICAND (X) IN C INITIALLY
    ; BASEU = UPPER BYTE OF BASE POINTER TO
    ; LOOK-UP TABLE OF QUARTER-SQUARES
    ; LOWER BYTE MUST BE ZERO
    BASEU   EQU  24H
    BEGIN   LD   H,BASEU
            LD   A,C
            ADD  A,B
            JP   PE,OVFL      ; JUMP IF OVERFLOW
            JP   P,NNG        ; JUMP IF POSITIVE
            JP   CHG          ; NO OVF & NEG
    OVFL    JP   M,NNG
    CHG     NEG               ; CHANGE SGN IF NEG & NO OVF
                              ; OR POS AND OVFL
    NNG     LD   L,A          ; LOAD LO BYTE ADDR
            LD   E,(HL)
            INC  H
            LD   D,(HL)       ; (X + A)**2/4 NOW IN DE
            LD   A,C
            SUB  B            ; X - A NOW IN ACC
            JP   PE,OVE
            JP   P,NNH
            JP   CHA
    OVE     JP   M,NNH
    CHA     NEG               ; CHANGE SGN IF NEGATIVE
    NNH     LD   L,A          ; LOAD HI BYTE ADDR
            LD   B,(HL)
            DEC  H
            LD   C,(HL)       ; (X - A)**2/4 NOW IN BC
            EX   DE,HL        ; (X + A)**2/4 NOW IN HL
            AND  A            ; CLEAR CARRY FLAG
            SBC  HL,BC
    ; RESULT IN HL
    ; MULTIPLIER AND MULTIPLICAND LOST

This routine requires 46 bytes, and the execution time does not exceed 166 T units for any values of multiplier and multiplicand. Except for the use of a hardware multiplier, this is therefore the fastest method of those not restricted to a known multiplier.

4 Fractional multiplication

In digital filtering and similar signal-processing operations, the signal values and multiplier coefficients must be scaled in such a way that overflow does not occur. Normally, they must not be greater than unity in magnitude. With 8-bit precision, this results in a range from -1 to +127/128 (corresponding to 10000000 and 01111111, respectively), and the 16-bit product of such quantities has a range from -1 to +32767/32768. An integer multiplication routine requires an adjustment to be made to allow for this scaling. Consider, for example, the multiplication of 0.5 by 0.5 to give 0.25. In the binary notation this means 01000000 (= 64/128) multiplied by itself should give 0010 0000 0000 0000 (= 8192/32768). However, if 01000000 is multiplied by itself using an integer-multiplication routine the result is 0001 0000 0000 0000 (= 4096/32768), which corresponds to 0.125. It is therefore necessary to shift the final product left by one to get the correct result. (In the event that the multiplier is known in advance to be less than 0.5, it may be multiplied by two and the postmultiplication left shift omitted, providing an opportunity to increase precision by one more significant digit.)

Modification of the integer routines described in Sections 2 and 3 for fractions simply involves addition of the instruction

    ADD  HL,HL                  (1B, 11T)

to double the product in the HL register pair. In the case of the hardware multiplications, where the product is in DE, two instructions are needed:

    EX   DE,HL                  (1B, 4T)
    ADD  HL,HL                  (1B, 11T)

which transfers the doubled result to HL (there is no single instruction to perform a left shift of DE).

The double-length product then always ends in 0. However, in applications where the product is rounded or truncated to 8 bits before subsequent use, this is of no consequence.
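The scaling adjustment can be illustrated with the 0.5 x 0.5 example from the text. The following is a minimal sketch in Python with hypothetical names; fractions are held as integers scaled by 128 (8-bit operands) and by 32768 (16-bit product), and no attempt is made to handle the overflowing case of -1 multiplied by -1.

```python
def to_fraction8(v: float) -> int:
    """8-bit fractional code for v, i.e. v scaled by 128."""
    return int(round(v * 128))

def fractional_multiply(a8: int, x8: int) -> int:
    """16-bit fractional product (scaled by 32768) of two 8-bit fractions."""
    product = a8 * x8            # result of the integer multiplication routine
    return product << 1          # ADD HL,HL: shift the product left by one place

half = to_fraction8(0.5)                          # 01000000 = 64
print(fractional_multiply(half, half) / 32768)    # 8192/32768
```

Without the final shift the same operands would give 4096/32768, the value 0.125 noted in the text.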

Rounding to 8 bits requires that the upper byte is incremented when bit 7 of the lower byte is unity. Unfortunately, this apparently trivial requirement increases execution time by up to 20T:

            BIT  7,L                    2B, 8T
            JR   Z,TRUNC-$              2B, 12(7)T
            INC  H                      1B, 4T
    TRUNC

The conditional branch can be avoided at the expense of overwriting the accumulator contents:

            LD   A,L                    1B, 4T
            RLCA                        1B, 4T
            XOR  A                      1B, 4T
            ADC  A,H                    1B, 4T
            LD   H,A                    1B, 4T
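The rounding requirement itself (increment the upper byte when bit 7 of the lower byte is 1) can be stated compactly. The following is a minimal sketch in Python of that arithmetic only; the function name is hypothetical, and the 8-bit wrap-around of the upper byte is modelled because that is how an 8-bit register would behave.

```python
def round_to_upper_byte(hl: int) -> int:
    """Return the upper byte of a 16-bit value, rounded using bit 7 of the lower byte."""
    h, l = (hl >> 8) & 0xFF, hl & 0xFF
    if l & 0x80:                 # bit 7 of the lower byte is 1
        h = (h + 1) & 0xFF       # increment the upper byte (wraps at 8 bits)
    return h
```

For example, 0x1280 rounds to 0x13 while 0x127F is truncated to 0x12.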

5 Forming a sum of products

A common requirement is the calculation of a sum of products:

$$s = \sum_{i=1}^{N} a_i x_i$$

If the coefficients a_i are stored from memory location COEFFS onwards and the variables x_i are stored from DATA onwards, the IX and IY registers can be used as pointers and the result accumulated in HL. The calculation involves two nested iterative loops: the inner loop of code has to be executed 8N times for a sum of N products of 8-bit words so that the best guideline for minimising execution time is to keep this inner loop (whether 'opened-up' or not) as fast as possible.

There are many ways of implementing the sum-of-products procedure. The stack facilities of the Z80 may be used to save the intermediate sum while calculating the next product, at each iteration of the outer loop:

    HL := 0
    i := 1
    while (i ≤ N) do
        push HL to top of stack
        HL := product (a_i, x_i)
        ; get sum of previous products:
        pop top of stack to BC
        ; add to new product in HL:
        HL := <HL> + <BC>
        i := i + 1
    enddo
    ; sum of products in HL
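The outer loop above can be modelled in a few lines. The following is a minimal sketch in Python with a hypothetical function name; a local variable stands in for the stack, the product is formed with the built-in multiply rather than any particular routine, and the running sum is masked to 16 bits as HL would be.

```python
def sum_of_products(coeffs, data) -> int:
    """Accumulate a_i * x_i in a 16-bit register, saving the sum around each product."""
    hl = 0
    for a, x in zip(coeffs, data):
        saved = hl                      # push HL to top of stack
        hl = (a * x) & 0xFFFF           # HL := product (a_i, x_i)
        bc = saved                      # pop top of stack to BC
        hl = (hl + bc) & 0xFFFF         # HL := <HL> + <BC>
    return hl
```

Because the accumulator is only 16 bits wide, the coefficients and data must be scaled so that the total cannot exceed that range; the model simply wraps, as the processor would.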

A hardware multiplier accumulator may be used for sum-of-products calculation where speed is of overriding importance. This involves simply writing successive pairs (a_i, x_i) to the input registers and reading the final product from the output register.*

For applications such as nonrecursive digital filtering, where both N and the coefficients a_i are normally known before execution, the automatic-code-generation procedure described in Section 2.6 can be readily extended to generate a sum-of-products routine optimised for minimum execution time.

6 Conclusions

There are no formal methods of proving execution-time minimality of programs, so that by using different algorithms or by altering the details of their implementation in assembly-language, further reductions in execution time or memory requirement may be possible. Therefore, sufficient details of the software have been included to justify the numerical results presented, and to provide an opportunity to the reader to try to improve on them. All assembly-language routines included have been tested. The methods are, of course, applicable in principle to any computer and not only to 8-bit microprocessors. However, because microprocessors have a short wordlength and are frequently used in low-cost real-time applications, precision, execution speed and memory requirements are factors of particular significance.

The Z80 processor is very similar to the 8080, except for a number of additional c.p.u. registers (not used in most of the examples) and a better selection of rotate, shift, bit test and branch instructions (some of which are used in the examples). The numerical results presented in Fig. 3 are thus not applicable without modification to the 8080. However, the approach used, and the relative merits of the alternative algorithms, should be valid for the 8080 and most other 8-bit processors.

Fig. 3 Multiplication (plots not reproduced; horizontal axis: execution time, 4 MHz clock)
a Positive integer multiplication. 1 = minimum memory, 2 = minimum time, 3 = minimum time, known multiplier, 4 = hardware multiplier, 5 = table look-up with known multiplier+
b Signed 2's complement multiplication. 6 = minimum memory, 7 = minimum time, 8 = quarter-squares table look-up+, 9 = hardware multiplier, 10 = table look-up with known multiplier+
(+ = 512 additional bytes for table)

*For example the TRW TDC1008J, which multiplies 2's complement 8-bit number pairs in 70 ns, and accumulates them up to a 19-bit product. This component is, of course, unnecessarily fast for operation with the Z80 processor.

7 References

1 DAVIES, A.C., and FUNG, Y.T.: 'Interfacing a hardware multiplier to a general-purpose microprocessor', Microprocessors, 1977, 1, pp. 425-432

2 'Z80-Assembly language programming manual'. Zilog Inc., Cupertino, California, USA, Feb. 1977

3 OSBORNE, A., KANE, J., RECTOR, R., and JACOBSON, S.: 'Z80 programming for logic design' (Osborne and Associates, Inc., Berkeley, California, 1978)

4 LEWIN, D.: 'Theory and design of digital computers' (Nelson, London, 1972)

5 TOCHER, K.D.: 'Techniques of multiplication and division in automatic binary digital computers with special reference to a new multiplication process', Q. J. Mech. & Appl. Math., 1958, 11, pp. 364-384

6 WOODWARD, M.E.: 'Microprogrammable digital filter implementation using bipolar microprocessors'. Conference on microprocessors in automation and communications, University of Kent, Canterbury, Sept. 1978, pp. 131-142

7 BOOTH, A.D.: 'A signed binary multiplication technique', Q. J. Mech. & Appl. Math., 1951, 4, pp. 236-240

8 REITWIESNER, G.W.: 'Binary arithmetic' in 'Advances in computers', vol. 1 (Academic Press, N.Y., 1960), pp. 231-308

9 CHANG, T-L.: 'Binary read-only-memory multiplier', Electron. Lett., 1973, 9, pp. 580-581

8 Appendix

Calculation of the optimum ternary coding

Given the n-bit multiplier a, the optimum decomposition b - c can be calculated as follows:

    fmin := αw(a)
    cmin := 0
    b := a
    c := 0
    while (b < 2^n - 1 .and. fmin > α + β) do
        b := b + 1
        c := c + 1
        if ((b .and. c) = 0) do
            f := αw(b) + βw(c)
            if (f < fmin) do
                fmin := f
                cmin := c
            endif
        endif
    endwhile
    b := a + cmin
    c := cmin
    print fmin, b, c

The first condition of the 'do-while' is to keep b from overflowing the n-bit wordlength, and the second condition is to terminate the iteration if integers b and c both have unity Hamming weight (for which no further improvement is possible). For the optimum b and c, the nonzero bits of each must be in different locations (otherwise this would imply an addition and subtraction of the same partial product). Hence, any pairs for which (b .and. c) is nonzero are skipped by the first 'if' statement.

The above algorithm exhaustively tests all possible decompositions for minimum f. For the special case that α and β are equal (that is, the execution times for adding and subtracting a partial product are equal) it has been shown that the optimum decomposition is unique, and an explicit expression has been derived for it (Reference 8). Moreover, the explicit expression applies for signed (2's complement) integers, and not only for positive integers.

Denoting the coefficients of the n-bit multiplier a by a_i, and the coefficients of the ternary decomposition by t_i, so that (for n = 8)

$$a = -a_7 2^7 + \sum_{i=0}^{6} a_i 2^i = \sum_{i=0}^{7} t_i 2^i$$

the unique set of coefficients t_i which minimise

$$r = \sum_{i=0}^{7} |t_i|$$

may be calculated as follows:

    Define a_(-1) = q_(-1) = 0 and a_8 = a_7
    i := 0
    while (i < 8) do
        p := a_i ⊕ a_(i-1)
        q_i := (1 ⊕ q_(i-1)) p
        t_i := (1 - 2a_(i+1)) q_i
        i := i + 1
    endwhile

A property of this optimum decomposition is that an adjacent pair of coefficients t_i and t_(i-1) are never both nonzero, which is useful if the processor has a 2-place shift instruction.
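The explicit expression for the equal-cost case can be evaluated directly from the bits of the multiplier. The following is a minimal sketch in Python, assuming the recurrence p = a_i XOR a_(i-1), q_i = (1 XOR q_(i-1)) p, t_i = (1 - 2a_(i+1)) q_i with a_(-1) = q_(-1) = 0 and a_8 = a_7; the function name is hypothetical.

```python
def ternary_digits(a: int) -> list:
    """Digits t[0..7], each -1, 0 or +1, for an 8-bit 2's complement multiplier a."""
    bits = [(a >> i) & 1 for i in range(8)]
    bits.append(bits[7])                    # a_8 = a_7 (sign extension)
    digits, a_prev, q_prev = [], 0, 0       # a_-1 = q_-1 = 0
    for i in range(8):
        p = bits[i] ^ a_prev
        q = (1 ^ q_prev) & p
        digits.append((1 - 2 * bits[i + 1]) * q)
        a_prev, q_prev = bits[i], q
    return digits

print(ternary_digits(0b00111110))           # least significant digit first
```

For the multiplier 62 this gives +1 at bit 6 and -1 at bit 1, i.e. 64 - 2, the same decomposition as the example in Section 2.7.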

