
Journal of VLSI Signal Processing, 10, 41-52 (1995) © 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

High Speed Merged Array Multiplication

FARHAD FUAD ISLAM AND KEIKICHI TAMARU
Department of Electronics, Kyoto University, Sakyo-Ku, Kyoto 606-01, Japan

Received October 30, 1992; Revised February 22, 1993

Abstract. Multiplication-accumulation operations described by $\sum_{k=0}^{m-1} A_k B_k$ represent the fundamental computation involved in many digital signal processing algorithms. For high speed signal processing, one obvious approach to realizing this computation in VLSI is to employ m discrete multipliers working in parallel. A more area-efficient approach is offered by the merged multiplication technique [5], but the principal drawback of the conventional merged technique is its longer latency compared with the discrete approach. This work proposes a hardware algorithm for merged array multiplication which eliminates this drawback and achieves a significant improvement in latency over the conventional scheme for merged multiplication. The proposed algorithm uses multiple wave front computation, as opposed to the traditional approach in which computation in an array multiplier is carried out by a single wave front. The improvement in latency by the proposed approach is greater than 40% (for m > 2) when compared with a conventional approach to merged multiplication. The consequent cost in additional VLSI area is found to be rather small. In this paper, we provide a thorough analytic discussion of the proposed algorithm and support it with experimental results.

1 Introduction

For signal processing systems, the algorithms to be implemented, such as recursive filters, finite impulse response filters, correlation computation, Fourier transform, etc., require varying sequences of multiplication, addition and subtraction. One traditional approach to the design of digital signal processing systems and other special purpose processors is to implement discrete arithmetic functions (such as adders, subtractors, multipliers and dividers). These arithmetic functional elements are then interconnected to realize the desired algorithms [1]-[4]. Another approach, called the merged arithmetic scheme and introduced by Earl E. Swartzlander in [5], dissolves the boundaries separating the discrete arithmetic elements. The merged approach involves synthesizing a composite arithmetic function directly instead of decomposing the function into discrete multiplication and addition operations. The aim is to realize an arithmetically equivalent but more VLSI area efficient merged system. In the particular case of merged multiplication, the major advantage obtained is the reduction in VLSI area at an equivalent throughput rate. However, one principal drawback of the merged approach is its increased latency compared to a discrete multiplication scheme in which an integer number (=m) of discrete multipliers work in parallel. The reason for this demerit lies in the increase in the number of partial products which have to be added by a merged multiplier.

Over the years, many improvements have been made to the architecture of the discrete multipliers mentioned above; these fall into two general categories: "tree" multipliers and "array" multipliers. Tree multipliers [6] add as many partial products in parallel as possible and therefore are very high speed architectures, at least theoretically. Swartzlander [5] discussed merged multiplication while primarily considering the tree type of multiplier. Unfortunately, tree multipliers are very irregular [7], hard to lay out and hence occupy a large VLSI area. Moreover, the signal propagation delay introduced by the significant wiring capacitances of tree multipliers sometimes overshadows their expected theoretical high performance. On the other hand, array multipliers are very regular [8] and small in size, but have the disadvantage of longer latency when compared with the tree structure [9]. Thus, in spite of their relatively longer delays, the excellent repeatability of unit computation cells makes array multipliers very attractive for VLSI realizations. This outlook makes it necessary to investigate thoroughly the merged multiplication by array multipliers.

A merged array multiplier that deals with complex multiplications has already been discussed in [10]. A unique interconnection scheme for computation cells (full adders) was proposed in that work, which significantly reduced latency relative to the conventional merged scheme. Now, in a conventional array multiplier (discrete or merged), the computation proceeds from one row of unit computation cells (full adders) to a subsequent row in the form of a single travelling wave front (WF). Though it was not explicitly mentioned in [10], we may interpret the computation in its merged array as being carried out by two distinct computation WFs. This particular interpretation inspired us to investigate the possibility of generating multiple computation WFs during merged array multiplication. In this work, we propose a hardware algorithm for computation by multiple WFs in merged array multipliers. The resulting major contribution is that our hardware algorithm offers faster computation (i.e., shorter latency) than the conventional merged array multiplier and suffers no deterioration in latency when compared to a discrete multiplication scheme. The improvement in the former case is greater than 40% when m > 2, and the inherent advantage of a merged concept promises an attractive reduction in area when compared with the latter (discrete) scheme. This result is particularly attractive for "real world" systems which employ a large number of discrete multipliers to achieve high throughput computation. A 3 x 3 two-dimensional image convolver (filter) chip [12] can be cited as one example in this regard, where 9 discrete multipliers (i.e., m = 9) work in parallel. Another point to note is that our proposed algorithm maintains high regularity in the placement of unit computation cells and intercell connections for efficient VLSI realization.

In this paper we thoroughly investigate the area-time complexity of the proposed algorithm, taking into consideration such variables as the bit-size of operands (=n) and the number of discrete multipliers (=m) that are merged. The rest of this paper is organized as follows: In Section 2 we review some conventional schemes for array multiplication. Our algorithm is proposed in the next section. In Section 4 a comparison among the various approaches is discussed analytically. Some experimental results are then provided in Section 5. Finally, a conclusion is drawn in the last section.

2 Conventional Schemes for Array Multiplication

To compute an output expressed by (1), the conventional schemes which employ array multipliers may be divided into two categories:

1. Conventional discrete array multiplication (CDAM)

2. Conventional merged array multiplication (CMAM)

$$\text{output} = \sum_{k=0}^{m-1} A_k B_k \qquad (1)$$

where $|A_k| = \sum_{j=0}^{n-1} 2^j a_{kj}$, with $a_{kj} = 0, 1$, and $|B_k| = \sum_{i=0}^{n-1} 2^i b_{ki}$, with $b_{ki} = 0, 1$.
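As a quick check of the notation in (1), the following Python sketch rebuilds the multiply-accumulate output from the individual partial product bits $a_{kj} b_{ki}$, each weighted by $2^{i+j}$. The function names and the sample operands are illustrative assumptions, not part of the paper.

```python
def bits(value, n):
    """Return the n low-order bits of value, least significant first."""
    return [(value >> j) & 1 for j in range(n)]


def mac_from_bits(a_words, b_words, n):
    """Evaluate (1) by summing every weighted partial product bit a_kj * b_ki."""
    total = 0
    for a_k, b_k in zip(a_words, b_words):
        a_bits, b_bits = bits(a_k, n), bits(b_k, n)
        for i in range(n):          # bit index of the multiplier B_k
            for j in range(n):      # bit index of the multiplicand A_k
                total += (a_bits[j] & b_bits[i]) << (i + j)
    return total


# Hypothetical operands: m = 3 unsigned words, n = 4 bits each.
A = [13, 7, 9]
B = [11, 5, 15]
print(mac_from_bits(A, B, 4), sum(a * b for a, b in zip(A, B)))
```

Both printed values should agree, since the double sum over i and j is simply the bit-level expansion of each product $A_k B_k$.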

A Conventional Discrete Array Multiplication

In this case, each n-bit multiplication is performed by a single n-bit array multiplier; m such multipliers work in parallel. Numerous excellent texts (e.g., [11]) exist on such discrete array multipliers, so we will not elaborate on them. Briefly speaking, such an n-bit unsigned array multiplier consists of $(n-1)^2$ full adders (FAs) arranged in an array to add n-bit wide partial product words (PPWs) in a carry save fashion. The FA array is followed by an (n - 1)-bit fast adder [typically a carry look-ahead adder (CLA)]. Computation inside the FA array propagates as a single wave front, starting at the top row of FAs and proceeding towards the CLA.

The m discrete multipliers are followed by an m-input FA array with each input being 2n bits wide. This adder array reduces the m numbers to be added into two numbers, which are ultimately added by a $\lceil \log_2\{m(2^n - 1)^2\} \rceil$-bit fast adder. This final fast adder is also typically a CLA. Figure 1 illustrates the CDAM scheme. Referring to this figure, we can easily find the expression for latency ($=L_{CDAM}$) in terms of n and m.

$$L_{CDAM} = \text{Latency of a discrete array multiplier} + \text{Latency of m-input FA array} + \text{Latency of final 2n-bit CLA}$$

$$= T_{AND} + (n-1)T_{FA} + T_{CLA}(n-1) + (m-2)T_{FA} + T_{CLA}(x) \qquad (2)$$

Here

$T_{AND}$ = latency of a 2-input AND gate

$T_{FA}$ = latency of a 1-bit FA

$T_{CLA}(q)$ = latency of a q-bit CLA, where n < q < x

and $x = \lceil \log_2\{m(2^n - 1)^2\} \rceil$

As mentioned in [11], on the basis of a first approximation:

$$T_{AND} \approx \tfrac{2}{3} T_{FA} \qquad (3)$$

and when $n \ge 8$,

$$T_{CLA}(q) \approx 4 T_{FA} \qquad (4)$$

Substituting (3) and (4) in (2) we get:

$$L_{CDAM} \approx (n + m + 5.67) T_{FA} \qquad (5)$$

Note that, while deriving (5), we have neglected the latency introduced by wiring interconnections.

Fig. 1. Conventional discrete array multiplication (CDAM) scheme. (Multipliers #0 to #m-1, each with a 2n-bit output, feed an FA array followed by a CLA that produces the output.)
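The substitution leading from (2) to (5) can be reproduced numerically. The sketch below is a minimal Python illustration under the paper's first-order approximations (3) and (4); the function names are assumptions, and exact fractions are used so that the constant 5.67 appears as 17/3.

```python
from fractions import Fraction

T_FA = Fraction(1)          # latency unit: one full adder
T_AND = Fraction(2, 3)      # approximation (3)
T_CLA = Fraction(4)         # approximation (4), stated for n >= 8


def output_width(n, m):
    """x = ceil(log2(m * (2**n - 1)**2)), computed with exact integer arithmetic."""
    bound = m * (2**n - 1) ** 2
    return (bound - 1).bit_length()


def latency_cdam(n, m):
    """Equation (2) with (3) and (4) substituted, in units of T_FA."""
    multiplier = T_AND + (n - 1) * T_FA + T_CLA   # one discrete array multiplier
    adder_array = (m - 2) * T_FA                  # m-input carry-save FA array
    return multiplier + adder_array + T_CLA       # plus the final CLA


print(output_width(8, 2), float(latency_cdam(8, 2)))
```

For n = 8 and m = 2 this gives x = 17 and a latency of 47/3, i.e. about 15.67 FA delays, which matches (5).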

B Conventional Merged Array Multiplication

The basic concept behind conventional merged multiplication has already been discussed in [5]. So, here we will confine our discussion mainly to the array multiplier that is constructed using this concept. Such a multiplier (see Fig. 2) employs a single array of (nm - 1) rows of FAs to add up all the mn partial products involved in the m n-bit multiplications of the CDAM scheme. The FA array is followed by a single CLA. To specify the partial products in the CDAM scheme in a convenient manner, let us denote by $PPW_i$ the i-th partial product word (n bits wide) of a discrete multiplication $A_k \times B_k$, where:

$A_k$ ($B_k$) = n-bit multiplicand (multiplier number)

and

$PPW_i = A_k \cdot b_{ki}$, $0 \le i \le n-1$

The mn PPWs involved in m n-bit discrete multiplications may be distinguished by n groups with m $PPW_i$ words in each group. We denote such groups by $G(PPW_i)$, where $0 \le i \le n-1$. Now, in the conventional merged array multiplier, the m $PPW_i$ words in $G(PPW_i)$ are added in a carry save fashion by m successive rows of FAs. The computation propagates as a single WF beginning with $G(PPW_0)$ and ending after $G(PPW_{n-1})$. The resulting partial sum and partial carry are ultimately fed to an (x - n)-bit CLA located after the merged array.

Fig. 2. Conventional merged array multiplication (CMAM) scheme. (Groups $G(PPW_0)$, $G(PPW_1)$, ..., $G(PPW_{n-1})$ stacked in a single FA array, followed by a CLA that produces the x-bit output.)
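The single wave front accumulation described above can be sketched at word level. The Python model below is an illustrative assumption rather than the paper's circuit: each carry-save row is modelled as a 3:2 compression on whole integers, and the final integer addition stands in for the CLA.

```python
def carry_save_row(x, y, z):
    """One row of full adders: compress three words into a sum word and a carry word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def cmam(a_words, b_words, n):
    """Accumulate all m*n partial product words through a single chain of FA rows."""
    ppws = []
    for i in range(n):                           # group G(PPW_i)
        for a_k, b_k in zip(a_words, b_words):   # m words per group
            ppws.append((a_k if (b_k >> i) & 1 else 0) << i)
    partial_sum, partial_carry = ppws[0], 0
    rows = 0
    for word in ppws[1:]:
        partial_sum, partial_carry = carry_save_row(partial_sum, partial_carry, word)
        rows += 1
    return partial_sum + partial_carry, rows     # the final addition stands in for the CLA


result, rows = cmam([13, 7, 9], [11, 5, 15], 4)
print(result, rows)
```

With m = 3 and n = 4 the chain uses nm - 1 = 11 rows, which is why the latency of this scheme grows with the product nm.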

Now, as before, neglecting latency introduced by wiring interconnection, the latency of the conventional merged array multiplication ($=L_{CMAM}$) may be determined as follows:

$$L_{CMAM} = \text{Latency of FA array} + \text{Latency of CLA}$$

$$= T_{AND} + (nm - 1)T_{FA} + T_{CLA}(x-n) \qquad (6)$$

Using (3) and (4) in (6) we have:

$$L_{CMAM} \approx (nm + 3.67) T_{FA} \qquad (7)$$

Fig. 3. A unit computation block (UCB) of proposed algorithm. Key: S: Sum output; C: Carry output; P: Partial product bit.

Fig. 4. (i, i + j) coordinate system. (Axes: i + j from 0 to 2n - 2, and i from 0 to n - 1; each position holds UCB(i, i + j).)

3 Proposed Hardware Algorithm

In our proposed hardware algorithm, the mn partial product words related to m n-bit discrete multiplications are added by an array of n × n unit computation blocks (UCBs). The UCB array is followed by a carry save full adder (CSFA) array. This CSFA array receives outputs from the UCB array and reduces them into two numbers. These are then added by a CLA to produce the final result.

A UCB Array

Each partial product group $G(PPW_i)$ mentioned previously while describing CMAM has one row of n UCBs assigned to it in the proposed approach. Each UCB (see Fig. 3) consists of m FAs and can absorb m partial product bits. Thus the mn partial product bits belonging to $G(PPW_i)$ of the CMAM scheme are absorbed by the n × m FAs of the n UCBs located in one row of our proposed algorithm. We now assume the (i, i + j) coordinate system shown in Fig. 4 to specify the UCBs in a convenient way. According to this coordinate system, the (i, j)-th partial product bit ($a_{kj} \cdot b_{ki}$) of the n-bit discrete multiplication $A_k \times B_k$ is computed with UCB(i, i + j), where $A_k$ and $B_k$ are expressed by (1).
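The coordinate assignment can be made concrete with a small Python sketch. The dictionary layout and function names below are illustrative assumptions; the mapping itself (bit $a_{kj} b_{ki}$ goes to UCB(i, i + j), whose column index equals the binary weight of the bit) follows the text.

```python
def ucb_assignment(a_words, b_words, n):
    """Map each coordinate (i, i + j) to the m partial product bits its UCB absorbs."""
    ucbs = {}
    for i in range(n):
        for j in range(n):
            ucbs[(i, i + j)] = [((a >> j) & 1) & ((b >> i) & 1)
                                for a, b in zip(a_words, b_words)]
    return ucbs


def weighted_total(ucbs):
    """Sum all absorbed bits, weighting a UCB in column c by 2**c."""
    return sum(sum(bits) << column for (_, column), bits in ucbs.items())


ucbs = ucb_assignment([13, 7, 9], [11, 5, 15], 4)
print(len(ucbs), max(c for _, c in ucbs), weighted_total(ucbs))
```

The array holds n × n UCBs of m bits each, the columns run from 0 to 2n - 2, and the weighted total equals the output of (1).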

The intercell connection of UCB(i, i + j) with its surrounding UCBs is shown in Fig. 5(a). We begin the description with the intra-cell connection pattern of a UCB. From Fig. 3 it can be seen that within a UCB, each FA (except one) receives a 1-bit sum output from its previous FA. The lone FA not receiving a sum output from a previous FA may be termed the "first" FA because of its location in the FA chain inside the UCB. Consequently, if we denote this first FA of UCB(i, i + j) as FA(i, i + j, 0), then the last FA can be expressed by FA(i, i + j, m - 1) and any intermediate FA in UCB(i, i + j) can be expressed by FA(i, i + j, v), where $1 \le v \le m-2$.

Fig. 5a. Intercell connection pattern in UCB array. Key: circle: 1-bit full adder. (Neighbouring blocks shown: UCB(i-1, i+j+1), UCB(i-1, i+j), UCB(i-1, i+j-1).)

Fig. 5b. Outline of architecture of proposed scheme. (UCB array of n rows with n UCBs per row, followed by the CSFA array and a CLA that produces the x-bit output.)

The assignment of inputs to the FAs and the intercell connection pattern of UCBs are now described below [see Fig. 5(a)]:

1) First FA of UCB(i, i + j): FA(i, i + j, 0) has two of its inputs assigned to absorb two partial product bits. Its third input receives a carry output from FA(i - 1, i + j - 1, 0).

2) An intermediate FA of UCB(i, i + j): FA(i, i + j, v) has one input assigned to receive a sum output from FA(i, i + j, v - 1). Another input receives a partial product bit. The last (third) input receives a carry output bit from FA(i - 1, i + j - 1, v).

3) Last FA of UCB(i, i + j): FA(i, i + j, m - 1) has one input assigned to receive a sum output from FA(i, i + j, m - 2). Another input receives a carry output from FA(i - 1, i + j - 1, m - 1). The last input receives a sum output from FA(i - 1, i + j, m - 1).
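The three connection rules can be exercised with a bit-level Python model of the UCB array. This is a sketch under stated assumptions, not the authors' netlist: inputs that would come from outside the array are tied to 0, the low-order result bits are read from the sum output of FA(i, i, m - 1), and the final loops simply total the weighted outputs that the CSFA array and CLA would add. All names are hypothetical.

```python
def full_adder(x, y, z):
    """Return (sum, carry) of three input bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def ucb_array(a_words, b_words, n):
    """Bit-level model of the n x n UCB array; returns the value held in its outputs."""
    m = len(a_words)
    assert m >= 2, "the three FA roles need at least two FAs per UCB"
    sums, carries = {}, {}                      # keyed by (i, column, v)
    for i in range(n):
        for j in range(n):
            c = i + j
            pp = [((a >> j) & 1) & ((b >> i) & 1) for a, b in zip(a_words, b_words)]
            for v in range(m):
                carry_in = carries.get((i - 1, c - 1, v), 0)   # assumed 0 outside the array
                if v == 0:                       # rule 1: two partial product bits
                    ins = (pp[0], pp[1], carry_in)
                elif v < m - 1:                  # rule 2: previous sum, one bit, carry
                    ins = (sums[(i, c, v - 1)], pp[v + 1], carry_in)
                else:                            # rule 3: previous sum, carry, sum from row above
                    ins = (sums[(i, c, v - 1)], carry_in, sums.get((i - 1, c, m - 1), 0))
                sums[(i, c, v)], carries[(i, c, v)] = full_adder(*ins)
    total = 0
    for i in range(n):                           # low-order bits leave at column i of row i
        total += sums[(i, i, m - 1)] << i
    for c in range(n, 2 * n - 1):                # remaining sums of the last row
        total += sums[(n - 1, c, m - 1)] << c
    for c in range(n - 1, 2 * n - 1):            # every carry of the last row has weight 2**(c+1)
        for v in range(m):
            total += carries[(n - 1, c, v)] << (c + 1)
    return total


print(ucb_array([13, 7, 9], [11, 5, 15], 4))
```

Because every full adder preserves the weighted value of its inputs, and every internal output is consumed at the matching weight, the collected outputs add up to the multiply-accumulate result of (1).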

B Computation by Multiple Wave Fronts

The computation is carried out by generating m computation WFs in the array of the UCBs (see Figs. 4 and 5). The first WF begins with n carry output bits produced by the computation of the n FAs denoted by FA(0, j, 0), where $0 \le j \le n-1$. The n sum-output bits from these FAs are received by the n FAs denoted by FA(0, j, 1). The second WF begins with the n carry output bits from FA(0, j, 1). In this way, m - 1 WFs representing carry output bit propagations are generated one after another. The last WF, however, represents carry as well as sum output bits. This last (m-th) WF is generated after computation by the last FAs in UCB(0, j), denoted by FA(0, j, m - 1). Thus the m-th WF is generated after a latency of $m \times T_{FA}$ time units. This WF then requires $(n-1) \times T_{FA}$ time units to reach the outputs of UCB(n - 1, n - 1 + j). Thus the latency of the UCB array becomes:

$$L_{UCB} = m T_{FA} + (n-1) T_{FA} = (m + n - 1) T_{FA}$$

Fig. 5c. A cell of CSFA array and its interconnection pattern with UCB array.
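The expression for $L_{UCB}$ can be cross-checked with a simple arrival-time recurrence. The Python sketch below assumes that all partial product bits are ready at t = 0 and that each FA adds exactly one $T_{FA}$ to its latest input; the function name is illustrative.

```python
def ucb_array_latency(n, m):
    """Longest path through the UCB array in units of T_FA, partial products ready at t = 0."""
    done = {}                                    # completion time of FA(i, ., v); same for every column
    for i in range(n):
        for v in range(m):
            same_ucb = done.get((i, v - 1), 0)   # sum from the previous FA in the chain
            row_above = done.get((i - 1, v), 0)  # carry (and, for the last FA, sum) from row i - 1
            done[(i, v)] = max(same_ucb, row_above) + 1
    return done[(n - 1, m - 1)]


for n, m in [(8, 2), (16, 3), (32, 9)]:
    print(n, m, ucb_array_latency(n, m), m + n - 1)
```

Under these assumptions FA(i, ., v) completes at time i + v + 1, so the last FA of the last row finishes at m + n - 1 FA delays, in agreement with the formula.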

C CSFA Array and CLA

Each WF, except the last (m-th) one, represents an n-bit binary number. However, the last WF consists of sum and carry bits; therefore, it is equivalent to two n-bit binary numbers. Since the last row of the UCB array generates m WFs, there are (m + 1) binary numbers (each n bits wide) which have to be added to obtain the final result. We may employ an array of n × (m - 1) FAs to add these (m + 1) n-bit numbers in a carry save fashion. This array of FAs has been called the carry save full adder (CSFA) array. The two output numbers produced by the CSFA array are then added by an (x - n)-bit CLA to produce the final result. Now, the CSFA array may be constructed from n cells such that each cell consists of (m - 1) FAs and adds (m + 1) bits which belong to the same bit-position. It should be noted that, among these (m + 1) bits, m bits are carry outputs from the m FAs of UCB(n - 1, n - 1 + j), whereas the remaining single bit is obtained from the sum output of the FA denoted by FA(n - 1, n + j, m - 1). Obviously, FA(n - 1, n + j, m - 1) belongs to UCB(n - 1, n + j), which is adjacent to UCB(n - 1, n - 1 + j). An example of the connection pattern of a CSFA cell with the UCB array is shown for j = 0 in Fig. 5(c).

It is interesting to note that, in order to activate the CSFA array located after the UCB, we do not have to wait for the last WF to reach it. This CSFA array starts addition as soon as the first three WFs are present. Successive WFs are added by successive rows of FAs in the array. A close inspection reveals that for m > 2, the output from the CSFA array will be available just two FA latencies after the m-th WF has reached it. Thus the CSFA array in effect contributes only two FA latencies to the overall latency of our proposed scheme. Thus the total latency of our proposed approach may be expressed as:
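The claim that the CSFA array adds only two FA latencies can be illustrated with a row-level schedule. The Python sketch below is a simplified model under two assumptions taken from the description above: successive WFs arrive one $T_{FA}$ apart, and each carry-save row costs one $T_{FA}$. The function name is hypothetical.

```python
def csfa_extra_latency(m):
    """FA latencies between arrival of the m-th wave front and the CSFA output."""
    # Arrival times relative to the first wave front: WF v arrives at time v,
    # and the last wave front delivers two numbers (its sum and its carry).
    arrivals = list(range(m)) + [m - 1]
    ready = max(arrivals[:3]) + 1            # first FA row compresses three numbers
    for t in arrivals[3:]:                   # each later row absorbs one more number
        ready = max(ready, t) + 1
    return ready - (m - 1)


print([csfa_extra_latency(m) for m in range(2, 10)])
```

In this model the result is 2 for every m > 2, as stated in the text, while for m = 2 the single row finishes one FA latency after the last WF arrives.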

$$L_{proposed} = \text{Latency of \{AND gate + UCB array + CSFA array + CLA\}}$$

$$= T_{AND} + (m + n - 1)T_{FA} + 2T_{FA} + T_{CLA}(x-n) \qquad (8)$$

Now, using the previous first approximations of (3) and (4) in (8) we have:

$$L_{proposed} \approx (n + m + 5.67) T_{FA} \qquad (9)$$
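Equations (6) and (8) can be evaluated side by side. The Python sketch below is a minimal illustration using the approximations (3) and (4); the function names and the sample (n, m) pairs are assumptions.

```python
from fractions import Fraction

T_AND, T_CLA = Fraction(2, 3), Fraction(4)   # approximations (3) and (4), in units of T_FA


def latency_cmam(n, m):
    """Equation (6): AND gate, n*m - 1 FA rows, CLA."""
    return T_AND + (n * m - 1) + T_CLA


def latency_proposed(n, m):
    """Equation (8): AND gate, UCB array, two CSFA latencies, CLA."""
    return T_AND + (m + n - 1) + 2 + T_CLA


for n, m in [(8, 2), (16, 3), (32, 9)]:
    print(n, m, float(latency_cmam(n, m)), float(latency_proposed(n, m)))
```

The CMAM latency grows with the product nm, whereas the proposed latency grows only with the sum n + m, which is the source of the improvement analysed in the next section.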

4 Analytic Comparison of Different Approaches

The different approaches to array multiplication schemes discussed in previous sections will now be compared with each other. The basis of comparison will be computation latency and VLSI area.

Fig. 6. %Improvement in latency by proposed approach over conventional merged array multiplication (CMAM) scheme. (Horizontal axis: bit size n; one curve for each of m = 2, 3, 4, 5, 6.)

A Comparison of Latency

This comparison is divided into two categories: 1) Comparison between proposed approach and CMAM and 2) Comparison between proposed approach and CDAM.

1) Proposed Approach vs. CMAM. Considering (7) and (9) it is obvious that $L_{proposed} < L_{CMAM}$ when n > 8 and m > 2. Thus we may determine the %improvement in latency ($=I_L$) by the proposed scheme over CMAM as follows:

$$I_L = \frac{L_{CMAM} - L_{proposed}}{L_{CMAM}} \times 100\% \qquad (10)$$

Substituting the analytic expressions for $L_{CMAM}$ and $L_{proposed}$ from (7) and (9) in (10) we have the analytic improvement in latency as:

$$I_{L,a} \approx \frac{nm - n - m - 2}{nm + 3.67} \times 100\% \qquad (11)$$

The plot of $I_{L,a}$ vs. n is given in Fig. 6 for different values of constant m. Now, for a particular value of m, $I_{L,a}$ is seen to gradually attain higher magnitudes with increasing bit-size (=n). The approximate asymptotic (i.e., when $n \to \infty$) value of $I_{L,a}$ for a particular m can also be derived from (11) as follows:

$$I_{L,a} = \frac{m - 1 - (m/n) - (2/n)}{m + (3.67/n)} \times 100\% \approx \left(1 - \frac{1}{m}\right) \times 100\%, \quad \text{when } n \to \infty \qquad (12)$$

For example, using (12), the maximum values of $I_{L,a}$ for m = 2, 3, and 4 can be calculated as 50%, 66.7%, and 75% respectively. It can also be inferred from (12) that the maximum improvement in latency cannot reach 100% even when the degree of merging (=m) is very large.
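The values quoted from (11) and (12) are easy to reproduce. The following Python sketch is only a numerical illustration; the function names are assumptions.

```python
def improvement_analytic(n, m):
    """Equation (11): % improvement in latency over CMAM."""
    return 100.0 * (n * m - n - m - 2) / (n * m + 3.67)


def improvement_limit(m):
    """Equation (12): limiting value as n grows without bound."""
    return 100.0 * (1 - 1 / m)


for m in (2, 3, 4):
    print(m, round(improvement_analytic(8, m), 2), round(improvement_limit(m), 1))
```

For m = 2, 3 and 4 the limits are 50%, 66.7% and 75%, and for a fixed m the value from (11) approaches its limit from below as n increases.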

2) Proposed Approach vs. CDAM. A comparison between (5) and (9) reveals that our proposed scheme suffers no deterioration in latency when compared with a CDAM scheme. In fact, due to simpler wiring interconnections, some improvement in latency is expected in the proposed approach.

B Comparison of Area

An analytic comparison of areas between the proposed scheme and the CMAM scheme is feasible because: 1) both have a compact array of computation elements; and 2) the compact array is followed by a single CLA in both cases.

Fig. 7. Rectangular FA area of (a) CMAM scheme. (b) Proposed scheme. Key: C: Carry output; S: Sum output.

However, such a comparison seems to be impractical when the CDAM scheme is included, because its architecture is quite different from the two merged approaches. On the other hand, it has already been established by [5] that CMAM has a much smaller area than CDAM. So, here we will confine our comparison to the CMAM and the proposed schemes. In the light of Sections 2.B and 3, and also by inspection of Figs. 2 and 5(b), it is apparent that an FA array responsible for receiving partial product bits, and a CLA to produce the final result, are common to both the CMAM and our proposed approach. However, the CSFA array described in Section 3.C incurs some additional VLSI area in the proposed approach. This introduces some deterioration (i.e., increase) in area when compared with the CMAM scheme. Since a CLA is present in both cases, we will derive an analytic expression for the %deterioration in area while considering the area of the FA array only.

Now, in an actual VLSI layout, an FA array is usually transformed into a rectangular block. Rectangular blocks representing the FA arrays of the CMAM and our proposed scheme are shown in Fig. 7. As shown in this figure, w and h are used to denote the width and height of an FA cell (including intercell connections). We also denote by t the width of a wire including inter-wire spacing. The height of the CSFA array shown in Fig. 5(b) is therefore represented by (h + mt) in Fig. 7(b). We now deduce expressions for the FA areas of the two approaches under consideration.

$$A_{CMAM} = nw \times (mn - 1)h \qquad (13)$$

and

$$A_{proposed} = \text{area of UCB array} + \text{area of CSFA array}$$

$$= mnw \times nh + mnw \times (h + mt)$$

$$= mnw(nh + h + mt) \qquad (14)$$

We define %deterioration in area ($=D_A$) as:

$$D_A = \frac{A_{proposed} - A_{CMAM}}{A_{CMAM}} \times 100\% \qquad (15)$$

The analytic %deterioration in area ($=D_{A,a}$) is obtained by substituting (13) and (14) in (15) as:

$$D_{A,a} = \left\{ \frac{m+1}{mn-1} + \frac{t}{h} \times \frac{m^2}{mn-1} \right\} \times 100\% \qquad (16)$$

Due to practical limitations on the maximum allowable chip area, for many real world digital signal processors, m < n. Therefore, $m^2 < mn - 1$ and $\frac{m^2}{mn-1} < 1$. Also, it is obvious that the width of a wire (=t) is much smaller than the height of a full adder (=h), i.e., $t \ll h$. Thus, $\frac{t}{h} \times \frac{m^2}{mn-1} \ll 1$ and we may neglect this term from (16) while maintaining reasonable accuracy. (16) is therefore simplified as:

$$D_{A,a} \approx \frac{m+1}{mn-1} \times 100\% \qquad (17)$$

A plot of $D_{A,a}$ is given in Fig. 8, which reveals that the deterioration in area is less for higher values of m. It is interesting to note that this deterioration in area by our proposed approach is rather small, e.g., when m = 2 and n = 16, 32 and 64, $D_{A,a}$ = 9.68%, 4.76% and 2.36%. Also note that for a particular m, $D_{A,a}$ decreases with increasing n. As regards the limiting (i.e., when $n \to \infty$) value of $D_{A,a}$, it can be concluded from (17) that $D_{A,a} \to 0$ when $n \to \infty$. Thus the proposed scheme suffers negligible deterioration in area against CMAM for large bit-size operands.

Fig. 8. %Deterioration in area by proposed approach compared to conventional merged array multiplication (CMAM) scheme. (Horizontal axis: bit size n; curves shown for m = 2 and m = 5.)
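The area expressions (13) to (17) can be checked against one another numerically. In the Python sketch below the cell dimensions w, h and the wire pitch t are free parameters (the sample values are assumptions, not measurements from the paper), and the function names are illustrative.

```python
def area_cmam(n, m, w, h):
    """Equation (13): FA-array area of the CMAM scheme."""
    return n * w * (m * n - 1) * h


def area_proposed(n, m, w, h, t):
    """Equation (14): UCB array plus CSFA array."""
    return m * n * w * (n * h + h + m * t)


def deterioration_exact(n, m, t_over_h):
    """Equation (16): % increase in FA-array area, with t/h the wire-to-cell height ratio."""
    return 100.0 * ((m + 1) / (m * n - 1) + t_over_h * m**2 / (m * n - 1))


def deterioration_simplified(n, m):
    """Equation (17): the t/h term neglected."""
    return 100.0 * (m + 1) / (m * n - 1)


for n in (16, 32, 64):
    print(n, round(deterioration_simplified(n, 2), 2))
```

For m = 2 this prints 9.68, 4.76 and 2.36, the values quoted in the text, and the gap between (16) and (17) is the term proportional to t/h.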

5 Experimental Verification

The objective of experimental evaluation was to investigate the possible deviations of (11) and (17) from results obtained by actual (real world) implementations in VLSI.

A Evaluation Procedures

We carried out detailed layouts for the proposed as well as conventional schemes while assuming m = 2 and n = 8. The well known layout tool called CADENCE was used for designing with 1.5 µm double metal CMOS. For determining latency accurately, the circuit simulator called HSPICE [13] was utilized. By considering the detailed layout diagrams, all wiring capacitances were meticulously included during simulation by HSPICE. Note that all three array multipliers under consideration, namely CDAM, CMAM and the proposed one, consist of arrays of FAs and CLAs. To determine the latency of the FA array, we simulated one FA in great detail while taking into account all intra-cell as well as input/output capacitances. The latency of the whole array was then determined by adding FA-cell latencies in its longest path. The latency of wiring between the FA array and the CLA was taken into consideration by including it as a ramp capacitance at the output of the corresponding FA cell. The latency of a CLA was also determined by HSPICE while taking into account all wiring capacitances. Finally, the latency of a particular scheme was determined by adding the latencies of the FA array and that of the CLA.

B Results and Discussion

1) Latency. For the particular case of m = 2 and n = 8 in 1.5 µm double metal CMOS, the latencies of CDAM, CMAM and the proposed approach came out to be:

$$L_{CDAM} = 26.0\ \text{ns} \qquad (18)$$

$$L_{CMAM} = 32.55\ \text{ns} \qquad (19)$$

$$L_{proposed} = 25.55\ \text{ns} \qquad (20)$$

Using the above values in (10), the experimental improvement in latency was determined as:

$$I_{L,x} = 21.51\% \qquad (21)$$

Fig. 9. (a) Section of UCB array. (b) Layout of a UCB cell. Key: AND: Logic AND gate; FA: 1-b full adder; sout: sum output; cout: carry output; sin: sum input; cin: carry input; PP: Partial product.

By substituting n = 8 and m = 2 in (11), the corresponding analytic value may be computed as:

$$I_{L,a} = 20.34\% \qquad (22)$$

(21) and (22) are fairly close to each other. This strongly supports the validity of (11) and the plots of Fig. 6 which were based on it. Another interesting fact can be observed while considering (18) and (20). Although (5) and (9) show equal latencies for the CDAM and the proposed scheme, the CDAM approach is actually seen to have slightly longer latency due to greater interconnection wiring complexity and the corresponding capacitive delay.
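The experimental figure (21) follows directly from the measured latencies (18) to (20). The Python lines below are only an arithmetic check; the variable and function names are assumptions.

```python
L_CDAM, L_CMAM, L_PROPOSED = 26.0, 32.55, 25.55   # ns, from (18)-(20)


def percent_improvement(reference, candidate):
    """Equation (10) applied to two measured latencies."""
    return 100.0 * (reference - candidate) / reference


print(round(percent_improvement(L_CMAM, L_PROPOSED), 2))   # against CMAM
print(round(percent_improvement(L_CDAM, L_PROPOSED), 2))   # against CDAM
```

The first value reproduces (21); the second quantifies the slightly longer latency of CDAM noted above.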

2) Area. The layout scheme for a section of the UCB array is shown in Fig. 9(a). Each UCB was constructed from two FAs (since m = 2). The layout of one such UCB is shown in Fig. 9(b). The area of the 8 × 8 UCB array (since n = 8) was determined as 4.21 mm². The width of the rectangular CSFA array was equal to that of the preceding UCB array. The area of this CSFA array, including wiring inter-connections, was found to be 0.554 mm². Thus the combined layout area of the UCB array and the CSFA array in our proposed approach came out to be $A_{proposed} = (4.21 + 0.554)$ mm² = 4.764 mm². For the CMAM scheme, the area occupied by its FA array was measured from the prepared layout and was found to be $A_{CMAM} = 3.87$ mm². Using the above values of $A_{proposed}$ and $A_{CMAM}$ in (15), the experimental deterioration in area by our proposed approach was determined as:

$D_{A,x} = 23.1\%$ (23)

whereas the corresponding analytic value can be calculated by substituting n = 8 and m = 2 in (17) as:

$D_{A,a} = 20\%$ (24)

There is some discrepancy between (23) and (24). This is primarily due to the simplification of (16) to arrive at (17). However, the discrepancy between (23) and (24) is small enough to support the validity of (17) and the corresponding plots in Fig. 8.
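The arithmetic behind (23) can be checked with a short sketch. Equation (15) is not reproduced in this section, so the function below assumes it has the usual relative-increase form, (A_proposed - A_CMAM) / A_CMAM; this assumed form does reproduce the quoted 23.1%. The variable and function names are illustrative only, and the area figures are the measured values quoted above.

```python
# Measured layout areas quoted in the text, in mm^2.
A_UCB = 4.21      # 8 x 8 UCB array
A_CSFA = 0.554    # CSFA array, including wiring inter-connections
A_CMAM = 3.87     # FA array of the conventional CMAM scheme

# Analytic value quoted in (24), in percent.
D_A_ANALYTIC = 20.0


def area_deterioration(a_proposed, a_cmam):
    """Percentage area increase relative to CMAM.

    Assumption: this relative-increase form stands in for (15),
    which is not shown in this section.
    """
    return 100.0 * (a_proposed - a_cmam) / a_cmam


a_proposed = A_UCB + A_CSFA
d_ax = area_deterioration(a_proposed, A_CMAM)
gap = d_ax - D_A_ANALYTIC  # experimental minus analytic, in percentage points

print(f"A_proposed = {a_proposed:.3f} mm^2")
print(f"D_A,x      = {d_ax:.1f} %")
print(f"gap to (24) = {gap:.1f} percentage points")
```

Under this assumption the experimental figure exceeds the analytic one by about 3 percentage points, which is the discrepancy discussed in the text.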

References

1. B. Gold and C.M. Rader, Digital Processing of Signals, New York: McGraw-Hill, 1969.

2. A.C. Salazar, Digital Signal Computers and Processors, New York: IEEE, 1977.

3. A. Peled and B. Liu, Digital Signal Processing, New York: Wiley, 1976.

4. A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1975.

5. E.E. Swartzlander, Jr., "Merged arithmetic," IEEE Trans. Computers, Vol. C-29, pp. 946-950, 1980.

6. C.S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Computers, Vol. EC-13, pp. 14-17, Feb. 1964.

7. K.F. Pang et al., "Generation of high speed CMOS multiplier-accumulators," Proc. ICCD, pp. 217-220, Oct. 1988.

8. Y. Oowaki et al., "A 7.4 ns CMOS 16 x 16 multiplier," ISSCC Dig. Tech. Papers, pp. 52-53, Feb. 1987.

9. C.C. Stearns, "Yet another multiplier architecture," Proc. CICC, pp. 24.6.1-24.6.4, 1990.

10. H. Shay and D. Gray, "Highly dense VLSI complex array multiplier," Proc. CICC, pp. 249-251, 1983.

11. K. Hwang, Computer Arithmetic, New York: Wiley, 1979.

12. Manual of "High speed two dimensional image convolver: KP5D48908," Kawatetsu, version 1.2.2 (in Japanese).

13. HSPICE user's manual: H9001, Meta-Software, Inc., 1300 White Oaks Road, Campbell, CA 95008, USA.

6 Conclusion

In this paper, a hardware algorithm for merged array multiplication was presented. The proposed approach offers a substantial improvement in latency when compared with a conventional scheme for merged array multiplication. The cost in the form of additional VLSI area is rather small and decreases with increasing bit-size of operands. When compared with a discrete parallel multiplication scheme, the proposed approach maintains the inherent merit of a merged concept, i.e., reduction in VLSI area. But the interesting feature of the proposed scheme is that it does not suffer from the inherent demerit of a merged concept, which is deterioration of latency.

Acknowledgment

Mr. Kenichi Ishii of Tamaru laboratory, Kyoto University, Japan, was responsible for necessary VLSI layouts. The authors would like to thank him for his time and tireless efforts. Thanks are also due to Prof. Hiroto Yasuura of Kyushu University, Japan, for his valuable suggestions.

Farhad Fuad Islam received the B.S. and M.S. degrees, both in Electrical and Electronic Engineering, from Bangladesh University of Engineering and Technology (B.U.E.T.) in 1986 and 1988 respectively. From 1986 until 1989 he served as a lecturer in the department of Electrical and Electronic Engineering of B.U.E.T. For the next four years he worked as a graduate student for his doctorate degree in the department of Electronics, Kyoto University, Japan. After getting his PhD in 1993, Dr. Islam spent two years working on VLSI hardware algorithms as a post doctoral fellow in the Human Interface Research Laboratories of Nippon Telegraph and Telephone (NTT) corporation, Japan. Since April 1, 1995, he has been working in the networked multimedia group of Hewlett-Packard Research Laboratories, Japan. Dr. Islam's research interests include architectures and hardware algorithms for high speed multipliers and digital signal processors, as well as several aspects of current multimedia communications and technology. He is a member of IEEE-USA, IEICE-Japan, IEA-Australia and the New York Academy of Sciences.


Keikichi Tamaru was born in Sendai, Japan, in 1936. He received the B.E., M.E., and Dr. Eng. degrees in electronic engineering from Kyoto University, Kyoto, Japan, in 1958, 1960, and 1970, respectively.

From 1960 to 1979 he was engaged in the development of high-speed logic circuits and microcomputers at Toshiba Research and Development Center, Toshiba Corporation, Kawasaki, Japan. Since 1979 he has been a Professor in the Department of Electronics, Kyoto University. His research interests include computer-aided design for digital and analog integrated circuits and the architecture of special-purpose VLSI processors.

Dr. Tamaru is a member of the Institute of Electrical and Electronics Engineers, the Association for Computing Machinery, the Institute of Electronics, Information and Communication Engineers of Japan, and the Information Processing Society of Japan.

