High-radix digit serial division


A.E. Bashagha, M.K. Ibrahim

Indexing terms: VLSI arithmetic, Digit serial division

Abstract: A new area-time efficient digit serial division algorithm and its architecture are presented. In existing digit serial algorithms based on 2's complement number representation, m cycles are required to generate each quotient bit (m is the number of radix-$2^n$ digits in a word of length N). In the new algorithm, K quotient bits are generated in m+K-1 cycles instead of mK cycles. Performance comparisons show that the new algorithm is up to K times faster and, for small values of K, requires a smaller area than existing digit serial algorithms based on 2's complement number representation. More significantly, a comparison with the conventional binary bit parallel structure shows that the new structure is faster and requires a smaller area.

1 Introduction

In the existing digit serial division algorithms [1-4], each quotient bit is generated in m cycles, where m is the number of radix-$2^n$ digits in a word of length N. Therefore, the latency to calculate N quotient bits is mN cycles, which is rather long. The throughput rate of the digit serial architecture can be increased by pipelining the architecture to the bit level [4]. However, pipelining the architecture further increases the latency and the number of latches (i.e. the area). In [5] the authors proposed a new efficient design for digit serial division. In this design, two steps of the conventional algorithm of [3, 4] are overlapped so that two quotient bits are obtained in m+1 cycles instead of 2m cycles.

In this paper, a generalisation of the algorithm proposed in [5] is presented, where K quotient bits are generated in m+K-1 cycles. The new algorithm is described using a tree representation, where it will be shown that the original digit serial algorithm of [3, 4] is a special case of the new algorithm (when K = 1). The corresponding structure of the basic digit serial cell, which is based on this tree representation, is also presented.

The new structure has been compared with the original digit serial one (i.e. K = 1) and with the binary bit parallel one that is based on 2's complement arithmetic. It has been shown that the new algorithm can be up to K times faster than the original digit serial one. Although the new algorithm requires more digital controlled add/subtract (DCAS) cells than existing designs, it requires fewer latches. The latches constitute a major part of the digit serial architecture, and therefore for small values of K the new structure requires a smaller area. Another significant result is that the new architecture is the first reported digit serial structure which is faster than the conventional binary bit parallel realisation, while also requiring a smaller area.

© IEE, 1996. IEE Proceedings online no. 19960707. Paper first received 1st November 1995 and in revised form 23rd May 1996. The authors are with the Department of Electronic Engineering, De Montfort University, The Gateway, Leicester LE1 9BH, UK

Finally, the new algorithm offers designers flexibility in the selection of the number of levels K, the digit size n, the number of pipelining levels and the style of implementation. The selection of these parameters is influenced by speed, cost and throughput considerations.

2 New digit serial division algorithm

Consider the following division process, Q = A/D, where it is assumed that A, D and Q are in two's complement form, |A| < |D|, and D ≠ 0. In the radix-$2^n$ digit serial division [3, 4], the (2N+1)-bit dividend A, the (N+1)-bit divisor D, and the (N+1)-bit quotient Q are rewritten in a radix-$2^n$ form, such that:

$$A = A_0 A_1 \ldots A_{m-1} 0 0 \ldots 0 \qquad (1)$$

$$D = D_0 D_1 \ldots D_{m-1} \qquad (2)$$

$$Q = Q_0 Q_1 \ldots Q_{m-1} \qquad (3)$$

where each digit $A_l$, $D_l$ and $Q_l$ (for l = 0, ..., m-1) consists of n bits of A, D and Q, respectively.
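The digit decomposition of eqns. 1-3 can be illustrated with a short Python sketch. The function names `to_digits` and `from_digits`, and the treatment of the word as a plain bit pattern whose width is a multiple of n, are assumptions made for illustration only; the sign-bit handling of the (N+1)-bit operands is not modelled.

```python
def to_digits(word, width, n):
    """Split a `width`-bit two's complement word into radix-2**n digits.

    Digit 0 is the most significant digit, matching D = D_0 D_1 ... D_{m-1}.
    """
    if width % n != 0:
        raise ValueError("word width must be a multiple of the digit size n")
    m = width // n                          # number of radix-2**n digits
    pattern = word & ((1 << width) - 1)     # two's complement bit pattern
    mask = (1 << n) - 1
    return [(pattern >> (n * (m - 1 - l))) & mask for l in range(m)]


def from_digits(digits, n):
    """Reassemble the bit pattern from its radix-2**n digits (MSD first)."""
    value = 0
    for digit in digits:
        value = (value << n) | digit
    return value
```

For instance, an 8-bit word with n = 4 yields m = 2 digits, and a digit serial datapath would process these digits one per cycle, starting from the least significant one.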

In the original digit serial division algorithm of [3, 4], m cycles are required to generate each quotient bit, $q_i$, for i = 1, ..., N. The ith quotient bit $q_i$ is used to control the (i+1)th step of the algorithm such that an addition (subtraction) is carried out if $q_i$ = 0 (1). Therefore, the addition/subtraction operation of the (i+1)th step cannot be started until the quotient bit $q_i$ of the ith step is generated. Consequently, all the digits of the ith remainder have to be delayed by m cycles before being processed in the next step. The throughput rate of this algorithm is one quotient bit every m cycles, and the latency is mN cycles. The throughput rate can be increased by pipelining the architecture to the bit level, but the area will increase and the latency will become mnN (i.e. $N^2$) cycles [4].

As will be described below, the new algorithm calculates K quotient bits in m+K-1 cycles rather than mK cycles. First, the basic idea of the new algorithm is presented for K = 2. Then the algorithm is generalised for any value of K, and an example will be given for K = 4.
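The cycle counts quoted above can be compared numerically with a minimal sketch. It assumes that N is divisible by both n and K, that m = N/n, and that the N/K stages of the new algorithm run back to back so that the total latency is (N/K)(m+K-1) cycles; the function names are hypothetical.

```python
def original_latency(N, n):
    """Latency in cycles of the original algorithm: m cycles per quotient bit."""
    m = N // n
    return m * N


def new_latency(N, n, K):
    """Latency in cycles when K quotient bits take m + K - 1 cycles.

    Assumes the N/K groups of K bits are computed one after another.
    """
    m = N // n
    return (N // K) * (m + K - 1)
```

With N = 32 and n = 4 (so m = 8), the original algorithm needs 256 cycles, while the sketch gives 144 cycles for K = 2 and 88 cycles for K = 4. For K = 1 both functions agree, in line with the original algorithm being the K = 1 special case.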

319 IEE Proc.-Circuits Devices Syst., Vol. 143, No. 6, December 1996

2.1 Basic principle of new algorithm

Two's complement division is carried out using either restoring or nonrestoring algorithms [6]. In the nonrestoring algorithm, the value of $q_i$ in the ith step of the binary division is either 0 or 1, and hence the operation in the (i+1)th step is either addition or subtraction, respectively. It should be noted that $q_i$ is equal to 1 (0) if the current remainder $PR_i$ and the divisor D have the same (different) sign. The sign bit of $PR_i$ is simply its most significant bit (MSB), which is obtained after the carry propagates from the least significant bit (LSB).
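The nonrestoring rule described above can be modelled at word level with a short Python sketch. It uses unbounded Python integers rather than fixed-width two's complement words, takes $PR_0$ = A, and treats a zero remainder as positive (sign bit 0); these are modelling assumptions, and the function names are hypothetical. The helper `signed_digit_value` relies on the standard reading of a nonrestoring quotient bit 1 (0) as the signed digit +1 (-1), which is not stated in the text.

```python
def nonrestoring_steps(a, d, N):
    """Run N nonrestoring steps; return (quotient bits, final remainder)."""
    if d == 0 or abs(a) >= abs(d):
        raise ValueError("requires D != 0 and |A| < |D|")
    pr = a
    bits = []
    for _ in range(N):
        # q = 1 if the remainder and the divisor have the same sign, else 0
        q = 1 if (pr < 0) == (d < 0) else 0
        bits.append(q)
        # q = 1 selects subtraction, q = 0 selects addition in the next step
        pr = 2 * pr - d if q else 2 * pr + d
    return bits, pr


def signed_digit_value(bits):
    """Value of the bit string when bit 1 means digit +1 and bit 0 means -1."""
    value = 0
    for q in bits:
        value = 2 * value + (2 * q - 1)
    return value
```

Under these assumptions the steps preserve the identity $2^N A = D \cdot V + PR_N$, where V is the signed-digit value of the generated bits, which gives a simple way to check the model.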

The new algorithm presented here is based on the radix-$2^n$ nonrestoring division algorithm [3]. The basic principle of the new algorithm is described for K = 2 [5]. In the new algorithm, instead of waiting m cycles until $q_i$ is generated, the (i+1)th step can be started using the generated digits of the remainder, $PR_i$, without any additional delay. Since $q_i$ is not known yet, both operations in the (i+1)th step (i.e. addition and subtraction) should be carried out [5].

The addition and subtraction are carried out in parallel, and two possible values of the remainder, $PR_{i+1}$, at the (i+1)th step are generated. Once the quotient bit $q_i$ is obtained, it is used as a control mode to an n-bit multiplexer to select one of the two possible values of the remainder, $PR_{i+1}$, such that $PR_{i+1} = 2PR_i + D$ if $q_i$ = 0 and $PR_{i+1} = 2PR_i - D$ if $q_i$ = 1. Therefore, for K = 2, the ith and the (i+1)th steps are overlapped, and the two steps are carried out in m+1 cycles [5] instead of 2m cycles as in the original algorithm [3, 4].
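The K = 2 overlap can be sketched at word level as follows. Both candidate remainders are formed before $q_i$ is consulted, and $q_i$ then acts as the multiplexer select. This is a behavioural sketch only: it ignores the digit serial timing, uses Python integers, and the function names are hypothetical.

```python
def quotient_bit(pr, d):
    """1 if the remainder and the divisor have the same sign, else 0."""
    return 1 if (pr < 0) == (d < 0) else 0


def sequential_pair(pr_i, d):
    """Reference: compute q_i first, then the single required operation."""
    q_i = quotient_bit(pr_i, d)
    pr_next = 2 * pr_i - d if q_i else 2 * pr_i + d
    return q_i, pr_next, quotient_bit(pr_next, d)


def overlapped_pair(pr_i, d):
    """Compute both candidates for PR_{i+1}, then select with q_i."""
    cand_add = 2 * pr_i + d                  # correct if q_i = 0
    cand_sub = 2 * pr_i - d                  # correct if q_i = 1
    q_if_add = quotient_bit(cand_add, d)     # candidate value of q_{i+1}
    q_if_sub = quotient_bit(cand_sub, d)
    q_i = quotient_bit(pr_i, d)              # in hardware this arrives last
    pr_next = (cand_add, cand_sub)[q_i]      # n-bit 2-to-1 multiplexer
    q_next = (q_if_add, q_if_sub)[q_i]       # 1-bit 2-to-1 multiplexer
    return q_i, pr_next, q_next
```

Both functions return the same triple for every input, which is the property that allows the two steps to be overlapped.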

2.2 Generalised new algorithm

In this Section, the generalisation of the new algorithm is presented such that K steps are overlapped together to generate K quotient bits (i.e. $q_i, \ldots, q_{i+K-1}$) in m+K-1 cycles instead of Km cycles. As explained earlier, N steps are needed to calculate N quotient bits, where one bit is calculated at each step. In the new algorithm, the N steps required to calculate the N bits of the quotient are divided into N/K stages, where each stage consists of K steps. Since one quotient bit is calculated at each step, K bits of the quotient, i.e. one digit, are generated at each stage. To illustrate how the K steps of each stage are overlapped, each stage is represented using a tree as shown in Fig. 1. The tree consists of K levels, where the kth level corresponds to the kth step of the stage, where k = 1, ..., K. If K = 1 then the new algorithm will reduce to that of [3, 4].

The operation of all the stages of the new algorithm is identical. In what follows, we describe the operation for a single stage, which applies to all stages of the algorithm. Assume that the root (first level) of the tree of a particular stage corresponds to the ith step of the algorithm where the ith remainder $PR_i$ is generated. In this stage, the ith to (i+K-1)th steps of the algorithm are computed, and the kth level of the tree of this stage represents the (i+k-1)th step of the algorithm, where $2^{k-1}$ possible values of the (i+k-1)th remainder $PR_{i+k-1}$ are generated. For k = 2, there are two possible values of the remainder $PR_{i+1}$. The selection of the correct value of $PR_{i+1}$ and its quotient bit $q_{i+1}$ is decided by the previous quotient bit $q_i$ of the same stage. For k ≥ 2, there are $2^{k-1}$ possible values of the remainder $PR_{i+k-1}$, where the correct value of $PR_{i+k-1}$ and its quotient bit $q_{i+k-1}$ are decided by the previous quotient bits $q_i, \ldots, q_{i+k-2}$ of the same stage.

We consider the case of K = 4 to describe the operation of a particular stage of the algorithm in more detail. At the first cycle of the stage, the ith step (which corresponds to the root of the tree, i.e. k = 1) of the algorithm is executed, where the least significant digit (LSD) of the remainder, $PR_i$, is generated. The quotient bit of the ith step, $q_i$, is generated at the mth cycle of the stage. Since $q_i$ has two possible values (i.e. either 1 or 0), both addition and subtraction operations should be carried out at the (i+1)th step of the algorithm (i.e. k = 2) to generate two possible values of the next remainder, $PR_{i+1}$.

The LSDs of the two possible values of $PR_{i+1}$ are generated at the second cycle of the stage, and the possible values of the quotient bit $q_{i+1}$ are generated at the (m+1)th cycle. Since the values of $q_i$ and $q_{i+1}$ are not known yet, the addition and subtraction operations should be carried out on each of the two possible values of $PR_{i+1}$ at the (i+2)th step (i.e. k = 3). This step starts at the third cycle of the stage to generate four possible values of $PR_{i+2}$. The possible values of the quotient bit $q_{i+2}$ are generated at the (m+2)th cycle of the stage. Similarly, the (i+3)th step (i.e. k = 4) of the algorithm starts at the fourth cycle to generate eight possible values of $PR_{i+3}$, and m cycles are required to generate all the digits of $PR_{i+3}$. The LSD of $PR_{i+3}$ is generated at the fourth cycle and the most significant digit (MSD) at the (m+3)th cycle. Therefore, the correct remainder $PR_{i+3}$ and its quotient bit $q_{i+3}$ can only be decided at the (m+3)th cycle when the values of $q_i$, $q_{i+1}$ and $q_{i+2}$ of the same stage become available.

Fig. 1 Tree representation of new algorithm
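The cycle-by-cycle description above follows a regular pattern: level k starts at cycle k with $2^{k-1}$ candidate remainders, and its quotient bit becomes available at cycle m+k-1. A minimal sketch that tabulates this schedule is given below; the function name and the dictionary keys are hypothetical.

```python
def stage_schedule(m, K):
    """Per-level timing of one stage, with cycles counted from 1.

    Level k produces the LSD of its candidate remainders at cycle k and
    the MSD (and hence the quotient bit) at cycle m + k - 1.
    """
    schedule = []
    for k in range(1, K + 1):
        schedule.append({
            "level": k,
            "candidates": 2 ** (k - 1),
            "lsd_cycle": k,
            "quotient_bit_cycle": m + k - 1,
        })
    return schedule
```

For m = 8 and K = 4 the last level holds eight candidates, starts at cycle 4 and resolves at cycle 11, i.e. after m+K-1 cycles.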

2.3 Selection of correct remainder

The selection of the correct remainder and its quotient bit for K ≥ 2 will now be described. As an example, K is assumed to be equal to 4. Again the description will be made for a general stage. As explained earlier, at the ith step of the algorithm (i.e. k = 1), the correct value of the quotient bit $q_i$ will be available at the mth cycle of the stage. At the (i+1)th step (i.e. k = 2) this value can be used to control a 2-to-1 1-bit multiplexer to decide the correct value of $q_{i+1}$, which is generated at the (m+1)th cycle. The inputs to this multiplexer are the two possible values of $q_{i+1}$ generated from the two possible values of $PR_{i+1}$. At the (i+2)th step (i.e. k = 3) there are four possible values of $q_{i+2}$, which are generated at the (m+2)th cycle. Therefore, a 4-to-1 1-bit multiplexer is required to select the correct value of $q_{i+2}$ at the (m+2)th cycle, and the control mode for this multiplexer will be the previous quotient bits $q_i$ and $q_{i+1}$, which are generated at the mth and (m+1)th cycles, respectively.

At the (i+3)th step (i.e. k = 4), there are eight possible values of $PR_{i+3}$ and eight possible values of the quotient bit $q_{i+3}$, which are only available at the (m+3)th cycle. In this case, an 8-to-1 1-bit multiplexer is required to select the correct value of $q_{i+3}$ at the (m+3)th cycle. Also, for K = 4, an 8-to-1 n-bit multiplexer is needed to select the correct value of $PR_{i+3}$ at the same cycle. The correct value of $PR_{i+3}$ is then fed to the next stage of the algorithm. The control modes for these multiplexers are the previous quotient bits $q_i$, $q_{i+1}$ and $q_{i+2}$, which are available at the mth, (m+1)th and (m+2)th cycles, respectively.

For general K, the (i+k-1)th quotient bit $q_{i+k-1}$, for k = 1, ..., K, is generated at the (m+k-1)th cycle of the stage, and the correct partial remainder at the output of each stage, $PR_{i+K-1}$, is selected at the (m+K-1)th cycle, which is used as the input partial remainder for the next stage. It is worth noting that we do not need to select the correct partial remainder at the intermediate steps (levels) of each stage, since all that is required in the intermediate steps (levels) of each stage is to select the correct quotient bits only.
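The selection scheme for general K can be sketched as a word-level Python model of one stage. The tree of candidate remainders is built first, and the quotient bits are then resolved level by level, each bit extending the multiplexer address used at the next level. The model takes $PR_i$ as its input, ignores the digit serial timing, and orders the candidates so that the child reached by addition has an even index; these choices and all names are assumptions of the sketch.

```python
def quotient_bit(pr, d):
    """1 if the remainder and the divisor have the same sign, else 0."""
    return 1 if (pr < 0) == (d < 0) else 0


def tree_stage(pr_i, d, K):
    """Return ([q_i, ..., q_{i+K-1}], PR_{i+K-1}) using the candidate tree."""
    levels = [[pr_i]]                      # level k holds 2**(k-1) candidates
    for _ in range(K - 1):
        nxt = []
        for pr in levels[-1]:
            nxt.append(2 * pr + d)         # even index: previous bit was 0
            nxt.append(2 * pr - d)         # odd index: previous bit was 1
        levels.append(nxt)

    bits = []
    address = 0                            # select lines built from earlier bits
    selected = 0
    for candidates in levels:
        selected = address
        q = quotient_bit(candidates[selected], d)   # 2**(k-1)-to-1 1-bit mux
        bits.append(q)
        address = 2 * address + q
    # only the last level needs an n-bit multiplexer for the remainder
    return bits, levels[-1][selected]


def sequential_stage(pr_i, d, K):
    """Reference model: one step at a time, as in the original algorithm."""
    bits = []
    pr = pr_i
    for k in range(K):
        q = quotient_bit(pr, d)
        bits.append(q)
        if k < K - 1:
            pr = 2 * pr - d if q else 2 * pr + d
    return bits, pr
```

The two models agree for every input, and for K = 1 the tree degenerates to a single node, consistent with the original algorithm being the K = 1 case.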

3 New divider architecture

3.1 Basic cell

The basic cell used to implement the new algorithm (for K = 4) is shown in Fig. 2. In this Figure, DCAS is an n-bit digital controlled add/subtract cell [3, 4] (which is either the fed-backward cell (DCASB) or the fed-forward cell (DCASF)). In general, for K ≥ 2, the basic cell requires a tree structure of K levels. For instance, if K = 4 then a tree of four levels is required, as shown in Fig. 2. In the first level, one DCAS cell is required to generate the ith remainder $PR_i$. In the second level, the two possible values of the (i+1)th remainder, $PR_{i+1,1}$ and $PR_{i+1,2}$, are generated using two DCAS cells. A 2-to-1 1-bit multiplexer (which is controlled by $q_i$) is required to generate $q_{i+1}$.

The third and fourth levels consist of four and eight DCAS cells, respectively, to generate the possible values of $PR_{i+2}$ and $PR_{i+3}$ (shown in the third and fourth levels of the tree of Fig. 1). At the third level of the architecture, a 4-to-1 1-bit multiplexer is required to generate $q_{i+2}$, and the control mode of this multiplexer will be $q_i$ and $q_{i+1}$. Similarly, an 8-to-1 1-bit multiplexer is required at the fourth level to generate $q_{i+3}$, which is controlled by $q_i$, $q_{i+1}$ and $q_{i+2}$. An 8-to-1 n-bit multiplexer is required at the fourth level to generate the correct remainder $PR_{i+3}$, and this multiplexer is controlled by $q_i$, $q_{i+1}$ and $q_{i+2}$.

In general, the kth level of each cell, for k = 1, ..., K, consists of $2^{k-1}$ DCAS cells to generate all the possible values of $PR_{i+k-1}$, as well as a $2^{k-1}$-to-1 multiplexer to generate the (i+k-1)th quotient bit, $q_{i+k-1}$. The control mode for the multiplexer at the kth level will be $q_{i+s}$, for s = 0, ..., k-2, generated in the previous levels of the same stage. At the Kth level of the cell, a $2^{K-1}$-to-1 n-bit multiplexer is used to select the correct value of the partial remainder at the output of the cell, $PR_{i+K-1}$. This multiplexer is controlled by $q_{i+s}$, for s = 0, ..., K-2.
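The component counts of the basic cell follow directly from the level structure and can be tabulated with a small sketch. The function name and the returned field names are hypothetical, and latches are not counted because their number is not given in closed form in the text.

```python
def basic_cell_resources(K):
    """Count DCAS cells and multiplexers in a K-level basic cell."""
    dcas_per_level = [2 ** (k - 1) for k in range(1, K + 1)]
    # one 1-bit multiplexer per level k >= 2, with 2**(k-1) inputs
    quotient_mux_inputs = [2 ** (k - 1) for k in range(2, K + 1)]
    # a single n-bit multiplexer at level K selects the output remainder
    remainder_mux_inputs = 2 ** (K - 1) if K >= 2 else 0
    return {
        "dcas_per_level": dcas_per_level,
        "dcas_total": sum(dcas_per_level),          # equals 2**K - 1
        "quotient_mux_inputs": quotient_mux_inputs,
        "remainder_mux_inputs": remainder_mux_inputs,
    }
```

For K = 4 this gives 1 + 2 + 4 + 8 = 15 DCAS cells, 1-bit multiplexers with 2, 4 and 8 inputs, and one 8-to-1 n-bit multiplexer, matching the description of Fig. 2. The exponential growth of the DCAS count is why the area advantage is stated only for small K.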

Fig. 2 Basic digit serial cell (K = 4). Legend: DCAS cell; shift 1 bit; one latch; (m-1) latches


3.2 Divider architecture

The implementation of the new algorithm can be totally combinational, totally sequential, or a combination of both. In the totally combinational architecture, N/K basic cells are cascaded as shown in Fig. 3, where N = 32 and K = 4. In this Figure, eight basic cells of Fig. 2 are required, where each cell produces four bits of the quotient. In the totally sequential architecture, the same basic cell is reused to generate all the quotient bits as shown in Fig. 4. In this case, the output remainder will be delayed and then fed back to the input of the cell. The LSB of the generated digit, $q_{i+K}$, is also delayed and fed back to the same cell to be used as a control signal, $C_s$, for the addition/subtraction operation.

Fig. 3 Totally combinational divider architecture

Fig. 4 Totally sequential divider architecture

Fig. 5 Combined divider architecture

In the combined implementation, L basic cells are cascaded, and the resulting supercell generates KL bits of the quotient. This supercell is reused N/(KL) times to generate all the bits of the quotient, as shown in Fig. 5 for L = 2 and K = 4. It is left to the designer to select the alternative that achieves the best tradeoff between speed, area and throughput.
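The three implementation styles differ only in how many basic cells are instantiated and how often they are reused. The sketch below tabulates this under the assumption that N is divisible by K (and by KL in the combined case); the pass count N/(KL) follows from each pass of the supercell producing KL of the N quotient bits. The function name and the style labels are hypothetical.

```python
def architecture_plan(N, K, style, L=None):
    """Return (number of basic cells, number of passes through them)."""
    if N % K != 0:
        raise ValueError("N must be a multiple of K")
    if style == "combinational":
        return N // K, 1                 # all cells cascaded, used once
    if style == "sequential":
        return 1, N // K                 # one cell reused for every digit
    if style == "combined":
        if L is None or N % (K * L) != 0:
            raise ValueError("combined style needs L with N divisible by K*L")
        return L, N // (K * L)           # supercell of L cells, reused
    raise ValueError("unknown style: " + style)
```

For N = 32 and K = 4 this gives eight cells used once, one cell used eight times, or, with L = 2, two cells used four times.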

4 Evaluation of architecture

The area per quotient bit of the new structure, $A_n$, the corresponding time, $T_n$, and the area-time, $AT_n$ (for K = 2 and K = 4), are now compared with the corresponding area $A_o$, time $T_o$ and area-time $AT_o$ of the original architecture (K = 1) of [3, 4]. In this evaluation it is assumed that both architectures are totally combinational and that the fed-backward digital controlled add/subtract (DCASB) cell [3, 4] is used in the basic cell. It is also assumed that a carry propagate adder is used in the DCASB cell and that the wordlength N is equal to 32. It should be noted that $A_o$, $T_o$ and $AT_o$ of [3, 4] are similar to those of [1, 2]. The area $A_n$ and time $T_n$ of the new algorithm are calculated as half (quarter) the area and time required to generate $q_i$ and $q_{i+1}$ ($q_i$, $q_{i+1}$, $q_{i+2}$ and $q_{i+3}$) for K = 2 (4).

In these calculations it is assumed that an AND gate is equivalent to two NAND gates, an exclusive-OR (XOR) is equivalent to three NAND gates, and so on [6]. The units of area, $\Delta_a$, and of time, $\Delta_t$, represent the area required and the time taken by a NAND gate, respectively, where the values of $\Delta_a$ and $\Delta_t$ depend on the technology used. For example, the ES2 (European Silicon Structures) typical value of $\Delta_t$ using 1.0 µm CMOS technology is 0.46 ns.
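The NAND-equivalent accounting can be sketched as follows. Only the AND and XOR weights and the 0.46 ns gate delay come from the text; the function names, the example gate counts and the assumption that a delay is expressed as a number of NAND delays on the critical path are illustrative.

```python
# NAND-gate equivalents quoted in the text; other gate types would need
# their own weights, which are not listed here.
NAND_EQUIVALENTS = {"NAND": 1, "AND": 2, "XOR": 3}


def area_in_nand_units(gate_counts):
    """Total area in NAND-gate units for a dict of {gate type: count}."""
    return sum(NAND_EQUIVALENTS[gate] * count
               for gate, count in gate_counts.items())


def delay_in_ns(nand_delays, nand_delay_ns=0.46):
    """Convert a critical path of `nand_delays` NAND delays to nanoseconds."""
    return nand_delays * nand_delay_ns
```

As a usage example, a hypothetical block with two XOR gates, two AND gates and one NAND gate occupies 2·3 + 2·2 + 1 = 11 NAND units, and a critical path of ten NAND delays corresponds to about 4.6 ns at 0.46 ns per gate.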

The area, time and area-time, as a function of the digit size n, are shown in Figs. 6-8, respectively. It is clear from these graphs that the speed of the new architecture can be improved up to K times that of the original one. For K = 2 and n ≤ 8, the speed of the new structure is nearly twice that of the original one and moreover it requires less area, while less area-time can be achieved for any digit size n. For K = 4, the speed can be improved up to four times that of the original one, while the required area is less than twice that of the original structure.

The comparison of the new structure with the binary bit parallel one (n = 32) can also be seen in Fig. 6. While the original digit serial architecture is slower than the bit parallel one, the new architecture can be faster and more efficient than the binary bit parallel one. For instance, if K = 4 and n = 4, the speed of the new architecture is twice that of the bit parallel one and moreover it requires a smaller area.

Fig. 6 Area of new architecture and of original architecture of [3, 4] (horizontal axis: digit size)

Fig. 7 Time of new architecture and of original architecture of [3, 4] (horizontal axis: digit size; curves: original (K = 1), K = 2, K = 4, bit parallel (n = 32))

Fig. 8 Area-time of new architecture and of original architecture of [3, 4] (horizontal axis: digit size; curves: original (K = 1), K = 2, K = 4)

5 Conclusions

In this paper, a new digit serial division algorithm is presented. The evaluation of the new architecture has shown that the new algorithm is much more efficient than all the existing digit serial algorithms that are based on 2’s complement arithmetic. By comparing it with the conventional bit parallel one, the new algorithm is faster and requires a smaller area. In addition to the flexible method of implementation, the new algorithm offers the designers flexibility in selecting the appropriate values of the number of levels K, digit size n, and number of pipelining levels. The selection of these parameters is influenced by speed, cost and throughput requirements.

6 References

1 HARTLEY, R.I., and CORBETT, P.F.: ‘Digit-serial processing techniques’, IEEE Trans. Circuits Syst., 1990, 37, (6), pp. 707-719

2 PARHI, K.K.: ‘A systematic approach for design of digit-serial signal processing architectures’, IEEE Trans. Circuits Syst., 1991, 38, (4), pp. 358-375

3 BASHAGHA, A.E., and IBRAHIM, M.K.: ‘A new digit-serial divider architecture’, Int. J. Electron., 1993, 75, (1), pp. 133-140

4 BASHAGHA, A.E., and IBRAHIM, M.K.: ‘Radix digit serial pipelined divider/square root architecture’, IEE Proc., Comput. Digit. Tech., 1994, 141, (6), pp. 375-380

5 BASHAGHA, A.E., and IBRAHIM, M.K.: ‘A digit serial division algorithm’, Electron. Lett., 1995, 31, (12), pp. 659-661

6 HWANG, K.: ‘Computer arithmetic: principles, architecture and design’ (Wiley, New York, 1979)

