
Algorithm and architecture for a high density, low power scalar product macrocell

J. Gu, C.-H. Chang and K.-S. Yeo

Abstract: The authors present a design approach for an arithmetic macrocell that computes the scalar product of two vectors, an operation ubiquitously present in the solution of many communications and digital signal processing problems. The core of the proposed architecture is a full combinational design containing a partial product generator, a partial product accumulator and a vector accumulator. The design addresses the competing optimisation goals of VLSI area, power dissipation and latency in the deep submicron regime. Compared with conventional merged arithmetic architectures, the proposed macrocell design represents a substantial improvement in the VLSI layout with little area wastage, a high degree of regularity and a good scalability for different vector lengths and operand widths. A theoretical analysis shows that the design of a 16-bit scalar product multiplier for input vectors with 16 elements, in comparison with a traditionally designed architecture, achieves a saving of 38.6% in the silicon area, an up to 73% increase in the area usage efficiency and a 29.4% saving in the interconnect delay. Post-layout simulations of the proposed circuit, based on a 0.18 μm CMOS process, show an average power dissipation of 64.96 mW and a latency of 6.92 ns at a standard supply voltage of 1.8 V, a superior performance for a single cycle instruction in a high-speed, low voltage 16-bit digital signal processor operating at 144 MHz. The use of shorter interconnects and more equalised interconnect delays means that the power dissipation and delay incurred by the interconnects are substantially reduced. Post-layout simulation of our proposed circuit at supply voltages ranging from 0.7 to 3.3 V shows a significant power reduction of 6 to 13% over the pre-layout simulation results of the conventional design.

1 Introduction

Contemporary digital signal processing algorithms for image processing and telecommunications applications are increasingly dependent on matrix- and vector-like arithmetic where the same kind of simple or compound operations are carried out on different elements of two or more vectors [1–3]. Examples of algorithms and applications that require such computations in their implementations are the discrete Fourier transform, the discrete cosine transform and the wavelet transform used widely in speech and image processing applications [1, 4], and finite impulse response, infinite impulse response digital filtering and adaptive decision-directed least mean square algorithms used in signal identification, waveform shaping, channel equalisation, and magnetic storage technology [5–9]. Scalar (or inner) product computation forms the basis of such special purpose complex arithmetic that is inefficiently handled by software on the core central processing unit or general-purpose digital signal processor. Therefore, the hardware-based implementation of scalar product computations is not only desirable, but also indispensable if we are to achieve the processing capacity required by applications such as real-time digital video and mobile communications systems. The trend in wireless communication systems has been to move the analog-to-digital convertor closer to the antenna to do more functions digitally at an increasingly higher data rate. However, the processing speed and power requirements of the digital modem and channel equalisation functions are not achievable without some form of a dedicated ASIC preprocessor [7–9]. Furthermore, the development of such a preprocessor or macrocell has been deterred by the excessive use of silicon area and the numerous interconnections and interminable layout details to be managed and optimised manually. Because of the restricted VLSI area, the design of such macrocells has often been restricted to small word widths and limited input vector sizes.

Integrated circuit (IC) designers are earnestly seeking fast multipliers and adders to enable the exploration of a higher degree of temporal parallelism for vector operations through a deeply pipelined multicycle architecture [2, 3, 10]. Many researchers have studied and investigated algorithms and hardware architectures for the implementation of the scalar product or similar calculations based on the classical multiply-accumulate algorithm [11–14]. The most commonly used parallel tree multiplier in these implementations is that of Wallace, who introduced the first array multiplier using a carry-save adder tree [15]. This structure has been further improved by Weinberger using (4,2) compressors [3, 10]. Dadda has proposed a serial-input serial-output pipelined unit for the scalar product computations [13]. Breveglieri and Dadda have also proposed a parallel input scalar product macrocell designed specifically to interface with the ST9 8-bit microcontroller [12]. Recently, Lin has introduced a reconfigurable inner product architecture whose array dimensions and precision can be dynamically configured [14].

© IEE, 2004

IEE Proceedings online no. 20040328

doi: 10.1049/ip-cdt:20040328

The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore

Paper received 13th November 2002


In order to achieve the configurability, more lines are needed to redirect the data, which increases the interconnect capacitances. Swartzlander has introduced an alternative structure of merged arithmetic to implement the inner product function [3, 16]. A revised merged arithmetic architecture was later proposed by Feiste and Swartzlander [17]. The former fully merged architecture has an irregular global interconnection, which decreases the chip area usage and increases the nonrecurring engineering cost to redesign an implementation for a different operand size. The latter partially merged architecture achieves a better structural regularity at the expense of absolute gate count and speed, and is a compromise between the fully merged and the conventional structure [17]. Since the basic modules are based on a normal multiplier structure, which is irregular, the physical layout regularity is somewhat limited. The two-term merged multiplication adder proposed by Choe and Swartzlander [18] can be used to compute the multi-term scalar product, but the speed will be sacrificed. Wang et al. [19] proposed two methods to compress the partial product columns. Their methods are aimed at reducing the area cost by the reallocation of stages to reduce the cross-stage interconnection of the Dadda tree by using (3,2) counters. Their approaches are suitable for designing a single multiplier, but the optimisation of cross-stage interconnection is not directly applicable to the design of large vector or matrix multipliers using higher input compressors without incurring a high interconnect density.

None of the above architectures have considered the non-trivial effect of the parasitic capacitance of the interconnects on the latency, power consumption and VLSI area in submicron and deep submicron technologies [20, 21]. As IC designs move into the system-on-chip era, the continuous quest for more functionality and the increasing emphasis on dedicated computation-intensive intellectual property cores have resulted in aggressive downscaling of the transistors to provide higher levels of integration. This has the positive effect of reducing the internal gate delay, but at the same time the parasitic capacitance of the interconnecting wires increases commensurately. It is observed that, for a 0.35 μm technology, the gate delay is about 100 ps, the local interconnection delay is about 150 ps, and the global interconnection delay is about 1000 ps. If the technology advances to 0.07 μm, the gate delay decreases to 10 ps and the local interconnection delay decreases to 50 ps, whereas the global interconnection delay increases to 6000 ps. The coupling capacitance between adjacent lines separated by a minimum spacing increases from 40 to 70% [20]. Sylvester and Keutzer [21] have reported that, for a 1 mm metal 1 line, the RC delay in a 0.5 μm technology was 15 ps as compared to 340 ps in a 0.1 μm technology. It is a virtual certainty that the progress in process technologies will outpace any improvements in power supplies. In high performance, power-hungry VLSI and ULSI digital circuits, reducing the parasitic capacitance has emerged as a key design premise for the reduction of dynamic power dissipation, which accounts for more than 90% of the total power in a CMOS device. The parasitic capacitance can be reduced by using fewer and smaller devices as well as fewer and shorter interconnects. Glitches and spurious transitions can also be minimised by equalising the delay paths to the gate inputs [22]. The incentive for power reduction through interconnect optimisation is underpinned by the fact that lowering the supply voltage below the present 3.3 V comes with moderate to severe speed penalties and reduces the transistors' capability to drive the increased internal node capacitances due to complex interconnects.

For these reasons, the scalar product macrocell we now present focuses on deriving the benefits of power, delay and area reduction efficiency for emerging process technologies from the inception of algorithmic changes through to the architectural and physical design space explorations. The proposed algorithm has led to the creation of a low power, high performance bit-parallel architecture for the computation of scalar products. While the number of transistors required to implement a macrocell with large input vector sizes and operand width is inevitably high, this is mitigated by a significant increase in VLSI area usage density in our method. Our architecture possesses a high modularity and hierarchical regularity, enabling the use of short local interconnections and thus simplifying the global interconnects. Unlike in merged arithmetic architectures, the signal arrival time at the inputs of each module is much easier to equalise to reduce glitches and spurious transitions. Thanks to the hierarchical structure, the amount of nonrecurring engineering work is small, as only a one-time customisation of one partial product accumulator and one vector accumulator is needed, with the other partial product accumulators being reused. We illustrate the layout regularity by showing the floor planning of the macrocell design, which accepts two input vectors with 16 elements and a bit width of 16. Theoretical estimation and simulation results indicate that the proposed architecture is able to reduce the overall power, area and delay requirements to a greater extent than a conventional architecture. Since it is essentially a combinational design, the proposed macrocell is suitable for implementation as either a self-timed core or a synchronous multicycle core with an appropriate bus interface to high-speed processors. With the issue of only a single instruction, the macrocell can execute vector operations autonomously, effectively reducing both the instruction count and the cycles-per-instruction factors of the processor.

2 Conventional algorithm for the VLSI design of a vector multiplier

Let $X = [x_0, x_1, \ldots, x_{N-1}]^T$ be an N-element vector and $A = [a_{mn}]$ be an $M \times N$ matrix. The product vector $Y = [y_0, y_1, \ldots, y_{M-1}]^T$ is computed by:

$$
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{bmatrix}
=
\begin{bmatrix}
a_{00} & a_{01} & \cdots & a_{0[N-1]} \\
a_{10} & a_{11} & \cdots & a_{1[N-1]} \\
\vdots & \vdots & \ddots & \vdots \\
a_{[M-1]0} & a_{[M-1]1} & \cdots & a_{[M-1][N-1]}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix}
\qquad (1)
$$

The elements of Y are:

$$y_m = \sum_{n=0}^{N-1} a_{mn} x_n \qquad (2)$$

On a standard processor that does not have a vector multiplication instruction, the matrix-vector multiplication is performed by a series of programmed multiply-accumulate instructions issued sequentially. The computation of (2) can be sped up in hardware by streaming the flow of the operands for pipelined operation. In a synchronous array implementation, the clock distribution problem and clock skews due to different clock path lengths and loads need to be addressed, whereas in a self-timed design, the queue capacity has to be optimised to maintain the desired pipelining period and avoid deadlock. We now develop an algorithm to accelerate the computation of (2) in a fully combinational structure in order to reduce the number of intermediate stages and solve the clock distribution problem.
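As a point of reference for the bit-level formulas that follow, the sketch below (a hypothetical Python model, not part of the original paper) computes (2) directly with a sequential multiply-accumulate loop; the later reconstructions can be checked against it.

```python
# Reference model for (2): y_m = sum_n a[m][n] * x[n], computed by a
# sequential multiply-accumulate loop as a general-purpose processor would.
def matrix_vector_product(A, X):
    """A: M x N list of signed integers, X: length-N list of signed integers."""
    return [sum(a_mn * x_n for a_mn, x_n in zip(row, X)) for row in A]

# Example: a 2 x 3 matrix times a 3-element vector.
A = [[3, -2, 5],
     [-1, 4, 0]]
X = [2, 7, -3]
print(matrix_vector_product(A, X))   # [-23, 26]
```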

In what follows, we will consider the basic algorithm for VLSI implementation of (2), which is a vector multiplication (scalar product), in a traditional approach. The elements of both the matrix A and the vector X are represented in two's complement form. Their word lengths are assumed to be R and S bits, respectively.

$$a_{mn} = -(a_{mn})_{R-1} 2^{R-1} + \sum_{r=0}^{R-2} (a_{mn})_r 2^r \qquad (3)$$

and

$$x_n = -(x_n)_{S-1} 2^{S-1} + \sum_{s=0}^{S-2} (x_n)_s 2^s \qquad (4)$$

where $(a_{mn})_r$ is the rth bit of $a_{mn}$ and $(x_n)_s$ is the sth bit of $x_n$; $(a_{mn})_{R-1}$ and $(x_n)_{S-1}$ are the most significant bits (MSBs) of $a_{mn}$ and $x_n$, respectively.
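A brief numeric illustration of the decomposition in (3) and (4), written as a hypothetical Python sketch (not from the paper): the signed value is recovered from its two's complement bits with a negatively weighted MSB.

```python
# Decompose an R-bit two's complement integer into its bits (LSB first),
# then rebuild it as -(MSB)*2^(R-1) + the sum of the remaining weighted bits,
# exactly as in (3) and (4).
def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]   # bits[i] = i-th bit

def from_bits(bits):
    width = len(bits)
    return -bits[-1] * (1 << (width - 1)) + sum(b << i for i, b in enumerate(bits[:-1]))

R = 16
for a in (-32768, -1234, 0, 1, 32767):
    assert from_bits(to_bits(a, R)) == a   # (3) holds for every R-bit value
print("two's complement decomposition verified")
```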

Substituting (3) and (4) into (2), we obtain (5) and (6):

$$
\begin{aligned}
y_m &= \sum_{n=0}^{N-1} \left[ -(a_{mn})_{R-1}2^{R-1} + \sum_{r=0}^{R-2}(a_{mn})_r 2^r \right] \times \left[ -(x_n)_{S-1}2^{S-1} + \sum_{s=0}^{S-2}(x_n)_s 2^s \right] \\
&= \sum_{n=0}^{N-1} \left[ (a_{mn})_{R-1}(x_n)_{S-1}2^{R+S-2} + \sum_{r=0}^{R-2}\sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s} - (a_{mn})_{R-1}2^{R-1}\sum_{s=0}^{S-2}(x_n)_s 2^s - (x_n)_{S-1}2^{S-1}\sum_{r=0}^{R-2}(a_{mn})_r 2^r \right] \qquad (5) \\
&= \sum_{n=0}^{N-1} \left[ (a_{mn})_{R-1}(x_n)_{S-1}2^{R+S-2} + \sum_{r=0}^{R-2}\sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s} + \sum_{s=0}^{S-2}\left(1-(a_{mn})_{R-1}(x_n)_s\right)2^{s+R-1} + \sum_{r=0}^{R-2}\left(1-(a_{mn})_r(x_n)_{S-1}\right)2^{r+S-1} - \sum_{r=S-1}^{R+S-3}2^r - \sum_{s=R-1}^{R+S-3}2^s \right] \qquad (6)
\end{aligned}
$$

Using the identity $\sum_{i=j}^{k-1} 2^i = 2^k - 2^j$, the last two items of the constant accumulation are given by:

$$-\sum_{r=S-1}^{R+S-3} 2^r - \sum_{s=R-1}^{R+S-3} 2^s = -2^{R+S-1} + 2^{S-1} + 2^{R-1} \qquad (7)$$

These bits in (7) are the correction bits; they are used to reduce the gate count that would otherwise be required for the sign extension bits of each partial product term. Collectively, they are termed the correction vector.

Thus, (6) can be expressed as:

$$
\begin{aligned}
y_m = \sum_{n=0}^{N-1} \Bigg[ & (a_{mn})_{R-1}(x_n)_{S-1}2^{R+S-2} + \sum_{r=0}^{R-2}\sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s} \\
& + \sum_{s=0}^{S-2}\overline{(a_{mn})_{R-1}(x_n)_s}\,2^{s+R-1} + \sum_{r=0}^{R-2}\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} - 2^{R+S-1} + 2^{S-1} + 2^{R-1} \Bigg] \qquad (8)
\end{aligned}
$$

where the overbar denotes the complement of the enclosed partial product bit, i.e. $\overline{b} = 1 - b$, as introduced in (6).

A vector multiplier as implied by (8) requires a series of integer multipliers to produce the intermediate results and an accumulator to add them up. As an example, a vector multiplier for two four-element vectors with a word length of four is shown in Fig. 1. Each symbol '*' denotes a partial product bit, and the '*'s in the shaded region denote the bits to be complemented. The symbols 'p' enclosed in a rectangular box denote the intermediate result bits generated by each of the individual multipliers operating in parallel. They are further summed to produce the final result bits, denoted by a rectangular box of the symbols 's'.
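To make the sign handling in (8) concrete, here is a small bit-level check, offered as a hypothetical Python sketch rather than anything from the paper: it evaluates (8) term by term and compares the result against the directly computed scalar product.

```python
import random

def bits(v, width):          # two's complement bits of v, LSB first
    return [(v >> i) & 1 for i in range(width)]

def scalar_product_eq8(a_vec, x_vec, R, S):
    """Evaluate (8): partial products with complemented MSB rows/columns
    plus the per-element correction constant -2^(R+S-1) + 2^(S-1) + 2^(R-1)."""
    total = 0
    for a, x in zip(a_vec, x_vec):
        A, X = bits(a, R), bits(x, S)
        t = A[R-1] * X[S-1] << (R + S - 2)
        t += sum((A[r] * X[s]) << (r + s) for r in range(R-1) for s in range(S-1))
        t += sum((1 - A[R-1] * X[s]) << (s + R - 1) for s in range(S-1))
        t += sum((1 - A[r] * X[S-1]) << (r + S - 1) for r in range(R-1))
        t += -(1 << (R + S - 1)) + (1 << (S - 1)) + (1 << (R - 1))
        total += t
    return total

R = S = 16
for _ in range(1000):
    a_vec = [random.randint(-2**(R-1), 2**(R-1)-1) for _ in range(16)]
    x_vec = [random.randint(-2**(S-1), 2**(S-1)-1) for _ in range(16)]
    assert scalar_product_eq8(a_vec, x_vec, R, S) == sum(a*x for a, x in zip(a_vec, x_vec))
print("equation (8) matches the direct scalar product")
```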

The layout irregularity of the vector multiplier obtained using the traditional approach often leads to an excessively low efficiency in the VLSI area usage. Also, the interconnect wires are long, which causes a longer critical path delay and a higher power dissipation.

3 Proposed algorithm for the VLSI design of a vector multiplier

In this Section, we propose an alternative approach to the design of a highly modular vector multiplier. The method is based on the decomposition of the signed integer matrix of (1) into N weighted Boolean matrices. Weighted products of the input array elements, gated by the elements of each Boolean matrix, are generated and followed by a carry-free accumulation. The advantage of this decomposition algorithm is that the carry-propagation delay is incurred only once at the very end, instead of at each addition step, increasing the opportunity to explore a higher density design with a more granular layout.

From (5):

$$
\begin{aligned}
y_m = \sum_{n=0}^{N-1} \Bigg[ & -(x_n)_{S-1}2^{S-1}\sum_{r=0}^{R-2}(a_{mn})_r 2^r + \sum_{r=0}^{R-2}\sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s} \\
& + (a_{mn})_{R-1}(x_n)_{S-1}2^{R+S-2} - (a_{mn})_{R-1}2^{R-1}\sum_{s=0}^{S-2}(x_n)_s 2^s \Bigg] \\
= \sum_{n=0}^{N-1} \Bigg\{ & \sum_{r=0}^{R-2}\left(1-(a_{mn})_r(x_n)_{S-1}\right)2^{r+S-1} + \sum_{r=0}^{R-2}\sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s} \\
& - \left[\left(1-(a_{mn})_{R-1}(x_n)_{S-1}\right)2^{R+S-2} + (a_{mn})_{R-1}2^{R-1}\sum_{s=0}^{S-2}(x_n)_s 2^s - 2^{R+S-2}\right] - \sum_{r=S-1}^{R+S-3}2^r \Bigg\} \\
= \sum_{n=0}^{N-1} \Bigg\{ & \sum_{r=0}^{R-2}\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} + \sum_{r=0}^{R-2}\sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s} \\
& - \left[\overline{(a_{mn})_{R-1}(x_n)_{S-1}}\,2^{R+S-2} + (a_{mn})_{R-1}2^{R-1}\sum_{s=0}^{S-2}(x_n)_s 2^s\right] + 2^{R+S-2} - \sum_{r=S-1}^{R+S-3}2^r \Bigg\} \qquad (9)
\end{aligned}
$$

Fig. 1 Traditional algorithm for the VLSI design of a vector multiplier

A regular array of accumulation units will not only utilise the silicon area more efficiently, but also reduce the glitches caused by the unequal delay paths of the input signals to the accumulation units. In order to obtain a regular layout of the accumulation units, we swap the accumulation order of r and n:

$$
\begin{aligned}
y_m = & \sum_{r=0}^{R-2}\left\{\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} + \sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s}\right]\right\} \\
& - 2^{R-1}\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_{R-1}(x_n)_{S-1}}\,2^{S-1} + \sum_{s=0}^{S-2}(a_{mn})_{R-1}(x_n)_s 2^s\right] + \sum_{n=0}^{N-1}2^{S-1} \qquad (10)
\end{aligned}
$$

The width of $x_n$ is S bits. Let $\lambda = \lceil \log_2 N \rceil$ be the smallest integer larger than or equal to $\log_2 N$. The expression:

$$\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_{R-1}(x_n)_{S-1}}\,2^{S-1} + \sum_{s=0}^{S-2}(a_{mn})_{R-1}(x_n)_s 2^s\right]$$

can always be represented in exactly $(S+\lambda)$ bits. Let the result be expressed as:

$$\sum_{i=0}^{S+\lambda-1}(z)_i 2^i \qquad (11)$$

Substituting (11) into (10), we have:

$$
\begin{aligned}
y_m = & \sum_{r=0}^{R-2}\left\{\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} + \sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s}\right]\right\} - 2^{R-1}\sum_{i=0}^{S+\lambda-1}(z)_i 2^i + N2^{S-1} \\
= & \sum_{r=0}^{R-2}\left\{\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} + \sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s}\right]\right\} + 2^{R-1}\sum_{i=0}^{S+\lambda-1}\left(1-(z)_i\right)2^i + N2^{S-1} - 2^{R-1}\sum_{i=0}^{S+\lambda-1}2^i \\
= & \sum_{r=0}^{R-2}\left\{\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} + \sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s}\right]\right\} + 2^{R-1}\sum_{i=0}^{S+\lambda-1}\overline{(z)_i}\,2^i + N2^{S-1} - 2^{R-1}\left(2^{S+\lambda}-1\right) \qquad (12)
\end{aligned}
$$

Comparing (8) and (12), we find that (8) needs $N(S-1) + N(R-1)$ inverters to complement the bits, while (12) needs $(S+\lambda) + N(R-1)$ inverters. To illustrate the significance of this implication, consider the case where $N = 16$ and $S = R = 16$; then $\lambda = \log_2 N = 4$. Equation (8) requires 480 inverters whereas (12) requires only 260 inverters, which represents a saving of 45.8% in the number of inverters and potential spurious transitions.
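The following sketch (hypothetical Python, using the symbols of (10)–(12) above and not taken from the paper) numerically checks the rewritten form (12) against the direct scalar product and reproduces the inverter-count comparison.

```python
import math, random

def bits(v, width):          # two's complement bits of v, LSB first
    return [(v >> i) & 1 for i in range(width)]

def scalar_product_eq12(a_vec, x_vec, R, S):
    """Evaluate (12): complemented MSB column in every r-row, complemented
    z word, and the constant N*2^(S-1) - 2^(R-1)*(2^(S+lam) - 1)."""
    N = len(a_vec)
    lam = math.ceil(math.log2(N))
    A = [bits(a, R) for a in a_vec]
    X = [bits(x, S) for x in x_vec]
    total = 0
    for r in range(R - 1):                      # rows gated by (a)_r, r < R-1
        for n in range(N):
            total += (1 - A[n][r] * X[n][S-1]) << (r + S - 1)
            total += sum((A[n][r] * X[n][s]) << (r + s) for s in range(S - 1))
    z = sum(((1 - A[n][R-1] * X[n][S-1]) << (S - 1))
            + sum((A[n][R-1] * X[n][s]) << s for s in range(S - 1)) for n in range(N))
    zbits = bits(z, S + lam)                    # z fits in exactly S+lam bits
    total += (1 << (R - 1)) * sum((1 - zi) << i for i, zi in enumerate(zbits))
    total += N * (1 << (S - 1)) - (1 << (R - 1)) * ((1 << (S + lam)) - 1)
    return total

N, R, S = 16, 16, 16
for _ in range(200):
    a_vec = [random.randint(-2**(R-1), 2**(R-1)-1) for _ in range(N)]
    x_vec = [random.randint(-2**(S-1), 2**(S-1)-1) for _ in range(N)]
    assert scalar_product_eq12(a_vec, x_vec, R, S) == sum(a*x for a, x in zip(a_vec, x_vec))

lam = math.ceil(math.log2(N))
print("inverters, eq. (8): ", N*(S-1) + N*(R-1))       # 480
print("inverters, eq. (12):", (S + lam) + N*(R-1))     # 260
```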

In terms of VLSI layout, every arithmetic module is regular in shape. Using the same example and notations as in Fig. 1, the structural differences of the proposed method are shown in Fig. 2.

4 Layout oriented architecture design of the scalar product macrocell

Previous attempts to design a dedicated macrocell to accelerate the computation of a scalar product have often been thwarted by limitations on the silicon area and the restricted routing channels available for the interconnect wires. This Section and the following Section address the unique advantages of our algorithm when it is translated into a hardware architecture, particularly when the design is to be implemented in deep-submicron technology where the signal propagations and transitions are dominated by the interconnect delay. We will consider the design example of a scalar product macrocell operating on two input vectors. Each input vector consists of 16 signed integer elements in two's complement form. The word length of each element is 16. In other words, N = 16 and R = S = 16 for (12).

The architecture of the scalar product macrocell is mainly composed of three parts, namely the partial product generator, the partial product accumulator and the vector accumulator, as shown in Fig. 3. For convenience, they are abbreviated as PPG, PPA and VA, respectively.

The PPG generates the partial product from the input vector X according to the bit value of $(a_n)_r$. It is composed of a row of AND gates, except for the most significant bit, which is generated by a NAND gate because the MSB needs to be complemented.
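As a minimal illustration of this gating (a hypothetical Python sketch consistent with the complemented MSB terms of (12), not taken from the paper):

```python
# One PPG row: the S bits of an element of X gated by a single bit (a)_r.
# AND gates everywhere except the MSB position, which uses a NAND,
# producing the complemented bit required by (12).
def ppg_row(a_bit, x_bits):
    """x_bits: LSB-first list of the S bits of an element of X."""
    row = [a_bit & xb for xb in x_bits]   # AND-gated partial product bits
    row[-1] ^= 1                          # NAND at the MSB: complement it
    return row

print(ppg_row(1, [1, 0, 1, 1]))   # [1, 0, 1, 0]
print(ppg_row(0, [1, 0, 1, 1]))   # [0, 0, 0, 1]
```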

The PPA sums up all the partial products of the same weight and generates the intermediate result. This makes the accumulating circuit form a rectangular structure, as shown in Fig. 4.

Fig. 2 Proposed algorithm for the VLSI design of a vector multiplier


Such a structure permits the optimisation of routing channels with short, equal interconnects between neighbouring arithmetic cells.

The accumulator is realised by a Wallace tree structure [3, 10, 15, 23], which consists of three layers of (4,2) compressors [3, 10]. Instead of summing to a final result, which often needs a time-consuming carry propagating adder (CPA), every accumulator just produces a stored-carry intermediate result generated by the third layer of (4,2) compressors, as shown in Fig. 5. The elimination of the internal CPAs by keeping the intermediate sum in stored-carry format will reduce the computing time and the number of glitches.
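A compact behavioural model of the (4,2) compressor built from two cascaded full adders, offered here as a hypothetical Python sketch (the paper's Fig. 8 gives the actual gate-level cell); the exhaustive check confirms the stored-carry invariant that the Wallace tree relies on.

```python
from itertools import product

# (4,2) compressor modelled as two cascaded full adders.
# Invariant: x1+x2+x3+x4+cin == sum + 2*(carry + cout),
# and cout is independent of cin, so a column of compressors never ripples.
def full_adder(a, b, c):
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)     # (sum, carry)

def compressor_4_2(x1, x2, x3, x4, cin):
    s1, cout = full_adder(x1, x2, x3)     # cout goes to the next column's cin
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout             # weights: 1, 2, 2

for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, c, co = compressor_4_2(x1, x2, x3, x4, cin)
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (c + co)
print("(4,2) compressor invariant verified")
```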

The VA sums up all the 16 stored-carry intermediate results generated by the PPAs, which is equivalent to adding 32 binary numbers. These stored-carry numbers are weighted differently and their dot notation forms a parallelogram, as shown in Fig. 6. All the bits in the stored-carry sum of the highest weight are negated before the accumulation. The letter 'e' in the Figure denotes the correction bits. Compared to the other 16 regular PPAs, this single irregularity of the VA is trivial. Moreover, the 16 PPAs cover most of the area of the chip, and the arithmetic circuits of the VA are interleaved uniformly over the PPAs (refer to the floor planning in Fig. 10 of Section 5), making the irregularity almost indiscernible.

The vector accumulator also employs the Wallace tree structure, with four layers of (4,2) compressors, as shown in Fig. 7. The correction vector is added to the last layer of (4,2) compressors to form (5,2) compressors. The final 36-bit result is computed by a CPA.

Since we use the stored-carry format as the intermediate result representation, the correction vector will be different from that given by (7). Hence, (11) is modified to (13) as follows:

$$\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_{R-1}(x_n)_{S-1}}\,2^{S-1} + \sum_{s=0}^{S-2}(a_{mn})_{R-1}(x_n)_s 2^s\right] = \sum_{i=0}^{S+\lambda-2}(s)_i 2^i + \sum_{i=1}^{S+\lambda-2}(c)_i 2^i \qquad (13)$$

where the first and second terms of (13) are the sum and carry components of the stored-carry representation of the binary number given by (11).

Substituting (13) into (10), we have:

$$
\begin{aligned}
y_m = & \sum_{r=0}^{R-2}\left\{\sum_{n=0}^{N-1}\left[\overline{(a_{mn})_r(x_n)_{S-1}}\,2^{r+S-1} + \sum_{s=0}^{S-2}(a_{mn})_r(x_n)_s 2^{r+s}\right]\right\} \\
& + 2^{R-1}\left[\sum_{i=0}^{S+\lambda-2}\left(1-(s)_i\right)2^i + \sum_{i=1}^{S+\lambda-2}\left(1-(c)_i\right)2^i\right] + N2^{S-1} - 2^{R+S+\lambda-1} + 2^{R-1} + 2^R \qquad (14)
\end{aligned}
$$

The correction vector is $N2^{S-1} - 2^{R+S+\lambda-1} + 2^{R-1} + 2^R$. In this case, it is $2^{35} + 2^{19} + 2^{16} + 2^{15}$ (with the negative term represented modulo $2^{36}$ in the 36-bit result).
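A quick arithmetic check of this constant (hypothetical Python, assuming R = S = 16, N = 16 and the 36-bit result width stated above):

```python
# Correction vector N*2^(S-1) - 2^(R+S+lam-1) + 2^(R-1) + 2^R, folded into
# the 36-bit result width, should equal 2^35 + 2^19 + 2^16 + 2^15.
N, R, S, lam, result_bits = 16, 16, 16, 4, 36
cv = N * 2**(S-1) - 2**(R+S+lam-1) + 2**(R-1) + 2**R
print(cv % 2**result_bits == 2**35 + 2**19 + 2**16 + 2**15)   # True
```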

Fig. 4 Rectangular structure of the PPA

Fig. 5 Wallace tree structure of the PPA

Fig. 6 The parallelogram structure of the VA

Fig. 3 The proposed architecture for the scalar product macrocell

Fig. 7 The Wallace tree structure of the VA


5 Floor planning and area usage estimation

Owing to the large number of transistors involved in the design, it is extremely difficult to make a custom layout of the entire circuit. Fortunately, the architecture of the scalar product macrocell is highly modular, making it possible to handcraft the physical design hierarchically. We arrange and optimise the leaf cells first, followed by their assembly into functional blocks. The functional blocks are abutted to form the complete circuit. The goal of our floor planning is to make the interconnecting wires as short as possible and the difference in length of the interconnecting wires of adjacent layers of the Wallace tree as small as possible.

The main leaf cells of our example are (4,2) compressors, (5,2) compressors, full adders and half adders. In order to make a valid floor planning, we need to estimate the sizes and shapes of these leaf cells.

Transistors in an IC have different sizes. For example, transistors in parallel are sized differently from those in series; p-type transistors are sized differently from n-type transistors; transistors in the critical path are sized differently from those in the non-critical path, and so on. To simplify the estimation of the sizes of the leaf cells and the floor planning of the entire design, we assume that the average size of a MOS transistor is $S_T = h \times w$, where h is the height and w is the width, as shown in Fig. 8. Two or three transistors in series are physically designed as a dual or triple gate transistor. This can be treated as having the same average dimension as one transistor because it occupies an area only a little larger than that of a single transistor. More than three transistors in series are avoided due to their low speed. Figure 8 shows the layout process of the leaf cell from the individual transistor, XOR gate and full adder to the (4,2) compressor. The 'NN' and 'PP' blocks in Fig. 8 are two MOS transistors in series, where 'N' stands for NMOS and 'P' stands for PMOS.

Since a (4,2) compressor is composed of two full adders, in order to minimise the length of the interconnecting wires, we arrange the two full adders side by side. The (4,2) compressor in Fig. 8 covers an area of $S_{42} = 4h \times 18w = 72\,S_T$. The reduced version of the (4,2) compressor is composed of a half adder and a full adder. It is used where only four instead of five bits are to be added.

The floor planning of the PPA and the PPG is shown in Fig. 9. Every rectangular block labelled with a number is a column of (4,2) compressors. Each (4,2) compressor is constructed as in Fig. 8. The number refers to the layer number of the block in the Wallace tree. For example, the first layer of blocks labelled '1' are those blocks whose inputs are fed directly from the PPG. The second layer of blocks labelled '2' receives its input from the output of block '1'. Thus, block '1' has 16 (4,2) compressors, block '2' 17 compressors, and block '3' 18 compressors. Their areas are $64h \times 18w$, $68h \times 18w$ and $72h \times 18w$, respectively.

The PPGs are arranged along both sides of block '1' to minimise the connecting wires to the (4,2) compressors. The element of the partial product generator is a NAND gate followed by a NOT gate (except for the MSB), which is of size $S_{AND} = 2h \times 3w$. Every block '1' needs four 16-bit partial product generators, equivalent to 64 elements. Each side of a partial product generator consists of 32 elements, with a width of 3w and a total length of 64h, similar to the overall dimension of block '1'.

The blocks '1' are interpolated by blocks '2' or '3' due to the large number of wires connecting blocks '1' and '2' or '2' and '3'. Such an arrangement would reduce the total length of the wires as much as possible, minimising the switching power and glitches caused by the stray capacitance of the wires.

The PPA and PPG in Fig. 9 cover an area of $S_{PPA} = 72h \times 150w = 10\,800\,S_T$. However, the area is not fully utilised: about $(24w \times 2h) \times 4 + (18w \times h) \times 2 = 228\,S_T$ is unused. The VLSI area usage ratio is:

$$1 - \frac{228}{10\,800} = 97.9\%$$

The vector multiplier needs 16 such PPAs and PPGs together with one VA. The complete floor planning of the macrocell is shown in Fig. 10. The roman-numbered blocks belong to the VA, where the numbers refer to the layer numbers in the Wallace tree.

The width of layer 'i' is 19 bits, layer 'ii' 20 bits, layer 'iii' 24 bits and layer 'iv' 31 bits.

Fig. 9 Floor planning of the PPA and PPG

Fig. 8 The layout hierarchy of a (4,2) compressor


The (3,2) counter layer and the CPA layer are of 31 bits each. The vector multiplier needs:

$$S_{VM} = 320h \times 708w = 226\,560\,S_T \qquad (15)$$

of the silicon area. The used silicon area is:

$$(10\,800 - 228) \times 16 + 72 \times (19 \times 8 + 20 \times 4 + 24 \times 2 + 31 + 31) = 193\,776\,S_T \qquad (16)$$

Therefore, the total area usage of the scalar product macrocell is:

$$\frac{193\,776}{226\,560} = 85.5\%$$

To make a fair comparison and to illustrate the advantage of our proposed method, the architecture offered by the conventional method is laid out using the same leaf cells with the same floor planning goal. A 16-bit × 16-bit standard multiplier is the main building block of the straightforward approach to implementing the architecture of the scalar product macrocell. Its layout is shown in Fig. 11.

Layer '1' has 17 bits, layer '2' 20 bits, layer '3' 29 bits and layer 'CPA' 28 bits. It is evident that the layout of the multiplier is irregular if the aim of the design is to minimise the delay of the interconnecting wires. Regularising the layout would make both the local and the global interconnecting wires longer and their routing interlacing irregular, thus increasing the imparity of the signal arrival times and the glitch power.

The multiplier in Fig. 11 covers an area of:

$$30 \times 4h \times [18 \times 7 + (3+3) \times 4 + 9]w = 120h \times 159w = 19\,080\,S_T \qquad (17)$$

The used area is:

$$(17 \times 4 + 20 \times 2 + 29) \times S_{42} + 16h \times 3w \times 8 = 10\,248\,S_T \qquad (18)$$

The area usage of the multiplier is:

$$\frac{10\,248}{19\,080} = 53.7\%$$

The floor planning of the whole macrocell for the conventional architecture is shown in Fig. 12. The roman-numbered blocks finish the accumulation of all the intermediate products. Block 'i' is of 33 bits, block 'ii' 34 bits, block 'iii' 35 bits and the CPA block beside block 'iii' is of 36 bits.

The whole macrocell covers an area of $(33 \times 4 \times 4)h \times (159 \times 4 + 18 \times 3 + 9)w = 528h \times 699w = 369\,072\,S_T$.

The used area is $10\,248 \times 16 + (33 \times 4 + 34 \times 2 + 35) \times S_{42} + 35 \times S_{FA} = 182\,148\,S_T$.

Therefore, the VLSI area usage is:

$$\frac{182\,148}{369\,072} = 49.4\%$$

From the layout parameters of the conventional and the proposed architectures summarised in Table 1, it is evident that the proposed architecture has advantages over the conventional one. There is a saving of 38.6% in the silicon area and an up to 73% increase in the efficiency of the area usage.
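The headline figures in Table 1 follow directly from the block areas derived above; a hypothetical Python sketch of the arithmetic (all quantities in units of the average transistor area S_T, and assuming the full-adder cell is half a (4,2) compressor, i.e. 36 S_T, which the used-area total above implies):

```python
# Area bookkeeping for the two floor plans, in units of S_T = h x w.
proposed_total = 320 * 708                    # 226 560 S_T, eq. (15)
proposed_used  = (10_800 - 228) * 16 + 72 * (19*8 + 20*4 + 24*2 + 31 + 31)          # eq. (16)
conventional_total = 528 * 699                # 369 072 S_T
conventional_used  = 10_248 * 16 + (33*4 + 34*2 + 35) * 72 + 35 * 36                # S_FA = 36 S_T assumed

print(proposed_used, proposed_total, round(100 * proposed_used / proposed_total, 1))              # 193776 226560 85.5
print(conventional_used, conventional_total, round(100 * conventional_used / conventional_total, 1))  # 182148 369072 49.4
print("area saving:    %.1f%%" % (100 * (1 - proposed_total / conventional_total)))               # 38.6%
print("usage increase: %.0f%%" % (100 * (proposed_used / proposed_total)
                                  / (conventional_used / conventional_total) - 100))              # 73%
```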

Fig. 10 Floor planning of the proposed bit-parallel scalar product macrocell

Fig. 11 Floor planning for a normal 16-bit × 16-bit multiplier


6 Delay estimation

The computation time or the delay is mainly caused by the gate and interconnection delays [20–22]. For both architectures, there is little difference in the gate delay. However, the interconnection delay, especially that of the vector accumulator, varies significantly. We compute the delay from the PPG to the final result.

The gate delay of the proposed architecture is composed of three parts, namely the delays of the PPG, PPA and VA. The gate delay of the PPG is $t_{PPG} = t_{AND}$. Since the PPA has $[(\log_2 n) - 1]$ layers of (4,2) compressors, its gate delay is $t_{PPA} = [(\log_2 n) - 1]t_{42}$, where n is the word length of the input vector element and $t_{42}$ is the longest delay of the (4,2) compressor, which is $3t_{XOR}$ for the circuit shown in Fig. 8. Therefore, $t_{PPA} = 9t_{XOR}$.

The critical path of the VA is composed of two (4,2) compressors in the first two layers, one full adder in the third layer, one half adder in the fourth layer, and two half adders together with 29 full adders in the CPA layer. In the compressor and counter layers, $t_{FA} = 2t_{XOR}$ and $t_{HA} = t_{XOR}$. In the CPA layer, $t'_{FA} = 2t_{NAND}$ and $t'_{HA} = t_{NAND}$. Therefore, the delay of the vector accumulator is given by:

$$t_{VA} = 2t_{42} + t_{FA} + t_{HA} + 2t'_{HA} + 29t'_{FA} = 9t_{XOR} + 60t_{NAND} \qquad (19)$$
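A short sanity check of the gate-delay bookkeeping in (19) and the totals quoted later in Table 2, written as a hypothetical Python sketch that counts delays in units of t_XOR and t_NAND (with t_42 = 3t_XOR, t_FA = 2t_XOR, t_HA = t_XOR, t'_FA = 2t_NAND and t'_HA = t_NAND as defined above):

```python
# Delay bookkeeping as (xor_units, nand_units) pairs.
t42, tFA, tHA = (3, 0), (2, 0), (1, 0)        # compressor/counter layer cells
tFA_cpa, tHA_cpa = (0, 2), (0, 1)             # CPA layer cells

def add(*terms):
    return tuple(sum(t[i] for t in terms) for i in range(2))

t_PPA = add(t42, t42, t42)                                        # 3 layers of (4,2) compressors
t_VA  = add(t42, t42, tFA, tHA, tHA_cpa, tHA_cpa, *([tFA_cpa] * 29))
print("t_PPA =", t_PPA)              # (9, 0)  -> 9 t_XOR
print("t_VA  =", t_VA)               # (9, 60) -> 9 t_XOR + 60 t_NAND, eq. (19)
print("total =", add(t_PPA, t_VA))   # (18, 60) -> 18 t_XOR + 60 t_NAND (plus t_AND for the PPG)
```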

The interconnection delay is technology dependent. It is proportional to the stray capacitance of the connecting wire. Since the width of the wire, d, is normally fixed by the feature size of the process, the capacitance is proportional to the length of the wire, L. Therefore, the interconnection delay is also proportional to L. Let k be the constant of proportionality for a given technology; the delay of the wire is then given by:

$$t_{wire} = kdL \qquad (20)$$

The maximum length of the connecting wire in the critical path can be expressed as:

$$L_{max} = L_{1-2} + L_{2-3} + L_{3-i} + L_{i-ii} + L_{ii-iii} + L_{iii-iv} + L_{iv-CPA} \qquad (21)$$

where the arabic numerals and the roman numerals refer to the layer numbers in the Wallace trees of the partial product accumulators and the vector accumulator, respectively.

Fig. 12 Floor planning for the conventional scalar product macrocell

Table 1: Characteristics of floor planning for the two architectures

                          Proposed        Conventional
Layout area height        320h            528h
Layout area width         708w            699w
Total area                226 560 ST      369 072 ST
Area usage, %             85.5            49.4
Regularity of blocks      regular         irregular


From Fig. 10, the layers '1' and layers '2' of the partial product accumulator are spaced out by a column of partial product generators, so $L_{1-2} = 3w$; $L_{2-3} = 24w$; $L_{3-i} = 66w$; $L_{i-ii} = 150w$; $L_{ii-iii} = 0.5 \times 68h = 34h$; $L_{iii-iv} = 68h$; $L_{iv-CPA} = 0$. Therefore, $L_{max} = 243w + 102h$.

The worst-case delay of the proposed architecture is:

$$T_{proposed} = t_{gate} + t_{wire} = t_{PPG} + t_{PPA} + t_{VA} + kdL_{max} = 18t_{XOR} + 60t_{NAND} + t_{AND} + kd(243w + 102h) \qquad (22)$$

In the conventional architecture, the critical path in one 16-bit multiplier is composed of one partial product generator, one (4,2) compressor in the first layer, one full adder in the second layer, one half adder in the third layer and one full adder in the multiplier CPA layer.

The delay of the multiplier is given by:

$$t_{mul} = t_{PPG} + t_{42} + t_{FA} + t_{HA} + t_{FA} = 8t_{XOR} + t_{AND} \qquad (23)$$

The critical path in the intermediate product accumulator includes three layers of (4,2) compressors and 31 full adders in the CPA layer. Its delay is:

$$t_{PA} = 3t_{42} + 31t'_{FA} = 9t_{XOR} + 62t_{NAND} \qquad (24)$$

From Fig. 12, the interconnection delay can be computed as:

$$L_{max} = L_{1-2} + L_{2-3} + L_{3-CPA1} + L_{CPA1-i} + L_{i-ii} + L_{ii-iii} + L_{iii-CPA2} \qquad (25)$$

where $L_{1-2} = 3w$; $L_{2-3} = 3 \times 3w = 9w$; $L_{3-CPA1} = 0$; $L_{CPA1-i} = (18 \times 3 + 3 \times 4 + 159)w = 225w$; $L_{i-ii} = 0.5 \times 33 \times 4h = 66h$; $L_{ii-iii} = 33 \times 4h = 132h$; and $L_{iii-CPA2} = 0$.

Substituting these values into (25), we have:

$$L_{max} = 237w + 198h \qquad (26)$$

Therefore, the worst-case delay of the conventional architecture is:

$$T_{conventional} = t_{mul} + t_{PA} + t_{wire} = 17t_{XOR} + 62t_{NAND} + t_{AND} + kd(237w + 198h) \qquad (27)$$

For a 0.18 μm CMOS technology, the typical size of a transistor is about 3 μm in height by 1.5 μm in width. The interconnect delay ratio of the proposed architecture to the conventional architecture is:

$$\frac{243 \times 1.5 + 102 \times 3}{237 \times 1.5 + 198 \times 3} = 70.6\%$$

a saving of 29.4% on the delay contributed by the interconnecting wires.
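The same ratio in a few lines of hypothetical Python, using the critical-path wire lengths from (22) and (27) and the 0.18 μm transistor dimensions assumed above (h = 3 μm, w = 1.5 μm):

```python
# Critical-path interconnect length, converted to micrometres.
h, w = 3.0, 1.5                      # average transistor height/width in um
proposed     = 243 * w + 102 * h     # from eq. (22): 243w + 102h
conventional = 237 * w + 198 * h     # from eq. (27): 237w + 198h
ratio = proposed / conventional
print("%.1f um vs %.1f um, ratio %.1f%% (saving %.1f%%)"
      % (proposed, conventional, 100 * ratio, 100 * (1 - ratio)))
# 670.5 um vs 949.5 um, ratio 70.6% (saving 29.4%)
```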

Table 2: Estimation of worst-case delay

                        Proposed                           Conventional
Gate delay              18t_XOR + 60t_NAND + t_AND         17t_XOR + 62t_NAND + t_AND
Interconnecting delay   kd(243w + 102h)                    kd(237w + 198h)

Fig. 13 Layout of the proposed architecture


The results summarised in Table 2 show that although the gate delay is almost the same for the two architectures, the interconnect delay varies significantly. It should be noted that the good floor planning alternative and architecture in this case is an outcome of the algorithm introduced in Section 3.

7 Pre- and post-layout simulation results of the proposed architecture

Figure 13 shows the layout of the proposed architecture. It occupies a square silicon area of only about 1300 × 1300 μm. The utilisation efficiency of the silicon area is conspicuously high. Based on the estimated area occupancy given in Table 1, the proposed architecture covers an area of 320h × 708w = 960 × 1062 μm, which is close to the physical layout. The discrepancy is primarily due to the overhead of the global bus lines, which feed the operand bits into the PPG.

Three types of simulation of the two circuit architectures are made, namely the pre-layout circuit simulations of the conventional and the proposed architectures, and the post-layout simulation of the proposed circuit. The circuit simulator used is Synopsys Powermill version 5.3. All the circuits are simulated at supply voltages ranging from 0.7 to 3.3 V under the latest Chartered CSM 0.18 μm CMOS technology. For pre-layout simulations, the architecture (high level structure) of the circuit is described in Verilog hardware description language and its gate level and transistor level (low level circuit) descriptions are written in a SPICE format. For the post-layout simulation, the netlists, together with their parasitic parameters, are extracted from the layout of the circuit. For a fair comparison, both architectures are constructed from the same elementary cells and simulated with the same model file. The transistor models are from the CSM 0.18 μm process library file revision 1G for Star-HSPICE level 53. The resolution of the simulation time is 0.01 ns. The frequency of computation (the rate at which data are input for simulation) is 50 MHz for supply voltages higher than 1 V, and 10 MHz for supply voltages below or equal to 1 V. The 1024 input data patterns are generated randomly using Matlab. All circuits are fed with the same input data stimuli. The average power dissipation at each voltage is determined from the measured power supply current averaged over the entire sequence of data inputs. The worst-case propagation delay is taken to be the longest delay from input to output among all computations. The power efficiency is defined as the product of the average power dissipation and the worst-case delay.

The final implementations of the conventional and proposed vector multiplier circuits use 189 736 and 182 990 transistors, respectively. The difference is due to the algorithmic and architectural improvement, as both circuits use the same fundamental adder and compressor cells. The power dissipation, worst-case delay and power efficiency at different voltages are measured and shown in Table 3. The pre-layout simulation results of the conventional circuit and the proposed circuit are denoted as I and II respectively, and the post-layout simulation results of the proposed circuit are denoted as III.
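Since the power efficiency in Table 3 is defined above as the product of the average power and the worst-case delay, each entry can be reproduced directly from the other two rows; a hypothetical Python sketch for the 1.8 V post-layout column (III):

```python
# Power efficiency = average power x worst-case delay (energy per operation).
power_mw, delay_ns = 64.96, 6.919          # simulation III at 1.8 V, Table 3
energy_pj = power_mw * 1e-3 * delay_ns * 1e-9 * 1e12
print("%.1f pJ" % energy_pj)               # 449.5 pJ, matching Table 3
```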

To highlight the differences between the pre- and post-layout results of our proposed circuit and its performance gain over the conventional implementation, the figures of merit for the three simulations are plotted against supply voltage in Figs. 14 to 16. Since the power dissipation is proportional to the simulation frequency and the worst-case delay is unaffected by the data rate, to account for the different simulation frequencies used for the two supply voltage ranges, the average power and power efficiency are proportionally scaled down by a factor of 5 when the supply voltage is higher than 1.0 V.

The degradation in power dissipation and critical delay of simulation III compared to those of simulation II is expected because of the dominance of the interconnect coupling capacitances and RC parasitic capacitances in advanced submicron technology. However, the severity of the degradation is far lower than anticipated due to the improved layout offered by our interconnect-centric design methodology. In fact, the post-layout circuit of our proposed design has remarkably outperformed the pre-layout conventional circuit in terms of both power dissipation and power efficiency.

Table 3: Power, delay and power efficiency

Voltage, V                   0.7     0.8     0.9     1.0     1.2     1.5     1.8     2.5     3.3
Data rate, MHz               10      10      10      10      50      50      50      50      50
Power, mW             I      1.968   2.410   2.947   3.542   25.96   44.33   72.07   180.1   402.6
                      II     1.586   1.859   2.277   2.682   19.66   33.99   56.81   148.3   341.6
                      III    1.845   2.152   2.574   3.144   23.39   39.48   64.96   166.1   379.8
Delay, ns             I      68.17   40.63   28.74   20.85   13.11   8.949   7.226   5.080   4.202
                      II     66.29   44.18   29.18   20.30   12.91   8.947   6.756   5.005   3.937
                      III    72.82   44.21   29.70   20.62   13.33   10.00   6.919   5.162   4.259
Power efficiency, pJ  I      134.2   97.92   84.70   73.85   340.3   396.7   520.8   914.9   1692
                      II     105.1   82.13   66.44   54.44   253.8   304.1   383.8   742.2   1345
                      III    134.3   95.14   76.45   64.83   311.7   394.8   449.5   857.4   1618

Fig. 14 Comparison of the power dissipation for the three simulations


The worst-case delay of our post-layout circuit is comparable to that of the pre-layout circuit of the conventional architecture, which omits the wire capacitances. Based on the layout of our proposed circuit, it would probably take more than four man-months to complete the full custom layout of the conventional design with its irregular interconnections. Since the post-layout results of our actual circuit have already exceeded the pre-layout results of the conventional design, the physical layout of the conventional design was not conducted. If the parasitic parameters of the actual circuit of the conventional design were available, it is believed that the performance gain of our circuits would be more salient.

8 Conclusions

An algorithm for the design of a VLSI circuit for scalar product evaluation has been presented. The algorithm has produced a full bit-parallel architecture for a scalar product macrocell featuring a low interconnect complexity, an improved power efficiency and a highly efficient VLSI area utilisation. More importantly, the layout regularity and scalability enhance its performance superiority in the deep submicron regime well above that of a conventional VLSI design for a vector processing unit for scalar product multiplication. The arithmetic core of the macrocell consists of a partial product generator, a partial product accumulator and a vector accumulator. The floor planning of the proposed architecture exploits the binary data locality across the border between the multiplication and accumulation operations based on a full combinational logic implementation. A comparison with the layout produced by a conventional vector multiplier shows that our proposed decomposition algorithm has led to a more compact, regular and modular physical design. A theoretical model for estimating the area and delay has been formulated. Compared with a conventional architecture with the same capacity, the estimation shows that our design of a 16-bit scalar product multiplier with input vectors of 16 elements achieves a saving of 38.6% in silicon area, an up to 73% increase in area usage efficiency and a 29.4% saving in interconnect delay. The overall performance in terms of the average power consumption, the worst-case delay and the power efficiency of our post-layout circuit surpasses that of the pre-layout circuit using a conventional architecture when these circuits are simulated using Synopsys Powermill and HSPICE over supply voltages ranging from 0.7 to 3.3 V based on 0.18 μm CMOS technology. The relatively small deviations between the pre- and post-layout simulation results validate the inference from the theoretical estimation that the key contributors to the delay and power reduction of our proposed architecture are the shorter and balanced global and local interconnecting wires, a dominant factor in design considerations for VLSI circuits fabricated using deep submicron technology.

9 References

1 Nayak, S.S., and Meher, P.K.: 'High throughput VLSI implementation of discrete orthogonal transformation using bit-level vector-matrix multiplier', IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 1999, 46, (5), pp. 655–658
2 Okamoto, F., Hagihara, Y., Ohkubo, C., Nishi, N., Yamada, H., and Enomoto, T.: 'A 200-MFLOPS 100-MHz 64-b BiCMOS vector-pipelined processor (VPP) ULSI', IEEE J. Solid-State Circuits, 1991, 26, (12), pp. 1885–1893
3 Parhami, B.: 'Computer Arithmetic, Algorithms and Hardware Designs' (Oxford University Press, New York, 2000)
4 Grgic, S., Grgic, M., and Zovko-Cihlar, B.: 'Performance analysis of image compression using wavelets', IEEE Trans. Ind. Electron., 2001, 48, (3), pp. 682–695
5 Hasan, Y.M., Karam, L.J., Falkinburg, M., Helwig, A., and Ronning, M.: 'Canonic signed digit Chebyshev FIR filter design', IEEE Signal Process. Lett., 2001, 8, (6), pp. 167–169
6 Muhammad, K.: 'Speed, power, area and latency tradeoffs in adaptive FIR filtering for PRML read channels', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2001, 9, (1), pp. 42–51
7 Wong, C.S.H., Rudell, J.C., Uehara, G., and Gray, P.R.: 'A 50 MHz eight-tap adaptive equalizer for partial-response channels', IEEE J. Solid-State Circuits, 1995, 30, (3), pp. 228–234
8 Shanbhag, N.R., and Parhi, K.K.: 'Relaxed look-ahead pipelined LMS adaptive filters and their application to ADPCM coder', IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 1993, 40, (12), pp. 753–766
9 Gunn, J.E., Barron, K., and Ruczczyk, W.: 'A low-power DSP core-based software radio architecture', IEEE J. Sel. Areas Commun., 1999, 17, (4), pp. 574–590
10 Vai, M.M.: 'VLSI Design' (CRC Press, Boca Raton, FL, 2001)
11 Gu, J., Chang, C., and Yeo, K.: 'An interconnect optimized floorplanning of a scalar product macrocell'. Proc. IEEE Int. Symp. on Circuits Syst., Scottsdale, AZ, 26–29 May 2002, 1, pp. 465–468
12 Breveglieri, L., and Dadda, L.: 'A VLSI inner product macrocell', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1998, 6, (2), pp. 292–298
13 Dadda, L.: 'Fast serial input serial output pipelined inner product units', Internal Rep. 87-031, Dep. Elec. Eng. Inform. Sci., Politecnico di Milano, Milano, Italy, 1987
14 Lin, R.: 'Reconfigurable parallel inner product processor architectures', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2001, 9, (2), pp. 261–272
15 Wallace, C.S.: 'A suggestion for fast multipliers', IEEE Trans. Comput., 1964, 13, pp. 14–17
16 Swartzlander, E.E., Jr.: 'Merged arithmetic', IEEE Trans. Comput., 1980, 29, pp. 946–950
17 Feiste, K.A., and Swartzlander, E.E., Jr.: 'Merged arithmetic revisited'. Proc. Workshop on Signal Processing Systems, Leicester, UK, 3–5 November 1997, pp. 212–221

Fig. 15 Comparison of the worst-case delay for the three simulations

Fig. 16 Comparison of the power efficiency for the three simulations


18 Choe, G., and Swartzlander, E.E., Jr.: 'Complexity of merged two's complement multiplier-adders'. Proc. 42nd Midwest Symp. on Circuits and Systems, Las Cruces, NM, 8–11 August 2000, 1, pp. 384–387
19 Wang, Z., Jullien, G.A., and Miller, W.C.: 'A new design technique for column compression multipliers', IEEE Trans. Comput., 1995, 44, (8), pp. 962–970
20 Katkoori, S., and Alupoaei, S.: 'RT-level interconnect optimisation in DSM regime'. Proc. IEEE Computer Society Workshop on VLSI, Orlando, FL, 27–28 April 2000, pp. 143–148
21 Sylvester, D., and Keutzer, K.: 'Getting to the bottom of deep submicron'. Proc. Int. Conf. on Computer-aided Design, San Jose, CA, 8–12 November 1998, pp. 203–211
22 Nannarelli, A., and Lang, T.: 'Low-power divider', IEEE Trans. Comput., 1999, 48, (1), pp. 2–14
23 Dadda, L.: 'Some schemes for parallel multipliers', Alta Frequenza, 1965, 34, pp. 349–356
