[IEEE 2012 20th Iranian Conference on Electrical Engineering (ICEE) - Tehran, Iran...

20th Iranian Conference on Electrical Engineering (ICEE2012), May 15-17, 2012, Tehran, Iran

A Low-Power Low-Area Architecture Design for Distributed Arithmetic (DA) Unit

S. F. Ghamkhari, Shahed University

School of Engineering, Shahed University Shahed University of Tehran

Tehran, Iran [email protected]

M. B. Ghaznavi-Ghoushchi, Shahed University

School of Engineering, Shahed University Shahed University of Tehran

Tehran, Iran [email protected]

Abstract- In this paper an improved DA architecture is proposed. In the proposed DA, the high-power adder units are relocated in the system to lower the switching activity and the total power. The proposed DA exploits the circuit activity so that the adder units are used only in the minimum number of states. The designed DA is run-time reconfigurable. The design is verified, and 2-phase power calculations, based on a forward synthesis-invariant approach and a backward synthesis-oriented activity approach, are used to calculate the power and area of the proposed DA and all known counterparts. In the experimental results on 180 nm CMOS ASIC synthesis, a maximum clock of 180 MHz is achieved. For the 5-tap FIR filter implementation, our proposed DA with clock gating enabled improves on the best known architecture, LUT Less2, by 39.76% in dynamic power and 16.35% in area.

Keywords- Multiply-Accumulator-Circuit (MAC); Distributed Arithmetic (DA); Finite-Impulse-Response (FIR) Filter.

978-1-4673-1148-9/12/$31.00 ©2012 IEEE

I. Introduction

Digital Signal Processing (DSP) algorithms are widely used for signal conditioning on conventional and high-end Field Programmable Gate Arrays (FPGAs). The Finite Impulse Response (FIR) filter is one of the major units in DSP. FIR filters are frequently used in image, video, audio and mixed-media signal processing [1,2].

The Multiply-Accumulator-Circuit (MAC) is the main core of FIR filter implementations [2,3]. A conventional MAC-based FIR design with K taps requires K MAC units. This not only increases the design cost but also adds design complexity [1,2,3,4]. To resolve the difficulties of MAC in FIR filters, the concept of DA was introduced by Croisier. It is based on 2's complement representation and a novel bit-position reordering [1]. DA removes the multiplier, which in turn drastically reduces the area of a DA-based FIR filter compared with a MAC-based one. A DA-based FIR filter is also more suitable for low-power applications [2].

Generally, DA is implemented in one of two ways: RAM-based or ROM-based. In the ROM-based method, all states are stored in a Look-Up Table (LUT). This is called pre-computation (or pre-calculation) and has a significant impact on low-power design. ROM-based structures are therefore highly efficient in power consumption and occupied area, at the cost of fixed, predefined filter coefficients, which limits the scope of application. In the alternative RAM-based scheme, the filter coefficients are stored as the contents of a RAM unit. This allows the filter coefficients to be changed at run time for different applications, but the power consumption and chip area are higher than in ROM-based methods. The low power and compact size of ROM-based approaches on one side, and the higher degree of reconfigurability but higher power of RAM-based alternatives on the other, are the two major factors motivating researchers to reduce the gap between them.

The LUT (ROM) used in DA has an important role in filter performance. As the filter size increases, the LUT or ROM complexity (area) of the DA architecture grows exponentially. Reducing the LUT or ROM size is therefore one field of research in this subject.

Speed (operating clock frequency), power and area are the three basic parameters and performance measures in FIR filter design. This research proposes a DA architecture with improved power and area. The architecture is also reconfigurable, so that it combines ROM-based performance with RAM-based flexibility. The proposed DA architecture exploits the circuit switching activity of the most active and power-hungry units.

This paper is organized as follows. Section II gives the mathematical relationships and background concepts of MAC and DA, together with the FIR filter implementation using both. Section III presents more detailed architectures, with emphasis on power- and area-efficient design. Section IV proposes the new architecture for FIR filter implementation. Section V presents the experimental results: simulations, synthesis and performance evaluations. Finally, Section VI concludes the paper.

II. Background Concept

A. Multiply and Accumulate

In computing, especially DSP, the MAC operation is a common step that computes the product of two numbers and adds that product to an accumulator [1].

$$a \leftarrow a + (b \times c) \tag{1}$$

The function of a MAC is given by Equation (1), where b is the input, c is the coefficient and a is the accumulator output.
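The MAC step of Equation (1) can be sketched in Python as the inner operation of a direct-form FIR filter. This is only an illustrative model: the function names `mac` and `fir_mac` are hypothetical, and samples before the start of the input are assumed to be zero.

```python
def mac(acc, b, c):
    """One multiply-accumulate step, Equation (1): a <- a + (b * c)."""
    return acc + b * c


def fir_mac(coeffs, samples):
    """Direct-form FIR filter: each output takes one MAC step per tap."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:  # assumption: samples before the start are zero
                acc = mac(acc, samples[n - k], h)
        outputs.append(acc)
    return outputs
```

With K coefficients, every output sample costs K multiplications, which is the cost that DA is intended to remove.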

B. Distributed Arithmetic

DA replaces the multiplication with a mechanism that generates the partial products and then sums them. DA is basically (but not necessarily) a bit-serial computational operation. The advantage of DA is its efficiency of mechanization. A frequently argued disadvantage has been its apparent slowness because of its inherent bit-serial nature [5]. The general equation for a sum of products is as follows:

$$y = \sum_{k=1}^{K} A_k x_k = A_1 x_1 + \dots + A_K x_K \tag{2}$$

where $A_k$ are the fixed coefficients, $x_k$ are the input data words, and K is the number of inputs. Let $x_k$ be an N-bit scaled two's complement number, where $b_{k0}$ is the sign bit:

$$|x_k| < 1, \quad x_k = \{b_{k0}, \dots, b_{k(N-1)}\} \tag{3}$$

We can express $x_k$ as:

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \tag{4}$$

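Combining Equations (2) and (4) to express y in terms of the bits of $x_k$ gives:

$$y = \sum_{k=1}^{K} A_k \left( -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right) = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{k=1}^{K} \sum_{n=1}^{N-1} A_k b_{kn} 2^{-n} = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left( \sum_{k=1}^{K} A_k b_{kn} \right) 2^{-n} \tag{5}$$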

Equation (5) defines a distributed arithmetic computation. Consider the bracketed term $\sum_{k=1}^{K} A_k b_{kn}$. Because each $b_{kn}$ takes only the values 0 and 1, this term has only $2^K$ possible values. We can compute these values online (using a RAM), or precompute them and store them in a ROM. The input data bits directly address the memory, and the addressed value is added, with its binary weight, to an accumulator. After N such cycles, the accumulator contains the result [5].
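Equation (5) can be checked numerically with a small Python sketch. It is a behavioural model under stated assumptions, not the authors' hardware: the helper names (`build_lut`, `to_bits`, `da_inner_product`) are hypothetical, and each input is assumed to be an exact multiple of 2^-(N-1) so that N bits represent it without rounding.

```python
from fractions import Fraction


def build_lut(coeffs):
    """Precompute all 2**K coefficient sums; address bit k selects coeffs[k]."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def to_bits(x, n_bits):
    """Bits [b_0, ..., b_(N-1)] of Equation (4); b_0 is the sign bit.

    Assumes -1 <= x < 1 and x is an exact multiple of 2**-(N-1).
    """
    scaled = int(x * (1 << (n_bits - 1))) & ((1 << n_bits) - 1)
    return [(scaled >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def da_inner_product(coeffs, xs, n_bits):
    """Evaluate Equation (5) with one LUT access per bit position."""
    lut = build_lut(coeffs)
    bits = [to_bits(x, n_bits) for x in xs]
    y = Fraction(0)
    for n in range(n_bits):
        # The n-th bit of every input word forms the LUT address.
        addr = sum(bits[k][n] << k for k in range(len(xs)))
        if n == 0:
            y -= lut[addr]                    # sign-bit slice is subtracted
        else:
            y += Fraction(lut[addr], 2 ** n)  # weight 2**-n
    return y
```

For K inputs of N bits, the loop performs N table look-ups and N additions and no multiplications, at the cost of a table with 2^K entries.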

C. FIR Filter Implementation

DA is one of the best-known methods of FIR filter implementation. An FIR filter of length K is described as:

$$y[n] = \sum_{k=0}^{K-1} h[k] \, x[n-k] \tag{6}$$

where h[k] are the filter coefficients and x[n] is the input data [1,4]. Traditionally, a direct implementation of an N-tap FIR filter requires N MAC blocks, as shown in Fig. 1 [1].

Fig. 1. Conventional tapped-delay line FIR filter mechanization [6,8]

As shown in Fig. 2, another implementation of the FIR filter is based on the DA approach. The DA architecture includes three units: the shift register unit, the DA-LUT unit and the adder/shifter unit. In the DA-LUT unit, the filter coefficients and their sums are pre-stored and addressed by the input data bits [1,4].

[Figure: Shift Register Unit fed by the input signal, DA-LUT Unit, and adder/shifter unit with accumulator producing y[n]. Contents of the DA-LUT Unit:]

b3 b2 b1 b0    Data
0000           0
0001           h[0]
0010           h[1]
0011           h[1]+h[0]
0100           h[2]
0101           h[2]+h[0]
0110           h[2]+h[1]
0111           h[2]+h[1]+h[0]
1000           h[3]
1001           h[3]+h[0]
1010           h[3]+h[1]
1011           h[3]+h[1]+h[0]
1100           h[3]+h[2]
1101           h[3]+h[2]+h[0]
1110           h[3]+h[2]+h[1]
1111           h[3]+h[2]+h[1]+h[0]

Fig. 2. Original LUT-based DA implementation of 4-tap FIR filter [1,4]
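The three units of the LUT-based DA filter (shift register, DA-LUT, adder/shifter) can be modelled in a few lines of Python. This sketch is an assumption-laden illustration: it uses signed integer samples with weights 2^n instead of the fractional scaling of Equation (4), starts from an all-zero shift register, and the names `build_da_lut` and `da_fir` are hypothetical.

```python
def build_da_lut(h):
    """DA-LUT: the entry at address b3b2b1b0 is the sum of h[k] with b_k = 1."""
    return [sum(hk for k, hk in enumerate(h) if (addr >> k) & 1)
            for addr in range(1 << len(h))]


def da_fir(h, samples, n_bits):
    """Bit-serial DA FIR on signed integers that fit in n_bits (two's complement)."""
    lut = build_da_lut(h)
    mask = (1 << n_bits) - 1
    regs = [0] * len(h)                 # shift register unit: x[n], x[n-1], ...
    out = []
    for x in samples:
        regs = [x & mask] + regs[:-1]   # shift in the new sample
        acc = 0
        for bit in range(n_bits):       # one LUT access per bit position
            addr = sum(((r >> bit) & 1) << k for k, r in enumerate(regs))
            term = lut[addr] << bit     # shifter: weight 2**bit
            if bit == n_bits - 1:
                acc -= term             # sign-bit slice is subtracted
            else:
                acc += term
        out.append(acc)
    return out
```

The output matches the direct convolution of Equation (6) for inputs within the n_bits range, while the only arithmetic inside the loop is shifting and adding.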

III. Performance Factors in DA

A. LUT Less1 DA Architecture

In the original LUT-based DA of the 4-tap FIR filter in Fig. 2, every entry in the lower half of the DA-LUT (addresses 1000 to 1111) equals the corresponding entry in the top half plus h[3]. The DA-LUT can therefore be halved: only the top half is stored, and the lower half is replaced by a 2x1 multiplexer and a Ripple Carry Adder (RCA) that add h[3] when required, as shown in Fig. 3 [1,4].

B. LUT Less2 DA Architecture

Applying the same method as in the LUT Less1 DA architecture repeatedly yields the LUT Less2 DA architecture for a 4-tap FIR filter, shown in Fig. 4. In this architecture, all the LUT units are replaced by multiplexers and adders [1,4].
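The equivalence between the full DA-LUT and the two reduced forms can be sketched in Python for a 4-tap filter. The function names are hypothetical, and the exact arrangement of adders in Fig. 3 and Fig. 4 may differ from this behavioural model; only the input/output behaviour is illustrated.

```python
def full_lut(h, addr):
    """Conventional DA-LUT entry: sum of h[k] for every set address bit k."""
    return sum(hk for k, hk in enumerate(h) if (addr >> k) & 1)


def lut_less1(h, addr):
    """Half-size LUT addressed by b2..b0, plus a 2x1 mux and an adder for h[3]."""
    half = [full_lut(h[:3], a) for a in range(8)]  # the stored half of the table
    mux_out = h[3] if (addr >> 3) & 1 else 0       # 2x1 mux: 0 or h[3]
    return half[addr & 0b111] + mux_out


def lut_less2(h, addr):
    """No stored table: one 2x1 mux per coefficient, summed by adders."""
    total = 0
    for k, hk in enumerate(h):
        total += hk if (addr >> k) & 1 else 0      # each mux outputs 0 or h[k]
    return total
```

The model makes the trade-off visible: each halving step removes memory words and adds one multiplexer and one adder.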

Fig. 3. LUT Less1 architecture for a 4-tap FIR filter [1,4]

Fig. 4. LUT Less2 DA architecture for a 4-tap FIR filter [1,4]

C. Separated LUT DA Architecture

As the filter size increases, the hardware cost of the memory in the DA architecture grows exponentially. For example, a 64-tap DA FIR filter uses a LUT of up to $2^{64}$ words. One practical approach to overcome the area complexity of the LUT in larger FIR filters is to break the filter up into smaller DA base sub-units. A T-tap filter is thus divided into k smaller filters, each a t-tap DA base unit (T = k x t), whose outputs are added together. Fig. 5 shows the implementation of a 4-tap FIR filter for k = 2 and t = 2 [2,3,7].
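The memory saving of the separated LUT approach can be illustrated with a short Python sketch. The helper names are hypothetical and the model only counts LUT words; the extra adders needed to combine the sub-unit outputs are not costed here.

```python
def lut_words(taps):
    """Words in a single DA-LUT covering all taps."""
    return 2 ** taps


def separated_lut_words(total_taps, base_taps):
    """Total words when a T-tap filter is split into k = T / t base units."""
    assert total_taps % base_taps == 0, "T must equal k * t"
    return (total_taps // base_taps) * 2 ** base_taps


def separated_lookup(h, addr, base_taps):
    """Sum the outputs of the small LUTs, each addressed by t bits of addr."""
    total = 0
    for start in range(0, len(h), base_taps):
        sub_h = h[start:start + base_taps]
        sub_addr = (addr >> start) & ((1 << base_taps) - 1)
        sub_lut = [sum(c for j, c in enumerate(sub_h) if (a >> j) & 1)
                   for a in range(1 << base_taps)]
        total += sub_lut[sub_addr]      # adder combining the sub-unit outputs
    return total
```

For example, `separated_lut_words(4, 2)` gives 8 words for the k = 2, t = 2 case, against 16 words for a single 4-tap LUT.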

IV. Proposed Distributed Arithmetic Architecture

A. Power and activity model of the RAM-based Distributed Architectures

The adder units in RAM-based DA are responsible for most of the consumed power. This is mainly because the adder units operate all the time. In other words, the adder units are used even to pass a filter coefficient straight through, without changing its value. This consumes power even in time slots where no addition takes place. Therefore, if the adder unit is active only for its main purpose (the addition of two different operands), the total power consumption of the unit is reduced. This is achieved by restructuring the LUT Less2 scheme so that the adder unit is relocated.

Fig. 5. Separated LUT DA architecture for a 4-tap FIR filter [2,3,7]

1) Activity inspection for low-power design

The outputs of the multiplexers in Fig. 6(a) are either zero or a constant filter coefficient. The output of the first multiplexer is 0 or h0, and the output of the second multiplexer is 0 or h1. The outputs of the two multiplexers are the inputs of the adder. The inputs and output of the adder unit are presented in TABLE I.

TABLE I: The input & output of the adder unit in Fig. 6(a)

                          State 1    State 2    State 3    State 4
Output of multiplexer 1   h0         h0         0          0
Output of multiplexer 2   h1         0          h1         0
Output of adder           h0+h1      h0+0=h0    h1+0=h1    0+0=0

When one or both inputs of the 8-bit adder are zero, the adder can be switched off, because no actual addition is needed. This covers states 2, 3 and 4 of TABLE I. The adder is required in only one state of TABLE I: the addition is performed only when the multiplexer outputs are h0 and h1.

This observation suggests using an array of tri-state buffers in a wired-OR configuration. In this arrangement, the adder unit is placed at the input of a tri-state buffer whose enable signal corresponds to state 1. The other three tri-state controls correspond to states 2, 3 and 4. A 2-to-4 decoder is required to generate the tri-state control signals. From a practical implementation viewpoint, the wired-OR may be replaced by a 4-to-1 multiplexer, to which the outputs of the four tri-state units are fed to generate the final output. The inputs x0 and x1 are used as the decoder inputs that generate the tri-state enable signals (TABLE II).

In the dynamic operation of the FIR filter, there is a unit delay in the shift registers. Therefore, the state x0 = 0 and x1 = 1 (T2 in TABLE III) never occurs. This helps to simplify the structure of the 2-to-4 decoder. Finally, Fig. 6(b) can be simplified to Fig. 6(c), which is more optimized in power consumption and occupied area.

TABLE II: Truth table of the 2-to-4 decoder with inputs x0 and x1 in FIR

x0   x1   q0   q1   q2   q3
0    0    0    1    1    1
0    1    1    0    1    1
1    0    1    1    0    1
1    1    1    1    1    0

TABLE III: Status of the tri-states based on the input pattern

x0   x1   Active tri-state
0    0    T3
0    1    T2
1    0    T1
1    1    T0
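The behaviour described by TABLES I to III can be summarised in a small Python model. It is a functional sketch only: the function names are hypothetical, the mapping of decoder outputs to tri-state enables inside Fig. 6(b) is not modelled, and the returned flag simply records whether a state needs a real addition.

```python
from itertools import product


def decoder_2to4(x0, x1):
    """Active-low 2-to-4 decoder of TABLE II: exactly one output is 0."""
    index = (x0 << 1) | x1
    return tuple(0 if i == index else 1 for i in range(4))


def conventional_unit(x0, x1, h0, h1):
    """Model of Fig. 6(a): two muxes always feed the adder."""
    return (h0 if x0 else 0) + (h1 if x1 else 0), True


def proposed_unit(x0, x1, h0, h1):
    """Return (output, adder_used) following TABLE I and TABLE III."""
    if x0 and x1:
        return h0 + h1, True   # T0: the only state that needs the adder
    if x0:
        return h0, False       # T1: h0 is passed directly
    if x1:
        return h1, False       # T2: h1 is passed directly
    return 0, False            # T3: the output is zero


def adder_states(unit, h0, h1):
    """Count the input patterns in which the unit relies on the adder."""
    return sum(unit(x0, x1, h0, h1)[1] for x0, x1 in product((0, 1), repeat=2))
```

Both units produce the same output for every input pattern, but the proposed unit needs the adder in one of the four patterns instead of all four.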

2) Power activity in model of LUT in LUT Less2 DA Architecture and Low Power unit with adder re-placed in it

The transition density D(y) of the adder in the architectures of Fig. 6(a) and 6(b) is given by Equation (7):

$$D(y) = \sum_{\{a,b\}} D(\{a,b\}) \cdot P\!\left( \frac{\partial y}{\partial \{a,b\}} \right) \tag{7}$$

where $D(\{a,b\})$ is the transition density of the bit vector $\{a,b\}$ and $\partial y / \partial \{a,b\}$ is the Boolean difference of the output with respect to the input vector $\{a,b\}$. The time variation of the adder inputs in the model of Fig. 6(a), given by Equation (8), includes four parts:

$$\{a,b\} = \{x_0 \cdot h_0,\; x_0 \cdot 0,\; x_1 \cdot h_1,\; x_1 \cdot 0\} \tag{8}$$

In the conventional DA, all four parts contribute to the real power consumption of the system. However, not all of these parts necessarily change the output of the adder unit. The required variations of the adder inputs in the model of Fig. 6(c) include only two parts, according to Equation (9):

$$\{a,b\} = \{x_0 \cdot h_0,\; x_1 \cdot h_1\} \tag{9}$$

So the model of Fig. 6(c) has the lower transition density.

B. Proposed DA-LUT unit

The activity-based model of the adder and multiplexers described above is applied to units 1 and 2 of Fig. 7(a). The result of this modification is illustrated in Fig. 7(b). This is our first level of low-power, low-area optimization of the DA architecture.

Fig. 6. (a) Model of LUT in LUT Less2 DA architecture. (b) Gate-level implementation of part (a) plus decoder. (c) Low-power unit with the adder relocated

Fig. 7. (a) Model of LUT for LUT Less2 DA architecture for 4-tap FIR filter. (b) Implementation of the proposed unit for the LUT block of 4-tap FIR filter

After the proposed method is applied to both the first and second adders, it is applied to the third adder (dashed box in Fig. 7(b)). The result of this step is depicted in Fig. 8. This is our final proposed low-power, low-area DA architecture.

Fig. 8. Proposed LUT for 4-tap FIR filter

V. Experimental Results

To compare the performance of the DA architectures described in Section III, the FIR filter code is first written in Verilog HDL for all of the architectures, and synthesis is then done with Design Compiler. For this purpose, 3-, 4- and 5-tap FIR filters are implemented with the Conventional DA, LUT Less1, LUT Less2, Separated LUT and Proposed DA architectures. The simulation result of one of the implemented filters (Proposed DA) is shown in Fig. 9.

In the next step, the ASIC implementation of the FIR filters is done in the target technology of 180 nm CMOS. The synthesis for MAC and DA is performed at a frequency of 180 MHz. In addition to synthesis in normal mode, the clock gating technique is applied. In both synthesis cases, the proposed method leads to lower power consumption. The area and power reports from synthesis are given in Sections V.A and V.B.

Fig. 9. Simulation result of Proposed DA

A. Area Reports

The area reports from synthesis for each DA architecture are shown in Fig. 10. As can be seen in Fig. 10, DA is better than MAC in terms of area. The methods proposed for reducing the DA memory size also decrease the area.

Fig. 10. Area report in synthesis (panels: without CLK gating and with CLK gating; horizontal axis: filter order)

B. Total Dynamic Power Reports

The power reports from synthesis are shown in Fig. 11. As the figure shows, the power consumed by MAC is higher than that of DA. Also, among the optimization methods for DA, the proposed DA method has lower dynamic power both in normal mode and with clock gating.

C. Comparison between ROM-based and Proposed DA

Finally, this section compares the ROM-based and proposed DA with the results of MAC and LUT Less2 (the best available architecture until now). As shown below, the proposed DA is very close to ROM-based DA in area and power consumption. Our results are reported in TABLE IV, TABLE V and Fig. 12.

Fig. 11. Total dynamic power report in synthesis (panels: without CLK gating and with CLK gating; horizontal axis: filter order)

TABLE IV: Comparison of occupied area (µm²)

FIR filter implementation   3-tap       4-tap       5-tap
MAC                         742251      962384      1160084
LUT Less2 DA                287118      312884      341033
Proposed DA                 228959.9    258415.5    285283.9
ROM-based DA                218453.5    222292.6    226131.6

TABLE V: Comparison of total dynamic power (mW)

FIR filter implementation   3-tap     4-tap     5-tap
MAC                         3.6633    4.5070    5.1112
LUT Less2 DA                2.9534    3.2648    3.6017
Proposed DA                 1.8239    2.0236    2.1695
ROM-based DA                1.8794    1.9765    2.0735
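The improvement percentages quoted in the abstract can be reproduced from the 5-tap columns of TABLE IV and TABLE V. The short Python check below assumes that the quoted figures compare the Proposed DA against LUT Less2 DA; the function name `improvement_percent` is illustrative.

```python
def improvement_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100


# 5-tap values copied from TABLE V (mW) and TABLE IV (square micrometres).
power_gain = improvement_percent(3.6017, 2.1695)
area_gain = improvement_percent(341033, 285283.9)

print(f"dynamic power improvement: {power_gain:.2f}%")
print(f"area improvement: {area_gain:.2f}%")
```

Rounded to two decimals, the results are 39.76% for dynamic power and 16.35% for area, in agreement with the figures reported in the paper.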

Fig. 12. Comparison of area and power between the results of existing methods to implement the FIR filter (legend: MAC [3]; best available architecture DA (LUT Less2) [1,4]; proposed DA with CLK gating; ROM-based DA [5]; horizontal axis: filter order)

VI. Conclusions and Future Works

MAC and DA are widely used in digital signal processing and filter design. Distributed Arithmetic offers better area and power performance than MAC. In this paper an improved DA architecture is proposed. In our proposed DA, the switching activity of the adder units is investigated, and the high-power adder units are relocated in the system to lower the switching activity and total power. This does not alter the number of adder units in the circuit, but the unrelated and idle inputs to the adder, which are dominant factors in power consumption, are moved out of the adder path. The proposed DA exploits the circuit activity so that the adder units are used only in the minimum number of states. Our distributed arithmetic unit has run-time coefficient reconfigurability, which allows the target filter functionality to be changed during idle intervals of the working system. This is in contrast to conventional ROM-stored precomputed designs, whose major disadvantage is their fixed coefficients. The target architecture is implemented, verified and simulated with Verilog HDL. The ASIC implementation of the proposed DA in the target technology of 180 nm CMOS is then carried out successfully. Exact functional simulation and 2-phase power calculations, based on a forward synthesis-invariant approach and a backward synthesis-oriented activity approach, are used to calculate the power and area of the proposed DA and all known counterparts. Our experimental results confirm the power and area efficiency of the proposed design versus the RAM-based counterparts. In the experimental results on 180 nm CMOS ASIC synthesis, a maximum clock of 180 MHz is achieved. For the 5-tap FIR filter implementation, our proposed DA with clock gating enabled improves on the best known architecture, LUT Less2, by 39.76% in dynamic power and 16.35% in area.

REFERENCES

[1] W. Sen, T. Bin, Z. Jun, "Distributed Arithmetic for FIR Filter Design on FPGA," International Conference on Communications, Circuits and Systems, October 2007, pp. 620-623

[2] A. M. Al-Haj, "An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform," vol. 29, pp. 241-247, February 2004

[3] N. S. Pal, H. P. Singh, R. K. Sarin, S. Singh, "Implementation of High Speed FIR Filter using Serial and Parallel Distributed Arithmetic Algorithm," International Journal of Computer Applications, vol. 25, no. 7, pp. 26-32, July 2011

[4] Heejong Yoo, David V. Anderson, "Hardware-Efficient Distributed Arithmetic Architecture For High-order Digital Filters," IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5, pp. 125-128, March 2005

[5] S. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Magazine, vol. 6, no. 3, pp. 4-19, 1989

[6] C. Singh, B. Singh, "Design of High Performance Modified Wave pipelined DAA Filter with Critical Path Approach," International Journal of Electrical and Electronics Engineering (IJEEE), vol. 1, Issue-II, ISSN (PRINT): 2231-5284, 2011

[7] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "LMS adaptive filters using distributed arithmetic for high throughput," IEEE Transactions on Circuits and Systems, vol. 52, no. 7, pp. 1327-1337, 2005

[8] C. Xu, F. Dong, S. Member, "Distributed Arithmetic FIR Filter for Electrical Resistance Tomography System," Control and Decision Conference (CCDC), 2011, pp. 3548-3553
