Hindawi Publishing Corporation
VLSI Design
Volume 2014, Article ID 280701, 18 pages
http://dx.doi.org/10.1155/2014/280701

Research Article
Design of Synthesizable, Retimed Digital Filters Using FPGA Based Path Solvers with MCM Approach: Comparison and CAD Tool

Deepa Yagain and A. Vijaya Krishna

Department of ECE, PESIT, Bangalore 560085, India

Correspondence should be addressed to Deepa Yagain; [email protected]

Received 6 March 2014; Accepted 15 May 2014; Published 24 July 2014

Academic Editor: Jose Silva-Martinez

Copyright © 2014 D. Yagain and A. Vijaya Krishna. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Retiming is a transformation that can be applied to digital filter blocks to increase the clock frequency. This transformation requires computation of the critical path and the shortest path at various stages. In the literature, this problem is addressed at multiple points. However, very little attention is given to the path solver blocks in the retiming transformation algorithm, which take up most of the computation time. In this paper, we address the problem of optimizing the speed of path solvers in retiming transformation by introducing high level synthesis of path solver algorithm architectures on FPGA and a computer aided design tool. Filters have adders, multipliers, and delay elements as their combinational blocks. Avoiding costly multipliers is highly desirable in filter hardware implementation. This can be achieved efficiently by using the multiplierless MCM technique. In the present work, retiming, which is a high level synthesis optimization method, is combined with multiplierless filter implementations using the MCM algorithm. It is seen that retiming multiplierless designs gives better performance in terms of operating frequency. This paper also compares various retiming techniques for multiplierless digital filter design with respect to VLSI performance metrics such as area, speed, and power.

1. Introduction

High level synthesis is the process of converting a behavioral description or an algorithm to a structural level specification. In a behavioral description or an algorithm, the input and output behavior is described in terms of data transfers and operations without any implementation details. The structural description maps these data transfers and operations onto combinational functional units and registers in hardware. High level synthesis of DSP algorithms is very useful as it shortens the time to market. Various optimization methods are available in the literature for sequential synthesis [1]. Though synthesis of combinational logic has attained a significant level of maturity, sequential circuit synthesis has been lagging behind [2] in terms of frequency performance. DSP algorithms are repetitive, and iterations must be repeated periodically to execute the computations [3]. Here, the iteration period is the minimum time needed for computation, and it is limited by the critical path. The critical path can be altered by redistributing the delays such that functionality is preserved. The retiming algorithm [4] is used to redistribute the delays without altering [5] the functionality.

A great amount of research has been done on retiming [5, 6]. Retiming is a valuable optimization technique for digital filters, which can be represented as data flow graphs (DFGs). Efficient filter systems are needed to decrease the overall computation time since scientific applications can be recursive, nonrecursive, and iterative. The retiming transformation, along with other high level transforms such as the multiple constant multiplication approach for filters, aids in reducing the area of the filter circuit and, most importantly, decreases the clock period. Critical path and shortest path computations consume most of the time in retiming computation. Retiming minimizes the overall clock period, thereby increasing the clock frequency, by reducing the filter critical path. In a general purpose processor, where the actual retiming vectors are computed for digital filters, the speed with which the retiming transformation is performed suffers since the entire transformation code is written as software. Hence, an FPGA based path solver architecture is designed in this paper, which addresses the frequency issue in retiming and reduces the burden on general purpose processors.

A computer aided design (CAD) tool framework called DiFiDOT is developed, which generates synthesizable hardware descriptions of a chosen digital filter under user-specified constraints such as area, speed, and power. Since digital filters are composed of adders/subtractors, multipliers, and delay elements, DiFiDOT picks the best choice of adders and multipliers as per the user's design constraints. Multiplication is expensive in terms of area, power, and delay. Exchanging multipliers with adders is advantageous because adders weigh less than multipliers in terms of silicon area [7]. Since the coefficients to be multiplied are known beforehand, the full flexibility of a multiplier is not necessary in the design. So a multiplierless digital filter design is proposed under the multiple constant multiplications architecture, and an option is included in DiFiDOT for generating multiplierless hardware descriptions. This significantly reduces the area of filters when compared to those designed using multiplier blocks. Here, the concept of sharing partial terms in multiple constant multiplications (MCMs) [8] is used, which reduces area and covers all possible partial terms that may be used to generate the set of coefficients in the MCM instance. For simulations, the authors have used some of the ACM/SIGDA benchmark circuits.

2. Background

High level transformation techniques are applied to obtain optimal speed in sequential filter systems. For the designed optimization environment, the input is given as data flow graphs. This section introduces data flow graphs (DFGs) and the problem definitions, and gives an overview of previously proposed retiming transformation algorithms and their drawbacks.

2.1. Data Flow Graphs (DFGs). Digital filters are an important part of digital signal processors. Their performance is one of the reasons DSP became so popular [9]. Filters are used in audio processing, speech processing (detection, compression, and reconstruction), modems, motor control algorithms, video and image processing, and so forth. Retiming is an important step [10] in high level synthesis (HLS) of digital filters. HLS maps behavioural descriptions of algorithms to physical realizations. All digital filters, whether iterative, recursive, or nonrecursive, can be represented using data flow graphs (DFGs) [11, 12]. Any digital filter can be realized by functional blocks such as adders/subtractors and multipliers together with delay elements. A filter DFG consists of such functional blocks and the connectivity information for the data flow. The example used here is a 4th-order low pass elliptic filter block. High level transformations operate on the filter functional blocks for better performance. This can be done by changing the execution order or by altering the number of functional blocks in the critical path during the retiming [3] process. Performance can also be improved by altering the implementation architecture of the functional blocks without altering the functionality of the filter in the optimization environment. In this paper, the MCM technique is used to achieve this. For applying all these transformations, the input is given in the form of DFGs. The input DFGs can also be represented in the form of matrices for further computations. In this paper, a 4th-order low pass elliptic filter is used as an application example to explain the present work. The elliptic filter response is given by

$$|H(j\omega)|^2 = \frac{1}{1 + \epsilon^2 R_n^2(\omega, \gamma)}, \tag{1}$$

where $R_n(\omega, \gamma)$ is the $n$th order Chebyshev rational function with ripple parameter $\gamma$. Let $A_p$ be the maximum pass-band loss and let $A_s$ be the minimum stop-band loss in decibels. Depending on the need, we can define $\omega_p$ and $\omega_s$, which are the pass-band cutoff frequency and the stop-band cutoff frequency. The selectivity factor $k$ is computed using $\omega_p$ and $\omega_s$. With these parameters, the modular constant $q = u + 2u^5 + 15u^9 + 150u^{13}$ is computed, where $u$ is

$$u = \frac{1 - \sqrt[4]{1 - k^2}}{2\left(1 + \sqrt[4]{1 - k^2}\right)}. \tag{2}$$

The discrimination factor $D$ and the filter order $n$ are used to obtain $A_s$, which is given by

$$A_s = 10 \log\left(1 + \frac{10^{A_p/10} - 1}{16 q^n}\right). \tag{3}$$
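As a numerical illustration of (2) and (3), the following minimal sketch evaluates the modular constant and the resulting stop-band loss. The function names and the specification values are hypothetical, and the selectivity factor $k$ is passed in directly because the text does not give its formula.

```python
import math

def modular_constant(k: float) -> float:
    """Modular constant q from the selectivity factor k (series in u, with u from (2))."""
    root = (1.0 - k * k) ** 0.25                 # fourth root of (1 - k^2)
    u = (1.0 - root) / (2.0 * (1.0 + root))
    return u + 2 * u**5 + 15 * u**9 + 150 * u**13

def stopband_loss_db(a_p: float, k: float, n: int) -> float:
    """Minimum stop-band loss A_s in dB from (3), for pass-band loss a_p in dB."""
    q = modular_constant(k)
    return 10.0 * math.log10(1.0 + (10.0 ** (a_p / 10.0) - 1.0) / (16.0 * q**n))

# Hypothetical specification: 1 dB pass-band loss, k = 0.5, 4th-order filter.
print(stopband_loss_db(1.0, 0.5, 4))
```

Raising the order `n` with the other parameters fixed increases the returned loss, which is the trade-off the order selection exploits.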

The 4th-order elliptic filter block is shown in Figure 1(a). The typical DFG for the filter is shown in Figure 1(b). For the designed optimization environment, the filter information is given in the form of two matrices. The node-weight matrix holds the node weights, which are the computation times of the nodes of the filter graph in unit delays. The computational complexity of addition is $O(n)$, whereas for multiplication it is $O(n^2)$ for two $n$-digit numbers, so multiplication is more expensive than addition. Hence, in the present work, the time delay considered for multiplication in retiming is twice that of addition. The incidence matrix defines the edge weights between all the nodes and thus represents the connectivity information. The number of delay elements present between two computation nodes (adder and multiplier) is taken as the edge weight. If there are no delay elements and an adder or multiplier node is directly connected to another node, the edge weight is zero. The node-weight matrix and the incidence matrix are used as the inputs to the optimization environment, where high level transformations are applied to obtain performance improvement. The critical path and the shortest path of the filter are computed for retiming, which is one of the efficient optimization techniques for obtaining a filter solution with a reduced clock period, which in turn increases the filter speed. The critical path can be obtained by observing the DFG in Figure 1(b). The critical path is node 1 → 9 → 5 → 14 → 8.

Figure 1: (a) Block diagram of 4th-order low pass elliptic filter. (b) DFG of the elliptic filter block.
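The matrix representation and the critical path search described above can be sketched as follows. This is an illustrative sketch, not the authors' tool: the 4-node DFG is hypothetical (it is not the elliptic filter of Figure 1), the node times follow the adder = 1, multiplier = 2 convention of the text, and the critical path is taken as the longest path over zero-delay edges.

```python
from collections import deque

# Hypothetical DFG: nodes 0 and 1 are adders, nodes 2 and 3 are multipliers.
node_time = [1, 1, 2, 2]                  # node-weight matrix (time units)
n = len(node_time)
# Edge-weight matrix: entry [u][v] is the number of delays on edge u -> v, None if no edge.
edge_delay = [[None] * n for _ in range(n)]
for u, v, delays in [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]:
    edge_delay[u][v] = delays

def critical_path(node_time, edge_delay):
    """Longest computation time along any path made only of zero-delay edges."""
    n = len(node_time)
    succ = [[v for v in range(n) if edge_delay[u][v] == 0] for u in range(n)]
    indegree = [0] * n
    for u in range(n):
        for v in succ[u]:
            indegree[v] += 1
    best = list(node_time)                # best[v]: longest zero-delay path ending at v
    pred = [None] * n
    queue = deque(v for v in range(n) if indegree[v] == 0)
    while queue:                          # topological order over the zero-delay edges
        u = queue.popleft()
        for v in succ[u]:
            if best[u] + node_time[v] > best[v]:
                best[v] = best[u] + node_time[v]
                pred[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    end = max(range(n), key=lambda v: best[v])
    path, node = [], end
    while node is not None:               # walk back along the recorded predecessors
        path.append(node)
        node = pred[node]
    return best[end], path[::-1]

length, path = critical_path(node_time, edge_delay)
print(length, path)
```

The sketch assumes the zero-delay edges form no cycle, which holds for any realizable DFG since every loop must contain at least one delay element.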

Retiming Transformation. Retiming is a high level transformation technique in which the location of the registers is altered in such a way that the overall clock period reduces, thereby increasing the clock frequency [13]. This happens due to the reduction in the critical path, which bounds the speed of the design. Due to intelligent placement of registers, the clock period is minimised without altering the filter functionality. The critical path is the longest computation path among the computational elements [14] that contains no delay elements. The critical path can also be minimized by inserting delay elements on the primary inputs of the filter circuit and retiming the circuit. This is called the automatic pipelining technique. Both methods are used to find the best solution in the present work. Retiming for filter optimization is found to be an NP complete problem, and the time to find the solution increases as the problem size increases. There are two ways of applying the retiming transformation:

(i) retiming using the clock period minimization method;
(ii) retiming using the register minimization method.

The retiming algorithm for clock period minimization is efficient in terms of clock frequency improvement. Its computational complexity is $O(n^3 \log n)$, where $n$ is the number of nodes, that is, computation elements such as adders and multipliers. The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also gives a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from its original weight and the retiming values of the two nodes connected by this edge.

(i) Calculate $M = t_{\max}\, n$, where $n$ is the number of nodes in the original DFG $G$ and $t_{\max}$ is the maximum computation time of all the nodes in the DFG. Also compute the critical path, which defines the required clock period of the original graph.

(ii) Create a new DFG $G^*$ from $G$. $G^*$ has the same nodes and edges as $G$. For each edge $e$ in $G^*$, the edge weight is $W^*(e) = M\,W(e) - t(U)$, where $W(e)$ is the weight of the same edge in $G$ and $t(U)$ is the computation time of the node initiating this edge.

(iii) Apply the Floyd-Warshall shortest path algorithm to $G^*$ to compute $S^*_{UV}$, which represents the shortest path from node $U$ to node $V$.

(iv) From $S^*_{UV}$, calculate $W_{UV}$ and $D_{UV}$. If $U \neq V$, then $W_{UV} = \lceil S^*_{UV}/M \rceil$ and $D_{UV} = M\,W_{UV} - S^*_{UV} + t(V)$. If $U = V$, then $W_{UV} = 0$ and $D_{UV} = t(U)$. Here $t(U)$ and $t(V)$ represent the computation times of node $U$ and node $V$, respectively.

(v) Find the maximum and the minimum values of $D_{UV}$. Check all the possible clock periods one by one, starting from the maximum value of $D_{UV}$ down to the minimum value of $D_{UV}$. When a clock period that gives a feasible solution is found, stop and obtain the minimal clock period by solving for the critical path of the solution. The solution contains the retiming values for all nodes.

Here, a 4th-order low pass IIR elliptic filter is designed and retimed using the clock period minimization algorithm described above. The retimed data flow graph with reduced clock period is shown in Figure 2.
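Steps (i) to (iv) can be sketched as below. This is a minimal sketch under stated assumptions: the DFG is a hypothetical 4-node example rather than the elliptic filter, the function name `compute_w_d` is illustrative, and the ceiling in step (iv) and $W_{UV} = 0$ for $U = V$ follow the standard form of this algorithm where the extracted text is ambiguous.

```python
import math

def compute_w_d(node_time, edges):
    """Steps (i)-(iv): return the W and D matrices of a DFG.

    node_time[u] is the computation time of node u; edges holds (u, v, delays).
    Entries stay None for node pairs that are not connected by any path.
    """
    n = len(node_time)
    m = max(node_time) * n                          # step (i): M = t_max * n
    s = [[math.inf] * n for _ in range(n)]
    for u, v, delays in edges:                      # step (ii): W*(e) = M*W(e) - t(U)
        s[u][v] = min(s[u][v], m * delays - node_time[u])
    for k in range(n):                              # step (iii): Floyd-Warshall on G*
        for i in range(n):
            for j in range(n):
                s[i][j] = min(s[i][j], s[i][k] + s[k][j])
    w = [[None] * n for _ in range(n)]
    d = [[None] * n for _ in range(n)]
    for u in range(n):                              # step (iv)
        for v in range(n):
            if u == v:
                w[u][v], d[u][v] = 0, node_time[u]
            elif s[u][v] < math.inf:
                w[u][v] = -(-s[u][v] // m)          # integer ceiling of S*/M
                d[u][v] = m * w[u][v] - s[u][v] + node_time[v]
    return w, d

# Hypothetical DFG: nodes 0, 1 are adders (1 time unit), nodes 2, 3 are multipliers (2).
node_time = [1, 1, 2, 2]
edges = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]
w, d = compute_w_d(node_time, edges)
# Step (v) examines the distinct entries of D as candidate clock periods.
candidate_periods = sorted({x for row in d for x in row if x is not None})
print(candidate_periods)
```

Each candidate period would then be tested for feasibility against the critical path and feasibility constraints; that test is left out of this sketch.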

After applying the retiming transformation to the filter, the critical path changes to 2 → 1 → 9.

Since each delay element occupies about one-third of the area of a binary adder, it is important to reduce the number of delay elements [11]. With retiming using register minimization, we can obtain a digital filter that uses the minimum number of registers and satisfies the clock period constraints [15]. Here, forward splitting or register sharing [12] is used. If a node has several output edges carrying the same signal, the number of registers required to implement these edges is the maximum number of registers on any one of the edges. Consider Figure 3. The number of registers required in Figure 3(a) is 6, whereas after register sharing this is reduced to 3, as shown in Figure 3(b).

The number of registers $R_V$ needed to implement the output edges $e$ of a node $V$ in the retimed graph, where $W_r(e)$ is the retimed edge weight, and the total cost are

$$R_V = \max_{e}\,\bigl(W_r(e)\bigr), \qquad \text{Cost} = \sum R_V. \tag{4}$$

Here $r(\cdot)$ denotes the retiming value of a node, and the constraints follow the usual convention $W_r(e) = W(e) + r(V) - r(U)$ for an edge $U \xrightarrow{e} V$. The cost is minimized subject to

(i) fan-out constraints: $R_V \ge W_r(e)$ for all $V$ and all edges $V \xrightarrow{e}$ any other vertex;

(ii) feasibility constraints: $r(U) - r(V) \le W(e)$ for every edge $U \xrightarrow{e} V$;

(iii) clock period constraints: $r(U) - r(V) \le W(U, V) - 1$ for all vertices $U$, $V$ such that $D(U, V) > c$, where $c$ is the clock period.
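The cost in (4) can be evaluated for a given retiming as in the sketch below. It assumes the usual retiming convention $W_r(e) = W(e) + r(V) - r(U)$ for an edge $U \to V$, which is an assumption about the sign convention, and the function name and the fan-out example (one node driving three edges with 2, 3, and 1 registers, as in Figure 3) are illustrative.

```python
def register_cost(edges, r):
    """Return (unshared, shared) register counts of a retimed DFG.

    edges holds (u, v, delays) for every edge u -> v; r[u] is the retiming value of u.
    The shared count is the cost of (4): one R_V per node, equal to the largest
    retimed weight among that node's output edges.
    """
    unshared = 0
    fanout_max = {}
    for u, v, delays in edges:
        w_r = delays + r[v] - r[u]            # assumed convention for the retimed weight
        if w_r < 0:
            raise ValueError("retiming violates the feasibility constraints")
        unshared += w_r
        fanout_max[u] = max(fanout_max.get(u, 0), w_r)
    return unshared, sum(fanout_max.values())

# Node 0 drives three edges carrying 2, 3, and 1 registers; no retiming applied.
fanout_edges = [(0, 1, 2), (0, 2, 3), (0, 3, 1)]
print(register_cost(fanout_edges, [0, 0, 0, 0]))
```

With register sharing the three edges tap a single chain of registers, so only the longest chain is counted.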

This method makes use of gadgets to represent the nodes with multiple output edges. The register minimization retiming can be modeled as a linear programming problem. A dummy node with zero computation time is introduced in the gadget. The weight of the gadget edge $\hat{e}_i$ is defined to be $W(\hat{e}_i) = W_{\max} - W(e_i)$, where $W_{\max} = \max(W(e_i))$ for $1 \le i \le k$ and $k$ is the number of output edges available. A parameter $\beta$, the breadth, is also associated with each edge to model the memory required by edge $e_i$. The breadth of each edge is the inverse of $k$. A binary search is performed for the clock period. The register minimization retiming values can be obtained as follows.

(i) Use the gadget model of the graph to compute the cost function.

(ii) Calculate $S'$ by using the Floyd-Warshall shortest path algorithm.

(iii) Compute the $D(U, V)$ and $W(U, V)$ matrices from the original graph and the $S'$ matrix.

(iv) Perform the LP formulation such that the cost function is minimized subject to the feasibility and clock period constraints.

This LP problem is solved to obtain the retiming solution which minimizes the number of registers while satisfying the clock period. Figure 4 shows the DFG of the 4th-order low pass IIR elliptic filter after register minimization retiming. It is observed that register minimization retiming provides a filter solution with a reduced register count for a reduced clock period. However, in some cases the clock period reduction is smaller than that obtained with the clock period minimization retiming technique, as priority is given to the register count. For the considered elliptic filter, for a clock period of 4 units, the register count is minimized to 9. After applying the register minimization retiming transformation to the filter, the critical path changes to 1 → 9 → 5.

Problem Formulation. Critical path and shortest path solving contribute most of the computation time in retiming.

Definition 1 (the path solver problem). Let $S = \{s_0, s_1, s_2, s_3, \ldots, s_k\}$, where $k$ is the maximum number of feasible solutions available for retiming of a considered filter DFG. During retiming of digital filters in high level synthesis, the shortest path between the nodes must be computed $(k + 1)$ times, where $k$ is the number of feasible solutions available for the DFG, that is, the number of unique entries in the path delay matrix $D$. Similarly, the critical path must be computed $(k + 1)$ times. General purpose processors (GPPs), on which the retiming algorithm is implemented, are fully programmable but are less efficient in terms of power and performance. Hence, the problem is to improve the performance and power of retiming using FPGA based path solvers. Further, along with retiming, the high level transformation technique called automatic pipelining is applied to improve the filter speed.

Definition 2 (multiple constant multiplication in digital filters). For the considered filter coefficient constant set $T$ in the retimed filters, find the set of multiplierless operations $\{O_1, O_2, O_3, \ldots, O_n\}$ with the minimum number of addition, subtraction, and shift operations, using the multiple constant multiplier architecture to optimize the filter architecture further.

Definition 3 (optimization and automation of filter HDL). An environment needs to be developed to obtain HDL descriptions of retimed filters, in which the user can choose different data path element architectures depending on the specifications. This reduces time to market and helps to evaluate many hardware implementation trade-offs. Filter equivalence checking after applying the high level transformations is also required and needs to be developed as a part of the optimization environment.

Figure 2: 4th-order elliptic filter after clock period minimization retiming. (a) DFG after retiming. (b) Clock period (in time units) and register count before and after retiming.

Figure 3: (a) Graph before register sharing. (b) Graph after register sharing.

Principle of Shortest Path and MCM Algorithm. Several FPGA synthesis algorithms have been proposed specifically for sequential circuits. In [16], the authors proposed how to map retimed circuits onto FPGAs efficiently. In this paper, however, the authors suggest a method for making the retiming process itself efficient using FPGA based path solvers. This can be applied to any retiming technique available in the literature. The shortest path in the filter DFG is solved using the Floyd-Warshall algorithm, which uses dynamic programming to solve the shortest-paths problem on a DFG. The Floyd-Warshall algorithm solves the shortest path problem in $O(n^3)$ time, where $n$ is the number of nodes in the DFG. Let the vertices in the graph be numbered $1, 2, \ldots, n$, and consider the subset $\{1, 2, \ldots, k\}$ of these $n$ vertices. Let $d_{ij}^{(k)}$ denote the weight of the shortest path $p$ from vertex $i$ to vertex $j$ such that all intermediate vertices are contained in the set $\{1, 2, \ldots, k\}$. When $k$ lies on it, the path $p$ is decomposed into $i \rightarrow k \rightarrow j$. There are two possible situations:

(i) $k$ is an intermediate vertex on the shortest path;

(ii) $k$ is not an intermediate vertex on the shortest path.

If the vertex $k$ is not an intermediate vertex on $p$, then

$$d_{ij}^{(k)} = d_{ij}^{(k-1)}, \quad \text{else} \quad d_{ij}^{(k)} = d_{ik}^{(k-1)} + d_{kj}^{(k-1)}. \tag{5}$$

In either case, the subpaths contain only nodes from $\{1, 2, \ldots, k-1\}$. Therefore,

$$d_{ij}^{(k)} = \min\left\{ d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right\}. \tag{6}$$

Including the initial case $k = 0$,

$$d_{ij}^{(k)} = \begin{cases} W_{ij}, & k = 0, \\ \min\left\{ d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right\}, & k > 0. \end{cases} \tag{7}$$

Let $D$ be the incidence matrix, initially holding the graph edge weight information $W$. $D$ is then updated with the calculated shortest paths; see Algorithm 1.

The final $D$ matrix stores all the shortest paths. This algorithm is extended for retiming of digital filters.
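One way the shortest path recurrence can be put to work for retiming is to solve the system of retiming inequalities with it. The sketch below is a hedged illustration rather than the paper's implementation: each inequality $r(u) - r(v) \le b$ becomes an edge $v \to u$ of weight $b$ in a constraint graph, an extra source node is connected to every node with weight 0, and a negative cycle signals that no retiming exists. The function names and the example inequalities are hypothetical.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest paths; weights[i][j] is math.inf when there is no edge."""
    n = len(weights)
    d = [row[:] for row in weights]                   # D(0) = W
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

def solve_difference_constraints(n, constraints):
    """Each (u, v, b) means r(u) - r(v) <= b. Return a list r, or None if infeasible."""
    size = n + 1                                      # node n is the extra source
    w = [[math.inf] * size for _ in range(size)]
    for u, v, b in constraints:
        w[v][u] = min(w[v][u], b)                     # constraint graph edge v -> u
    for u in range(n):
        w[n][u] = 0
    d = floyd_warshall(w)
    if any(d[i][i] < 0 for i in range(size)):         # negative cycle: no solution
        return None
    return d[n][:n]                                   # r(u) = distance from the source

# Hypothetical inequalities for a 4-node DFG: five feasibility constraints
# followed by four clock period constraints.
constraints = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1),
               (2, 1, -1), (3, 1, -1), (0, 1, 0), (2, 0, 0)]
r = solve_difference_constraints(4, constraints)
print(r)
```

Any solution can be shifted by a constant without changing the retimed edge weights, so only the differences between the returned values matter.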

    n = number of rows in W; D^(0) = W
    for k = 1 to n
        for i = 1 to n
            for j = 1 to n
                d_ij^(k) = min(d_ij^(k-1), d_ik^(k-1) + d_kj^(k-1))
            end for
        end for
    end for
    return D^(n)

Algorithm 1

Figure 4: 4th-order elliptic filter after register minimization retiming. (a) DFG after retiming. (b) Clock period (in time units) and register count before and after retiming.

The multiple constant multiplication (MCM) problem is addressed in the literature [14] using either graph based methods or the common subexpression elimination method. In the common subexpression elimination algorithm, all possible subexpressions are extracted for a variable. This is possible only if the constant is expressed in minimum signed digit or canonical signed digit form. A subexpression is then found such that it can be shared by multiple constant multiplication values. In this paper, the above two concepts are extended for automatic pipelining and retiming of digital filters in high level synthesis. In all digital filters, the filter coefficients are known beforehand. Hence, the full flexibility of a multiplier is not necessary, and we can make use of MCM designs. This method is more efficient than shift-and-add multiplication because intermediate results can be shared, which reduces the area of multiplierless implementations of digital filters. The sharing of intermediate results provides greater potential area savings as the filter order increases (Figure 5).
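To make the comparison with plain shift-and-add concrete, the sketch below recodes constants into canonical signed digit (CSD) form and counts the adders/subtractors needed when each constant is built on its own. The function names are illustrative, and the constants 7, 31, and 45 are taken from the example of Figure 5, where sharing builds all three with three adders/subtractors.

```python
def csd_digits(c):
    """Canonical signed digit form of a positive integer, least significant digit first."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)          # c mod 4 == 1 gives +1, c mod 4 == 3 gives -1
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def shift_add_adders(c):
    """Adders/subtractors needed to build c alone from its nonzero CSD digits."""
    return sum(1 for d in csd_digits(c) if d) - 1

coefficients = [7, 31, 45]
separate = sum(shift_add_adders(c) for c in coefficients)
print(separate)
```

For these constants the unshared count comes to five, so reusing 7 and 31 as intermediate terms of 45 saves two adders.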

Consider the filter coefficient set to be used for the filter design, given by $T = \{c_1, c_2, c_3, \ldots, c_n\}$. We need to find the smallest set $S = \{a_1, a_2, a_3, \ldots, s_1, s_2, s_3, \ldots\}$, where the $a$ (adders/subtractors) and the $s$ (shifts) belong to $S$, such that the set is made of adders/subtractors, shifters, and $A$-operations. Shift operations can also be shared across multiple points so that the output set is optimum. Here, the $H_{\text{cub}}$ algorithm [8] is used to generate the corresponding DFG for the multiplier block implementing the parallel multiplications $c_1 x, c_2 x, \ldots, c_n x$. The only operations used in the generated DAG and the input design matrices are additions, subtractions, shifts, and negations. In this paper, the performance of MCM based filter designs is further improved by combining this approach with retiming. The multiplierless filter circuit is retimed to reduce the overall clock period, which increases the clock frequency.

Consider $l_1$ and $l_2$ as two integers which specify left shifts, let $r \ge 0$ specify a right shift, and let $s \in \{0, 1\}$ be the sign bit. An $A$-operation is an operation with two integer inputs $u$ and $v$ and one fundamental output, defined as

$$A_p(u, v) = \left| (u \ll l_1) + (-1)^s (v \ll l_2) \right| \gg r = \left| 2^{l_1} u + (-1)^s 2^{l_2} v \right| 2^{-r}, \tag{8}$$

where $\ll$ is a left binary shift, $\gg$ is a right binary shift, and $p = \{l_1, l_2, r, s\}$ is the parameter set or the $A$-configuration of $A_p$. To preserve all significant bits of the output, $2^r$ must divide $2^{l_1} u + (-1)^s 2^{l_2} v$. The left shifts are limited to the bit width of the target. All $A$-operations are used to build the $A$-graph. For a given set of target filter coefficients $C$, we can find the set $S$ such that a multiplierless digital filter is designed.

Figure 5: Example for addressing the MCM problem in digital filters. (a) MCM block with input Y and outputs k1Y, k2Y, k3Y, k4Y, and k5Y. (b) Shift-and-add realization with shared terms: O1 = 7Y = 8Y − 1Y (8Y = Y ≪ 3), O2 = 31Y = 32Y − 1Y (32Y = Y ≪ 5), and O3 = 45Y = 31Y + 14Y (14Y = 7Y ≪ 1).
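The $A$-operation of (8) can be exercised on the constants of Figure 5 with the short sketch below. The function name `a_op` and its keyword defaults are illustrative choices; the divisibility check mirrors the requirement that $2^r$ divide the sum.

```python
def a_op(u, v, l1=0, l2=0, r=0, s=0):
    """A-operation of (8): |(u << l1) + (-1)**s * (v << l2)| >> r."""
    total = (u << l1) + (-1) ** s * (v << l2)
    if total % (1 << r):
        raise ValueError("2**r must divide the sum to preserve all significant bits")
    return abs(total) >> r

# Constants of Figure 5, built from the input (fundamental 1) with three A-operations.
o1 = a_op(1, 1, l1=3, s=1)      # 7  = (1 << 3) - 1
o2 = a_op(1, 1, l1=5, s=1)      # 31 = (1 << 5) - 1
o3 = a_op(o2, o1, l2=1)         # 45 = 31 + (7 << 1)
print(o1, o2, o3)
```

Because `o3` reuses `o1` and `o2`, the three outputs need three adders/subtractors in total, which is the sharing the MCM block relies on.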

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the iteration period of the filter design [12]. The critical path can be shortened by retiming, which reduces the clock period and increases the clock speed. To reduce the critical path, we first find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. During retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects. The designer can program an FPGA to work like specific hardware. FPGAs give a large speedup over general purpose processors for many long running algorithms. Hence, for high performance systems, FPGAs are a better choice. In the present work, the path solvers are implemented on FPGA to increase performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum-delay path between the output node and the node causing the state change of the output node with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one of the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. Dedicated FPGA hardware can speed up this process at low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$ is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subseteq N$ and $O \subseteq N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. To compute the critical path on FPGA, the procedure sorts the vertices such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to the previous one, its path length is the sum of its own computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m_i}, \quad (9)$$

where $N$ is the number of adder and multiplier elements in the topologically sorted vertices connected with zero-delay edges. The delay of the circuit is given by $t_d = \max\{\gamma_m\}$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment, the steps below are used for critical path computation.

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found, and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows hold the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path, and its computational time is the critical path delay.

A critical path solver algorithm is designed on FPGA in the present work. The state diagram for the implemented critical path solver is given in Figure 6. In $S_0$, the filter graph or matrix is given as input to the critical path solver module.


Figure 6: Critical path solver state diagram (states $S_0$ to $S_8$; transition labels: incidence and node matrix obtained, zero-weight path updated, loop indices updated, superset of zero-weight path found, weight temp updated, all zero path delay found, next greater path delay found, greatest path delay found, all path delay found).

Algorithm for computing the critical path

Input: a DFG G = (V, E, t, d), where t is the computation time of a node and d is the initial delay on edge E
Output: critical path C

Sort all the vertices of the DFG G topologically, with v following u if there is a zero-delay edge u → v
for all vertices v from the sorted list
    if every edge e: u → v in G has a nonzero delay then
        γ_i = t_v
    else
        γ_i = t_v + max(γ_u) over the edges e: u → v in G with d_e = 0
    end if
    γ = {γ_1, γ_2, ..., γ_m}, where m = number of entries in the topologically sorted list
end for
compute γ = max(γ)

Algorithm 2
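The procedure of Algorithm 2 can be sketched in software as follows. This is a minimal illustration and not the HDL implemented in this work: the function name, the edge-list representation `(u, v, delay)`, and the example node times (adders with computation time 1, multipliers with computation time 2) are assumptions made for the example.

```python
from collections import defaultdict, deque


def critical_path(comp_time, edges):
    """Return (delay, path) of the longest zero-delay path in a DFG.

    comp_time: {node: computation time}; edges: list of (u, v, delay).
    """
    zero_succ = defaultdict(list)
    indegree = {node: 0 for node in comp_time}
    for u, v, delay in edges:
        if delay == 0:                      # only zero-delay edges chain nodes combinationally
            zero_succ[u].append(v)
            indegree[v] += 1

    gamma = dict(comp_time)                 # path length ending at each node
    pred = {node: None for node in comp_time}
    queue = deque(node for node in comp_time if indegree[node] == 0)
    visited = 0
    while queue:                            # topological order over zero-delay edges
        u = queue.popleft()
        visited += 1
        for v in zero_succ[u]:
            if gamma[u] + comp_time[v] > gamma[v]:
                gamma[v] = gamma[u] + comp_time[v]
                pred[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if visited != len(comp_time):
        raise ValueError("zero-delay cycle: the DFG is not computable")

    end = max(gamma, key=gamma.get)         # node that terminates the critical path
    path, node = [], end
    while node is not None:
        path.append(node)
        node = pred[node]
    return gamma[end], path[::-1]


# Hypothetical 4-node DFG: nodes 1 and 2 are adders, nodes 3 and 4 are multipliers.
times = {1: 1, 2: 1, 3: 2, 4: 2}
dfg_edges = [(1, 3, 0), (3, 2, 0), (2, 4, 1), (4, 1, 0)]
result = critical_path(times, dfg_edges)    # zero-delay chain 4 -> 1 -> 3 -> 2
```

In this example the only edge carrying a delay is 2 → 4, so the zero-delay chain 4 → 1 → 3 → 2 sets the critical path delay to 2 + 1 + 2 + 1 = 6.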

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say −1, can be used to denote an uninitialized matrix element. In state $S_1$, all the zero-weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called zero weight path. The zero weight path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column holds the destination node of that edge. Simultaneously, a count of the number of zero-weight edges is kept.

The state $S_2$ is provided to enable looping action and to update all the signals. In state $S_3$, for each row of the zero weight path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node in any zero-weight path is the same as the source node in another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then the destination node of $b$ becomes the destination node of $a$. The state $S_4$ is provided to enable looping action and to update all the signals. At the end of this state, the zero weight path matrix contains only those paths that are supersets of the remaining zero-weight paths. In state $S_5$, the module calculates the sum of all the node weights along each of these paths. State $S_6$ is provided for looping action and for updating all signals.

In state $S_7$, the path with the highest sum of node weights is found; this is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical


Algorithm for computing the shortest path

Input: a DFG G = (V, E, t, d), where t is the computation time of a node and d is the initial delay on edge E
Output: all-pair shortest path matrix M

for i = 1 to N
    for j = 1 to N
        if i = j then
            M[i, j] = 0
        else
            M[i, j] = inf
        end if
    end for
end for
for all the edges e: u → v
    M[u, v] = d for edge e
end for
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
            end if
        end for
    end for
end for
Output the shortest path matrix M

Algorithm 3
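A software sketch of Algorithm 3 is given below for reference. It is not the FPGA implementation; the function name and the 0-based node numbering are assumptions of the example, and `math.inf` plays the role of the large sentinel value (such as 255) that an HDL description would use in place of infinity.

```python
import math


def floyd_warshall(n, edges):
    """All-pair shortest path matrix for nodes 0..n-1; edges: list of (u, v, d)."""
    # Initialization: zero on the diagonal, infinity elsewhere.
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, d in edges:
        dist[u][v] = min(dist[u][v], d)     # keep the lightest of any parallel edges
    # Relax every pair (i, j) through every intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Hypothetical 3-node loop whose edge weights are register counts.
spm = floyd_warshall(3, [(0, 1, 1), (1, 2, 1), (2, 0, 0)])
```

The three nested loops give the $O(n^3)$ running time; in hardware, the looping states of the solver play the role of these loop counters.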

Figure 7: Zero path delays and critical path for the 4th-order low pass elliptic filter (horizontal axis: vertex number in filter DFG; vertical axis: zero delay path for filter DFG).

path matrix. The signals in this matrix are output as the critical path. The state $S_8$ is provided to enable looping action and to update all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve a system of inequalities. It is seen that the time needed to compute the critical path on FPGA is considerably lower than the computation time on a general purpose processor. This also reduces the retiming computation time. The zero delay paths computed for the 4th-order elliptic filter are shown in Figure 7. The highlighted path delay is from $1 \to 9 \to 5 \to 14 \to 8$, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$, and let $T(u, v)$ be the total computation time of the zero-delay path from $u$ to $v$. We check the condition $T(u, v) - \min\{t(u), t(v)\} > \textit{derived clock period}$ and select the paths that satisfy it for retiming, so that the computation time in these paths can be reduced. We have to retime the edges by constructing a system of linear inequalities. This system can be solved using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all-pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden on the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$, and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight temp. In state $S_2$, the signal weight temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$, and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight temp. In the state $S_4$, the elements of the signal matrix weight temp are copied to the output matrix.


Figure 8: Shortest path solver state diagram (transition labels: incidence matrix copied to weight temp, weight temp updated, loop indices updated, pairwise shortest path found, all pair shortest paths found, weight temp copied to output matrix, weight temp complete array to output matrix).

The state $S_5$ enables looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. Digital FIR filters and transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find a ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small, together with $A$-operations composed of the minimum number of addition, subtraction, and shift operations. After this target set is obtained, multiplierless multiple constant multiplication filters can be designed with it. Multiple constant multiplication (MCM) is an efficient way of implementing


Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature; the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and $A$-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This graph is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{cub}$ algorithm is used for computing $G_i$. The input to the $H_{cub}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. Computing $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{s \mid \operatorname{dist}(R, s) = 1\} = A_s(R, R). \quad (11)$$
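The successor set of (11) can be approximated with a short sketch. It is a simplification and not the $H_{cub}$ implementation used in this work: only one operand is shifted left, results are normalized to odd values by right shifting, and the bounds `max_shift` and `max_value` are hypothetical parameters standing in for the bit-width limit mentioned earlier.

```python
def successors(ready, max_shift=4, max_value=1 << 8):
    """Constants reachable from the ready set with one add/subtract and shifts."""
    ready = set(ready)
    found = set()
    for u in ready:
        for v in ready:
            for shift in range(max_shift + 1):
                for cand in ((u << shift) + v, abs((u << shift) - v)):
                    # Normalize to an odd fundamental (the right shift r of the A-operation).
                    while cand and cand % 2 == 0:
                        cand >>= 1
                    if 0 < cand <= max_value and cand not in ready:
                        found.add(cand)
    return found


one_step = successors({1}, max_shift=3)     # constants one operation away from {1}
```

Starting from $R = \{1\}$ with shifts up to 3, the sketch yields the constants 3, 5, 7, and 9; once 7 and 31 are in the ready set, 45 = 31 + (7 ≪ 1) becomes a successor as well.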

For the target set of constants $T$ of the considered filter graph $G$, the $H_{cub}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$. This can happen when all the targets are more than one $A$-operation away. The optimal part applies when $T \cap S \neq \emptyset$: there is a target in the successor set and it can be synthesized. If all the targets are synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit,
(ii) cumulative benefit.

To build the heuristic, we can define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \quad (12)$$

A successor $s \in S$ that is closest to the target set needs to be picked to minimize the cost. This is possible if we can compute or estimate the $A$-distance. It is also useful to take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we can use a weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left( \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t) \right), \quad (13)$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{cub}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \quad (14)$$
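The cumulative benefit selection of (13) and (14) can be sketched as below. The distance estimator is passed in as a function because the estimator used by $H_{cub}$ is not reproduced here; `toy_dist` is a hypothetical stand-in that returns 0, 1, or 2 only, and the weight is taken as $10^{-\operatorname{dist}(R + s, t)}$ on the assumption that the weight factor must decrease with distance.

```python
def weighted_benefit(dist, ready, s, t):
    """Weighted benefit B_b(R, s, t) of equation (13)."""
    before = dist(ready, t)
    after = dist(ready | {s}, t)
    return 10.0 ** (-after) * (before - after)


def pick_successor(dist, ready, succ, targets):
    """Cumulative benefit heuristic of equation (14): arg max over s of the summed benefit."""
    return max(succ, key=lambda s: sum(weighted_benefit(dist, ready, s, t) for t in targets))


def toy_dist(ready, t, max_shift=6):
    """Hypothetical estimator: 0 if t is ready, 1 if one add/subtract away, else 2."""
    if t in ready:
        return 0
    for u in ready:
        for v in ready:
            for shift in range(max_shift + 1):
                if t in ((u << shift) + v, abs((u << shift) - v)):
                    return 1
    return 2


chosen = pick_successor(toy_dist, {1}, [3, 5, 7, 9], [7, 45])
```

With targets 7 and 45, the successor 7 is selected: it synthesizes a target directly (benefit 1), whereas 3, 5, and 9 only bring 45 one step closer (benefit 0.1 each).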

Here the cumulative benefit heuristic adds up the weighted benefits over all the targets. With this method, the target set is calculated. With this target set, a filter graph which is multiplierless MCM based can be designed. It is found that multiplierless designs reduce the combinational path delays due to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combinational delay and critical path


Figure 10: Multiplierless MCM based 4th-order elliptic filter (adders, subtractors, and $z^{-1}$ delay elements between input and output, with left shifts of 1, 2, 3, 5, and 6 bits in place of the multipliers).

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{cub}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces computation time when compared to previous methods [8, 16] to design multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also provides a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying a Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in the multiplierless MCM approach, as the multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.
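The calculation of retimed edge weights from the retiming values of the two end nodes can be illustrated as follows. The relation $w_r(e) = w(e) + r(v) - r(u)$ for an edge $e: u \to v$ is the standard one from the retiming literature and is assumed here, since it is not written out above; the function name and the example graph are hypothetical.

```python
def retime(edges, r):
    """Apply retiming values r[node] to edges (u, v, w): w_r = w + r[v] - r[u]."""
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            # A negative register count means the retiming violates feasibility.
            raise ValueError(f"infeasible retiming on edge {u} -> {v}")
        retimed.append((u, v, w_r))
    return retimed


# Hypothetical 3-node loop with two registers in total.
loop = [(1, 2, 1), (2, 3, 0), (3, 1, 1)]
moved = retime(loop, {1: 0, 2: -1, 3: 0})   # moves one register from edge 1->2 to edge 2->3
```

The register count around the loop stays at 2 before and after retiming, which is why the functionality is preserved while the zero-delay paths change.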

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as a part of this research work. Initially, the design of filters is performed using a retimed architecture, where the user can choose either clock period


Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of the 4th-order elliptic filter.

minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for the same. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file which was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop-down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the chosen adders and multipliers. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are described below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \quad (15)$$

In each stage the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, for an $m \times n$ array multiplier we need $m \times n$ AND gates, $n$ half adders, and $(m - 2) \times n$ full adders.
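Equation (15) can be checked with a bit-level sketch in which every partial product bit $A_i B_j$ comes from an AND gate and is weighted by $2^{i+j}$. The function name is illustrative, and the operands are assumed to be unsigned $n$-bit integers.

```python
def array_multiply(a, b, n):
    """Unsigned n-bit multiplication as a sum of AND-gate partial products."""
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1                       # bit i of A
        for j in range(n):
            b_j = (b >> j) & 1                   # bit j of B
            product += (a_i & b_j) << (i + j)    # AND gate output weighted by 2**(i+j)
    return product


p_small = array_multiply(13, 11, 4)
```

The double loop visits $n^2$ bit pairs, which corresponds to the $m \times n$ AND gates of the array structure.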

Radix 4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, and each string is converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four modified Booth encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
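The halving of the partial products can be seen in a small behavioural sketch of radix-4 Booth recoding. It is not the RTL of the tool: the function names are illustrative, the multiplier is assumed to be an $n$-bit two's complement number with $n$ even, and each overlapping group of three bits $(b_{2k+1}, b_{2k}, b_{2k-1})$ is mapped to a digit $-2 b_{2k+1} + b_{2k} + b_{2k-1}$ in $\{-2, \ldots, 2\}$.

```python
def booth_radix4_digits(multiplier, n):
    """Recode an n-bit two's complement multiplier (n even) into n/2 radix-4 digits."""
    bits = multiplier & ((1 << n) - 1)
    digits = []
    prev = 0                                     # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)        # digit in {-2, -1, 0, 1, 2}
        prev = b1
    return digits


def booth_multiply(a, b, n):
    """Multiply a by the n-bit signed multiplier b using one partial product per digit."""
    return sum((d * a) << (2 * k) for k, d in enumerate(booth_radix4_digits(b, n)))


digits_32 = booth_radix4_digits(12345, 32)       # 16 digits for a 32-bit multiplier
```

Each digit selects one of the multiples $0$, $\pm a$, or $\pm 2a$, so a 32-bit multiplier produces sixteen partial products, as stated above.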

Vedic Multiplier. It is used for faster multiplication operations on higher order bits. It has less combinational path delay [21] compared with the others when the bit size is higher. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertical and crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit adder is constructed by cascading full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage. An $n$-bit parallel adder requires $n$ full adders.

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using the Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


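Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |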

The basic equations used in parallel prefix adders are given below. The equations of bit generate and propagate are

$$
\begin{aligned}
G_{0:0} &= G_0 = c_{\mathrm{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned}
\quad (16)
$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \quad (17)$$

Different adders are designed for different bit sizes, and their VLSI design metrics are compared as shown in Table 2. The delay reported is the combinational path delay after synthesis. It is measured in ns.
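Equations (16) and (17) can be exercised with a behavioural sketch of a prefix adder. A Kogge-Stone style prefix network is assumed here for simplicity; it is not the Brent-Kung or Ling RTL generated by the tool, and the function name is illustrative. Position 0 of the prefix arrays models the carry-in ($G_0 = c_{\mathrm{in}}$, $P_0 = 0$), so position $k$ holds the signals of operand bit $k - 1$.

```python
def prefix_add(a, b, n, c_in=0):
    """Add two unsigned n-bit numbers with a Kogge-Stone style prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # bit propagate
    G = [c_in] + g                                       # position 0 is the carry-in
    P = [0] + p
    span = 1
    while span <= n:
        new_G, new_P = G[:], P[:]
        for i in range(span, n + 1):
            # Group combine of equation (16).
            new_G[i] = G[i] | (P[i] & G[i - span])
            new_P[i] = P[i] & P[i - span]
        G, P = new_G, new_P
        span *= 2
    # After the prefix stage, G[i] is the carry into bit i, as needed by equation (17).
    total = sum((p[i] ^ G[i]) << i for i in range(n))
    return total + (G[n] << n)                           # append the carry-out


s_small = prefix_add(13, 11, 4)
```

The loop runs about $\log_2 n$ times, which is the source of the speed advantage of prefix adders over the ripple-carry adder, where the carry passes through all $n$ stages.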

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can choose any one constraint among area, power, and speed for digital filter HDL generation. Along with this, an option is also created for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for the same. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{cub}$ based algorithm is considered for implementing MCM blocks in multiplierless digital filters for a specific user-defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, these filters are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block will have a tree depth in terms of different components, and this depth is assumed to be infinity in our work. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be reiterated due to a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor on which the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path along with the critical path is computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device considered for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.


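Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |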

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Number of slices | Number of LUTs | Number of slice flip-flops | Min period in ns | Setup time in ns | Hold time in ns | Max frequency (Hz) |
|---|---|---|---|---|---|---|---|
| Critical path solver | 5804 | 10462 | 3664 | 9068 | 1572 | 6141 | 110277 |
| Shortest path solver | 4147 | 7511 | 1496 | 14089 | 10477 | 4114 | 70978 |

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the general purpose processor based path solvers. The time taken by the FPGA path solvers is compared in Table 4 with the time taken by the algorithms run on a general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers. These time gains help speed up the results, which is crucial for retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed, and they are compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the iteration bound of the filter. Here three models are considered. Model1 is the filter without retiming and with adder, subtractor, multiplier, and delay elements. Model2 is the retimed filter based on the clock period minimization algorithm. Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However this

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of orders 2 to 12; series: clock period and register count for Model 1, Model 2, and Model 3).

might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the cost of the clock period.

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In

16 VLSI Design

Table 4: Computation time comparison.

Filter   Critical path solver algorithm                 Shortest path solver algorithm
order    IIR filter             FIR filter              IIR filter             FIR filter
         FPGA (ns)   GPP (ms)   FPGA (ns)   GPP (ms)    FPGA (ns)   GPP (ms)   FPGA (ns)   GPP (ms)
2        460         138        906         1283        278         305        305         1280
4        1571        1578       1631        1446        368         1391       1391        1319
6        2998        1918       1923        1547        398         1542       1542        1731
8        3162        2190       2971        1642        452         2523       2523        3294
10       3981        2627       3653        1861        536         4293       4293        4534
12       4672        3142       4328        2352        671         5534       5534        5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

Filter   Adder/multiplier combination            Before retiming                         After retiming
block                                            LUTs   Max freq (MHz)   Power (mW)      LUTs   Max freq (MHz)   Power (mW)
IIR-10   Brent-Kung adder/array multiplier       2222   62526            99              2411   76977            89
IIR-10   Ling adder/Vedic multiplier             2214   69702            112             2193   95381            94
IIR-10   Ripple carry adder/Booth multiplier     2146   50861            114             1809   65248            95
FIR-10   Brent-Kung adder/array multiplier       1736   62526            94              1811   9943             85
FIR-10   Ling adder/Vedic multiplier             2162   72493            111             2271   10072            95
FIR-10   Ripple carry adder/Booth multiplier     1637   52302            105             1615   71345            87

the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also based on the bit size. If the user does not want to use these built-in combinations, any of the available components can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here 3 models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is retained before and after retiming. The frequency improvement seen for different filters under the above models is given in Figure 14. The operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor (x-axis: filter type, FIR-2 to FIR-12 and IIR-2 to IIR-12; y-axis: frequency improvement; series: frequency improvement from Model 1 to Model 2 and from Model 1 to Model 3).

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. Good quality ECG is utilized by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings


Table 6: Comparison of area, delay, and power for different models of various digital filters.

Filter   Adders, multipliers, flip-flops   Delay: max freq in MHz        Power in Watts
block    Model 1   Model 2   Model 3       Model 1   Model 2   Model 3   Model 1   Model 2   Model 3
FIR-2    523       503       504           5154      19214     34062     0056      0063      0065
FIR-4    1035      1105      1108          5941      10841     22204     0047      0057      0060
FIR-6    727       1707      17014         6291      6764      25947     0051      0062      0064
FIR-8    1559      2209      22016         5482      6592      11791     0054      0058      0065
FIR-10   18611     25011     25011         4822      5637      10072     0058      0061      0063
FIR-12   20713     29013     29013         4634      5486      19340     0060      0063      0067
IIR-2    943       1103      1103          5503      7553      8910      0047      0050      0050
IIR-4    1675      2005      1906          2278      11388     15165     0059      0062      0063
IIR-6    24117     3507      3508          3871      4254      53142     0051      0059      0058
IIR-8    301110    33010     3307          2946      7014      11021     0044      0064      0081
IIR-10   371613    54013     54014         3643      4885      95381     0051      0067      0085
IIR-12   422017    63017     63019         3973      5074      10152     0063      0071      0088

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded.

are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.
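To make the power-line noise removal concrete, the following Python fragment sketches a generic second-order IIR notch filter. It is not the retimed multiplierless MCM filter of this paper: the 500 Hz sampling rate, the pole radius `r`, and the function names `notch_coefficients` and `iir_filter` are all assumptions chosen for illustration.

```python
import math

def notch_coefficients(f0, fs, r=0.95):
    """Second-order notch: zeros on the unit circle at f0, poles at radius r (assumed)."""
    w0 = 2.0 * math.pi * f0 / fs
    b = [1.0, -2.0 * math.cos(w0), 1.0]          # numerator (zeros)
    a = [1.0, -2.0 * r * math.cos(w0), r * r]    # denominator (poles), a[0] == 1
    return b, a

def iir_filter(b, a, x):
    """Direct-form I difference equation; assumes a[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y.append(acc)
    return y

fs = 500.0                                        # assumed sampling rate in Hz
hum = [math.sin(2.0 * math.pi * 50.0 * n / fs) for n in range(1000)]
slow = [math.sin(2.0 * math.pi * 5.0 * n / fs) for n in range(1000)]

b, a = notch_coefficients(50.0, fs)
hum_residual = max(abs(v) for v in iir_filter(b, a, hum)[500:])
slow_peak = max(abs(v) for v in iir_filter(b, a, slow)[500:])
print(hum_residual, slow_peak)
```

Once the start-up transient has decayed, the 50 Hz tone is suppressed while a 5 Hz component passes with a gain close to one. In a hardware realization the two coefficient multiplications would be the candidates for the shift-and-add replacement discussed in this paper.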

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks which can be used in retimed filter design for specific VLSI design constraints such as power, area, and timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves a lot of the designers' time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for solving the critical path and shortest path in retiming computation, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain when compared to filter designs for which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that fits best in an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


purpose processor where actual retiming vectors are computed for digital filters, the speed with which the retiming transformation is performed suffers since the entire transformation code is written as software. Hence, an FPGA based path solver architecture is designed in this paper, which addresses the frequency issue in retiming and reduces the burden on general purpose processors.

A computer aided design (CAD) tool framework called DiFiDOT is developed, which generates synthesizable hardware descriptions of the chosen digital filter with specified user constraints such as area, speed, and power. Since digital filters are composed of adders/subtractors, multipliers, and delay elements, DiFiDOT picks the best choice of adders and multipliers as per the user's design constraints. The multiplication operation is expensive in terms of area, power, and delay. Exchanging multipliers with adders is advantageous because adders cost less than multipliers in terms of silicon area [7]. Since the coefficients to be multiplied are known beforehand, the full flexibility of a multiplier is not necessary in the design. So a multiplierless design of the digital filter is proposed under the multiple constant multiplications architecture, and an option is included in DiFiDOT for generating multiplierless hardware descriptions. This significantly reduces the area of filters when compared to those designed using multiplier blocks. Here the concept of sharing partial terms in multiple constant multiplications (MCMs) [8] is used, which reduces area and covers all possible partial terms that may be used to generate the set of coefficients in the MCM instance. For simulations, the authors have used some of the ACM/SIGDA benchmark circuits.

2. Background

High level transformation techniques are applied to obtain optimal speed in sequential filter systems. For the designed optimization environment, the input is given as data flow graphs. This section introduces data flow graphs (DFGs) and the problem definitions and gives an overview of previously proposed retiming transformation algorithms with their drawbacks.

2.1. Data Flow Graphs (DFGs). Digital filters are an important part of digital signal processors. Their extraordinary performance is one of the factors that made DSP so popular [9]. Filters are used in audio processing, speech processing (detection, compression, and reconstruction), modems, motor control algorithms, video and image processing, and so forth. Retiming is an important step [10] in high level synthesis (HLS) of digital filters. HLS maps behavioural descriptions of algorithms to physical realizations. All digital filters, whether iterative, recursive, or nonrecursive, can be represented using data flow graphs (DFGs) [11, 12]. Any digital filter can be realized by functional blocks such as adder/subtractor and multiplier with delay elements. A filter DFG consists of such functional blocks and connectivity information for the data flow. The example used here is a 4th-order low pass elliptic filter block. High level transformations operate on the filter functional blocks for better performance. This can be done by changing the execution order or by altering the number of functional blocks in the critical path during the retiming [3] process. Performance can also be improved by altering the architecture of a functional block's implementation without altering the functionality of the filter in the optimization environment. In this paper, the MCM technique is used to achieve this. For applying all these transformations, the input is given in the form of DFGs. The input DFGs can also be represented in the form of matrices for further computations. In this paper, a 4th-order low pass elliptic filter is used as an application example to explain the present work. The elliptic filter response is given by

\[ |H(j\omega)|^{2} = \frac{1}{1 + \epsilon^{2} R_{n}^{2}(\omega, \gamma)}, \tag{1} \]

where $R_{n}(\omega, \gamma)$ is the $n$th order Chebyshev rational function with ripple parameter $\gamma$. Let $A_p$ be the maximum pass-band loss and let $A_s$ be the minimum stop-band loss in decibels. Depending on the need, we can define $\omega_p$ and $\omega_s$, which are the pass-band cutoff frequency and the stop-band cutoff frequency. The selectivity factor $k$ is computed using $\omega_p$ and $\omega_s$. With these parameters, the modular constant $q = u + 2u^{5} + 15u^{9} + 150u^{13}$ is computed, where $u$ is

\[ u = \frac{1 - \sqrt[4]{1 - k^{2}}}{2\left(1 + \sqrt[4]{1 - k^{2}}\right)}. \tag{2} \]

The discrimination factor $D$ and the filter order $n$ are used to obtain $A_s$, which is given by

\[ A_s = 10 \log\left(1 + \frac{10^{A_p/10} - 1}{16 q^{n}}\right). \tag{3} \]
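As a worked illustration of equations (2) and (3), the Python fragment below evaluates $u$, $q$, and $A_s$ numerically. It is only a sketch: the selectivity factor is assumed to be $k = \omega_p/\omega_s$, the logarithm in (3) is assumed to be base 10, and the sample values ($k = 0.5$, $A_p = 1$ dB, $n = 4$) and function names are hypothetical.

```python
import math

def modular_constant(k):
    """Return (u, q) for a selectivity factor 0 < k < 1 (assumed k = w_p / w_s)."""
    root = (1.0 - k * k) ** 0.25                  # fourth root of (1 - k^2)
    u = (1.0 - root) / (2.0 * (1.0 + root))       # equation (2)
    q = u + 2 * u**5 + 15 * u**9 + 150 * u**13    # series for the modular constant
    return u, q

def stopband_loss(a_p_db, q, n):
    """Equation (3): minimum stop-band loss in dB for filter order n."""
    return 10.0 * math.log10(1.0 + (10.0 ** (a_p_db / 10.0) - 1.0) / (16.0 * q**n))

u, q = modular_constant(k=0.5)                    # hypothetical selectivity factor
a_s = stopband_loss(a_p_db=1.0, q=q, n=4)         # hypothetical 1 dB ripple, order 4
print(f"u = {u:.6f}, q = {q:.6f}, A_s = {a_s:.2f} dB")
```

Raising the order `n` while holding `k` and `A_p` fixed increases the attainable stop-band loss, which is how the order is traded against the specification.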

The 4th-order elliptic filter block is shown in Figure 1(a). The typical DFG for the filter is shown in Figure 1(b). For the designed optimization environment, the filter information is given in the form of two matrices. The node-weight matrix holds the node weights, which are the computation times of the nodes in the filter graph expressed in unit delays. The computational complexity of addition is $O(n)$, whereas for multiplication it is $O(n^{2})$ for two $n$-digit numbers. Because multiplication is more complex than addition, the time delay considered for multiplication in retiming in the present work is twice that of addition. The incidence matrix defines the edge weights between all the nodes, which represent connectivity information. The number of delay elements present between the computation nodes (adder and multiplier) is taken as the edge weight. If there are no delay elements and an adder or multiplier node is directly connected to another node, the edge weight is zero. The node-weight matrix and incidence matrix are the inputs to the optimization environment, where high level transformations are applied to obtain performance improvement. The critical path and shortest path of the filter are computed for retiming, which is one of the efficient

Figure 1: (a) Block diagram of the 4th-order low pass elliptic filter; (b) DFG of the elliptic filter block.

optimization techniques to obtain a filter solution with reduced clock period, which in turn increases the filter speed. The critical path can be obtained by observing the DFG in Figure 1(b). The critical path is node 1 → 9 → 5 → 14 → 8.
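As a minimal sketch of the matrix representation described above, the following Python fragment stores a small hypothetical four-node DFG (not the elliptic filter of Figure 1) as a node-weight list and an edge-weight matrix, then finds the critical path as the longest chain of nodes joined by zero-delay edges. The example graph, the function name `critical_path`, and the use of `None` for a missing edge are assumptions made for illustration. Adders take 1 time unit and multipliers 2, following the convention stated in the text.

```python
from functools import lru_cache

# Hypothetical 4-node DFG: nodes 0 and 1 are adders, nodes 2 and 3 are multipliers.
node_weight = [1, 1, 2, 2]                 # computation time of each node
n = len(node_weight)
# edge_weight[u][v] = number of delay elements on edge u -> v, None if no edge.
edge_weight = [[None] * n for _ in range(n)]
for u, v, delays in [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]:
    edge_weight[u][v] = delays

def critical_path(node_weight, edge_weight):
    """Return (time, path) of the longest path that crosses no delay element.

    Assumes the DFG has no zero-delay cycle, as any synchronous filter must.
    """
    size = len(node_weight)

    @lru_cache(maxsize=None)
    def longest_from(u):
        best = (node_weight[u], (u,))
        for v in range(size):
            if edge_weight[u][v] == 0:     # zero-delay edge: same clock cycle
                t, path = longest_from(v)
                best = max(best, (node_weight[u] + t, (u,) + path))
        return best

    return max(longest_from(u) for u in range(size))

time_units, path = critical_path(node_weight, edge_weight)
print(time_units, path)
```

For this example the critical path takes 3 time units (a multiplier feeding an adder with no delay element in between), so the unretimed clock period is 3.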

Retiming Transformation. Retiming is a high level transformation technique in which the location of the registers is altered in such a way that the overall clock period is reduced, thereby increasing the clock frequency [13]. This happens due to the reduction in the critical path, which bounds the speed of the design. Through intelligent placement of registers, the clock period is minimised without altering the filter functionality. The critical path is the longest computation path between computational elements [14] or delay elements. The critical path can also be minimized by inserting delay elements on the primary inputs of the filter circuit and retiming the circuit. This is called the automatic pipelining technique. Both methods are used to find the best solution in the present work. Retiming for filter optimization is found to be an NP-complete problem, and the time to find the solution increases as the problem size increases. There are two ways of applying the retiming transformation:

(i) retiming using the clock period minimization method;

(ii) retiming using the register minimization method.

The retiming algorithm for clock period minimization is efficient in terms of clock frequency improvement. Its computational complexity is $O(n^{3} \log n)$, where $n$ is the number of nodes, that is, computation elements such as adders and multipliers. The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also gives a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated using the original weight and the retiming values of the two nodes connected by this edge.

(i) Calculate $M = t_{\max} n$, where $n$ is the number of nodes in the original DFG $G$ and $t_{\max}$ is the maximum computation time of all the nodes in the DFG. Also compute the critical path, which defines the required clock period of the original graph.

(ii) Create a new DFG $G^{*}$ from $G$. $G^{*}$ has the same nodes and edges as $G$. For each edge in $G^{*}$, the edge weight is $W^{*}(e) = M W(e) - t(U)$, where $W(e)$ is the weight of the same edge in $G$ and $t(U)$ is the computation time of the node initiating this edge.

(iii) Apply the Floyd-Warshall shortest path algorithm to compute $S^{*}_{UV}$, the shortest path from node $U$ to node $V$ in $G^{*}$.

(iv) From $S^{*}_{UV}$, calculate $W_{UV}$ and $D_{UV}$. If $U \neq V$, then $W_{UV} = \lceil S^{*}_{UV}/M \rceil$ and $D_{UV} = M W_{UV} - S^{*}_{UV} + t(V)$. If $U = V$, then $W_{UV} = 0$ and $D_{UV} = t(U)$. Here $t(U)$ and $t(V)$ are the computation times of node $U$ and node $V$, respectively.

(v) Find the maximum and minimum values of $D_{UV}$. Check all the possible clock periods one by one, starting from the maximum value of $D_{UV}$ down to the minimum value. When a clock period that gives a feasible solution is found, stop and find the minimal clock period by solving for the critical path of the solution. The solution contains the retiming values for all nodes.

Here, a 4th-order low pass IIR elliptic filter is designed and retimed using the above clock period minimization algorithm. The retimed data flow graph with reduced clock period is shown in Figure 2.

After applying the retiming transformation to the filter, the critical path changes to 2 → 1 → 9.
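The clock period minimization steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the tool described in this paper: it uses a small hypothetical four-node DFG, assumes the convention that the retimed weight of an edge $U \to V$ is $W(e) + r(V) - r(U)$ (so feasibility reads $r(U) - r(V) \le W(e)$), solves each constraint system with Bellman-Ford relaxation rather than Floyd-Warshall, and tries candidate periods from the smallest upward so that the first feasible one is the minimum. All function names are hypothetical.

```python
INF = float("inf")

def wd_matrices(t, edges):
    """Steps (i)-(iv): build the W and D matrices of a DFG."""
    n = len(t)
    M = max(t) * n                                   # step (i)
    S = [[INF] * n for _ in range(n)]
    for u, v, w in edges:                            # step (ii): W*(e) = M*W(e) - t(U)
        S[u][v] = min(S[u][v], M * w - t[u])
    for k in range(n):                               # step (iii): Floyd-Warshall
        for i in range(n):
            for j in range(n):
                if S[i][k] + S[k][j] < S[i][j]:
                    S[i][j] = S[i][k] + S[k][j]
    W = [[None] * n for _ in range(n)]
    D = [[None] * n for _ in range(n)]
    for u in range(n):                               # step (iv)
        for v in range(n):
            if u == v:
                W[u][v], D[u][v] = 0, t[u]
            elif S[u][v] < INF:
                W[u][v] = -(-S[u][v] // M)           # integer ceiling of S/M
                D[u][v] = M * W[u][v] - S[u][v] + t[v]
    return W, D

def retiming_for_period(t, edges, W, D, c):
    """Return retiming values for clock period c, or None if c is infeasible."""
    n = len(t)
    # A constraint r(U) - r(V) <= b becomes a constraint-graph edge V -> U of weight b.
    cons = [(v, u, w) for u, v, w in edges]                       # feasibility
    cons += [(v, u, W[u][v] - 1) for u in range(n) for v in range(n)
             if D[u][v] is not None and D[u][v] > c]              # critical path
    r = [0] * n            # same as a virtual source tied to every node with weight 0
    for _ in range(n + 1):
        changed = False
        for src, dst, b in cons:
            if r[src] + b < r[dst]:
                r[dst], changed = r[src] + b, True
        if not changed:
            return r
    return None            # still relaxing: the constraint graph has a negative cycle

def min_clock_period(t, edges):
    """Step (v): test candidate periods (unique D entries), smallest first."""
    W, D = wd_matrices(t, edges)
    for c in sorted({d for row in D for d in row if d is not None}):
        r = retiming_for_period(t, edges, W, D, c)
        if r is not None:
            return c, r
    raise ValueError("no feasible clock period")

t = [1, 1, 2, 2]                                     # adders: 1, multipliers: 2
edges = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]   # (U, V, delays)
period, r = min_clock_period(t, edges)
retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
print(period, r, retimed)
```

For this example the clock period drops from 3 time units to 2. Each candidate period requires one shortest-path style solve, which is the repeated work that the path solver architecture is meant to accelerate.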

Since each delay element occupies about one-third of the area of a binary adder, it is important to reduce the number of delay elements [11]. In retiming using register minimization, we obtain the digital filter that uses the minimum number of registers and satisfies the clock period constraints [15]. Here forward splitting, or register sharing [12], is used. If a node has several output edges carrying the same signal, the number of registers required to implement these edges is the maximum number of registers on any one of the edges. Consider Figure 3. The number of registers required in Figure 3(a) is 6, whereas after register sharing this is reduced to 3, as shown in Figure 3(b).
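The register sharing rule can be checked with a few lines of Python. This is a minimal sketch: the function name `register_cost` and the edge-list format are assumptions, and the fan-out delays 2, 3, and 1 are taken to be those of the node in Figure 3(a).

```python
from collections import defaultdict

def register_cost(edges, share=True):
    """Total register count of a DFG given (source, destination, delays) edges.

    With sharing, each node needs only the maximum delay over its output edges;
    without sharing, every edge gets its own registers.
    """
    per_node = defaultdict(list)
    for u, _v, delays in edges:
        per_node[u].append(delays)
    combine = max if share else sum
    return sum(combine(ds) for ds in per_node.values())

# One node driving three edges with 2, 3, and 1 delay elements.
fanout = [(0, 1, 2), (0, 2, 3), (0, 3, 1)]
print(register_cost(fanout, share=False), register_cost(fanout, share=True))
```

The two values printed are 6 and 3, matching the counts quoted for Figure 3.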

The number of registers $R_V$ needed to implement the output edges $e$ of node $V$ in the retimed graph, whose edge weights are $W_r(e)$, and the total cost are

\[ R_V = \max_{e}\bigl(W_r(e)\bigr), \qquad \text{Cost} = \sum_{V} R_V. \tag{4} \]

The cost is minimized subject to

(i) fan-out constraints: $R_V \ge W_r(e)$ for all $V$ and all edges $e$ from $V$ to any other vertex;

(ii) feasibility constraints: $r(U) - r(V) \le W(e)$ for every edge $U \xrightarrow{e} V$;

(iii) clock period constraints: $r(U) - r(V) \le W(U,V) - 1$ for all vertices $U$, $V$ such that $D(U,V) > c$, where $c$ is the clock period.

This method makes use of gadgets to represent the nodes with multiple output edges. The register minimization retiming can be modeled as a linear programming (LP) problem. A dummy node with zero computation time is introduced in the gadget. The weight of the gadget edge $\hat{e}_i$ is defined to be $W(\hat{e}_i) = W_{\max} - W(e_i)$, where $W_{\max} = \max_i W(e_i)$ for $1 \le i \le k$ and $k$ is the number of output edges. A breadth parameter $\beta$ is also associated with each edge to model the memory required by edge $e_i$. The breadth of each edge is the inverse of $k$, that is, $\beta = 1/k$. A binary search is performed over the clock period. The register minimization retiming values can be obtained as follows:

(i) Use the gadget model of the graph to compute the cost function.

(ii) Calculate $S'$ by using the Floyd-Warshall shortest path algorithm.

(iii) Compute the $D(U,V)$ and $W(U,V)$ matrices from the original graph and the $S'$ matrix.

(iv) Perform the LP formulation such that the cost function is minimized subject to the feasibility and clock period constraints.

This LP problem is solved to obtain the retiming solution which minimizes the number of registers while satisfying the clock period. Figure 4 shows the DFG of the 4th-order low pass IIR elliptic filter. It is observed that register minimization retiming provides a filter solution with reduced register count for a reduced clock period. However, in some cases the clock period reduction is smaller than that achieved by the clock period minimization retiming technique, as priority is given to the register count. For the considered elliptic filter, for a clock period of 4 units, the register count is minimized to 9. After applying the register minimization retiming transformation to the filter, the critical path changes to 1 → 9 → 5.

Problem Formulation. Critical path and shortest path solving contribute most of the computation time in retiming.

Definition 1 (the path solver problem). Let $S = \{s_0, s_1, s_2, s_3, \ldots, s_k\}$, where $k$ is the maximum number of feasible solutions available for retiming of a considered filter DFG. During retiming of digital filters in high level synthesis, the shortest path between the nodes must be computed $(k + 1)$ times, where $k$ is the number of feasible solutions available for the DFG, that is, the number of unique entries in the path delay matrix $D$. Similarly, the critical path must be computed $(k + 1)$ times. General purpose processors (GPPs), where the retiming algorithm is implemented, are fully programmable but are less efficient in terms of power and performance. Hence the problem is to improve the performance and power of retiming using FPGA based path solvers. Further, along with retiming, a high level transformation technique called automatic pipelining is applied to improve the filter speed.

Definition 2 (multiple constant multiplication in digital filters). For the considered filter coefficient constant $T$ in the retimed filters, find the set of multiplierless operations $\{O_1, O_2, O_3, \ldots, O_n\}$ with the minimum number of addition, subtraction, and shift operations using the multiple constant multiplier architecture to optimize the filter architecture further.
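To illustrate what a multiplierless operation set looks like for one constant, the Python fragment below recodes a positive integer coefficient into canonical signed digits (CSD) and multiplies by it using only shifts, additions, and subtractions. This is a single-constant sketch with hypothetical function names. It does not perform the sharing of partial terms across several coefficients that the MCM approach relies on.

```python
def csd_digits(c):
    """Canonical signed digits of a positive integer, least significant first.

    Each digit is -1, 0, or +1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)      # +1 when c % 4 == 1, -1 when c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def shift_add_multiply(x, c):
    """Compute x * c using only shifts, additions, and subtractions."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

print(csd_digits(7), shift_add_multiply(5, 7))
```

For the coefficient 7 the recoding is 8 - 1, so the product needs one shift and one subtraction instead of the two additions that the plain binary form 4 + 2 + 1 would require.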

Definition 3 (optimization and automation of filter HDL). An environment needs to be developed to obtain HDL descriptions of retimed filters, in which the user can choose different data path element architectures depending on the specifications. This reduces time to market and helps to evaluate many hardware implementation trade-offs. Filter equivalence checking after applying a high level transformation is also required and needs to be developed as part of the optimization environment.

Principle of Shortest Path and MCM Algorithm. Several FPGA synthesis algorithms have been proposed specifically for sequential circuits. In [16], the authors have proposed how to

Figure 2: 4th-order elliptic filter after clock period minimization retiming: (a) DFG after retiming; (b) clock period (in time units) and register count before and after retiming.

Figure 3: (a) Graph before register sharing. (b) Graph after register sharing.

map retimed circuits onto FPGAs efficiently. In this paper, however, the authors suggest a method for an efficient retiming process using FPGA based path solvers. This can be applied to any retiming technique available in the literature. The shortest path in the filter DFG is solved using the Floyd-Warshall algorithm, which uses dynamic programming to solve the shortest-paths problem on a DFG. The Floyd-Warshall algorithm solves the shortest path problem in $O(n^{3})$ time, where $n$ is the number of nodes in the DFG. Let the vertices in the graph be numbered $1, 2, \ldots, n$, and let $d_{ij}(k)$ denote the weight of the shortest path from $i$ to $j$ such that all intermediate vertices are contained in the set $\{1, 2, \ldots, k\}$. Consider the subset $\{1, 2, \ldots, k\}$ of these $n$ vertices and a shortest path $p$ from vertex $i$ to vertex $j$ that uses only vertices in this subset as intermediate vertices. If $p$ passes through $k$, it decomposes into $i \to k \to j$. Then there are two situations possible:

(i) 119896 is an intermediate vertex on the shortest path

(ii) 119896 is not an intermediate vertex on the shortest path

If the vertex 119896 is not an intermediate vertex on 119901 then

119889119894119895(119896) = 119889

119894119895(119896 minus 1) else 119889

119894119895(119896) = 119889

119894119896(119896 minus 1) + 119889

119896119895(119896 minus 1)

(5)

In either case the subpaths contain nodes from 1 2 (119896minus

1) Therefore

119889119894119895(119896) = 119889

119894119895(119896 minus 1) + 119889

119896119895(119896 minus 1) (6)

When 119896 = 0 then

119889119894119895(0) = 119882

119894119895

and 119894119891 1198960 then 119889119894119895(119896) = min 119889

119894119895(119896 minus 1) + 119889

119894119895(119896 minus 1)

(7)

Let $D$ be the incidence matrix, initially holding the graph edge weight information $W$. $D$ is then updated with the calculated shortest paths; see Algorithm 1.

The final $D$ matrix will store all the shortest paths. This algorithm is extended for retiming of digital filters.
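As a software illustration of recurrence (7) and Algorithm 1, the following minimal Python sketch updates a single matrix in place. It is not the FPGA implementation described in this paper; the function name `floyd_warshall` and the 4-node weight matrix `W` are hypothetical and chosen only for illustration.

```python
import math

INF = math.inf  # stands for "no edge" in this software sketch


def floyd_warshall(weights):
    """Return the all-pairs shortest-path matrix for a square weight matrix."""
    n = len(weights)
    dist = [row[:] for row in weights]  # D(0) = W
    for k in range(n):          # allow vertex k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k  # recurrence (7)
    return dist


# Hypothetical 4-node DFG edge weights (illustrative values only).
W = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
D = floyd_warshall(W)
for row in D:
    print(row)
```

For this example the shortest path from node 0 to node 3 goes through nodes 1 and 2 (weight 3 + 2 + 1 = 6) instead of using the direct edge of weight 7.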

The multiple constant multiplication (MCM) problem is addressed in the literature [14] using either graph based methods or the common subexpression elimination method. In the common subexpression elimination algorithm, all possible subexpressions are extracted for a variable. But this is possible only if it is defined as minimum signed digit and as canonical signed digit. Then the subexpression is found such that it can be shared by multiple constant multiplication values. In this paper the above two concepts are extended for automatic pipelining and retiming of digital filters in high

Figure 4: 4th-order elliptic filter after register minimization retiming. (a) DFG after retiming. (b) Clock period and register count before and after retiming.

n = number of rows in W; $D^{(0)} = W$
for k = 1 to n
    for i = 1 to n
        for j = 1 to n
            $d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)$
        end for
    end for
end for
return $D^{(n)}$

Algorithm 1

level synthesis. In all digital filters, the filter coefficients are known beforehand. Hence the full flexibility of the multiplier is not necessary and we can make use of MCM designs. This method is more efficient when compared to shift and add multiplications, as intermediate results can be shared, which reduces the area of the multiplierless implementation of digital filters. The sharing of intermediate results will provide potential area saving with increased filter order (Figure 5).

Consider the filter coefficient set which is to be used for the filter design, given by $T = \{c_1, c_2, c_3, \ldots, c_n\}$. We need to find the smallest set $S$ given by $\{a_1, a_2, a_3, \ldots, s_1, s_2, s_3, \ldots\}$, where $a$ (adders/subtractors) and $s$ (shifts) belong to $S$, such that the set is made of adders/subtracters, shifters, and $A$-operations. Here shift operations can also be shared across multiple points so that the output set is optimum. The $H_{\text{cub}}$ algorithm [8] is used to generate the corresponding DFG for the multiplier block implementing the parallel multiplications $c_1 \ast x, c_2 \ast x, \ldots, c_n \ast x$. The only operations used in the generated DAG and input design matrices are additions, subtractions, shifts, and negations. In this paper, the performance of MCM based filter designs is further improved by combining this approach with retiming. The multiplierless filter circuit is further retimed to reduce the overall clock period, which increases the clock frequency.

Consider $l_1$ and $l_2$ as two integers which specify left shifts, let $r \geq 0$ specify a right shift, and let $s$ be the sign bit, which can be 0 or 1. An $A$-operation is an operation with two integer inputs $u$ and $v$ and one fundamental output, which is defined as

$$A_p(u, v) = \left| (u \ll l_1) + (-1)^s (v \ll l_2) \right| \gg r = \left| 2^{l_1} u + (-1)^s 2^{l_2} v \right| 2^{-r}, \quad (8)$$

where $\ll$ is a left binary shift, $\gg$ is a right binary shift, and $p = (l_1, l_2, r, s)$ is the parameter set or the $A$-configuration of $A_p$. To preserve all significant bits of the output, $2^r$ must divide $2^{l_1} u + (-1)^s 2^{l_2} v$. The left shifts are limited to the bit width of the target. All $A$-operations are used to build the $A$-graph. For a given set of target filter coefficients $C$, we can find the set $S$ such that a multiplierless digital filter is designed.
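To make equation (8) concrete, the short Python sketch below evaluates a single $A$-operation and reproduces the constants of the Figure 5 example ($7Y$, $31Y$, and $45Y$). The function name `a_operation` and its keyword arguments are hypothetical names introduced here for illustration; they are not taken from the $H_{\text{cub}}$ implementation.

```python
def a_operation(u, v, l1=0, l2=0, r=0, s=0):
    """Evaluate |(u << l1) + (-1)^s * (v << l2)| >> r as in equation (8)."""
    total = (u << l1) + (-1) ** s * (v << l2)
    magnitude = abs(total)
    # 2^r must divide the sum so that no significant bit is lost.
    if magnitude % (1 << r) != 0:
        raise ValueError("2^r does not divide the sum")
    return magnitude >> r


# Constants from the Figure 5 example, built from the input multiple 1 (that is, Y).
o1 = a_operation(1, 1, l1=3, s=1)   # 8Y - 1Y  = 7Y
o2 = a_operation(1, 1, l1=5, s=1)   # 32Y - 1Y = 31Y
o3 = a_operation(o2, o1, l2=1)      # 31Y + (7Y << 1) = 31Y + 14Y = 45Y
print(o1, o2, o3)
```

The intermediate results `o1` and `o2` are reused to form `o3`, which is the sharing that makes an MCM block cheaper than three independent shift and add multipliers.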

Figure 5: Example for addressing the MCM problem in digital filters. (a) MCM block with input $Y$ and outputs $k_1 Y, k_2 Y, \ldots, k_5 Y$. (b) Shift and add realization with $O_1 = 7Y = 8Y - 1Y$, $O_2 = 31Y = 32Y - 1Y$, and $O_3 = 45Y = 31Y + 14Y$.

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the maximum iteration period in the filter design [12]. This can be reduced by retiming, which reduces the clock period and so increases the clock speed. To reduce the critical path, we need to find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. While retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects, which the designer can program to work like specific hardware. They give great speedup over general purpose processors for many long running algorithms. Hence, for high performance systems, FPGAs become a better choice. In the present work, path solvers are implemented on FPGA to increase the performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum delay path between the output node and the node causing the state change of the output node with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one among the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. The use of dedicated FPGA hardware can speed up this process with low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$ is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subset N$ and $O \subset N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. In this procedure of computing the critical path on FPGA, the vertices are sorted such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to the previous one, then its path length is the sum of its own computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m_i}, \quad (9)$$

where $N$ is the number of adder and multiplier elements in the topologically sorted vertices connected with zero delay edges. The delay of the circuit is given by $t_d = \max\{\gamma_m\}$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment, the steps below are used for critical path computation.

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows hold the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path, and its computational time is the critical path delay.
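The steps above can be sketched in software as a longest-path search over the zero-delay edges only. The Python sketch below is an illustration under stated assumptions, not the FPGA state machine of this paper: it uses the standard library `graphlib.TopologicalSorter` (Python 3.9 or later) in place of the path-joining matrix, and the function name `critical_path`, the node times, and the 4-node graph are hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path(node_time, edges):
    """node_time: {node: computation time}; edges: list of (u, v, delay).

    Returns (critical path delay, list of nodes on the critical path).
    """
    # Keep only zero-delay edges, stored as predecessor lists.
    preds = {v: [] for v in node_time}
    for u, v, delay in edges:
        if delay == 0:
            preds[v].append(u)

    gamma, parent = {}, {}
    # static_order() yields every node after all of its zero-delay predecessors.
    for v in TopologicalSorter(preds).static_order():
        best = max(preds[v], key=lambda u: gamma[u], default=None)
        gamma[v] = node_time[v] + (gamma[best] if best is not None else 0)
        parent[v] = best

    end = max(gamma, key=gamma.get)   # node where the longest zero-delay path ends
    path, node = [], end
    while node is not None:           # walk back to the start of the path
        path.append(node)
        node = parent[node]
    return gamma[end], path[::-1]


# Hypothetical DFG: adders take 1 time unit, multipliers take 2.
times = {1: 1, 2: 2, 3: 1, 4: 2}
dfg_edges = [(1, 2, 0), (2, 3, 0), (1, 3, 0), (3, 4, 1), (4, 1, 1)]
delay, path = critical_path(times, dfg_edges)
print(delay, path)
```

In this example the zero-delay path 1, 2, 3 accumulates 1 + 2 + 1 = 4 time units, while node 4 is reached only through an edge carrying a register and therefore starts a new path.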

A critical path solver algorithm is designed in the present work on FPGA. The state diagram for the implemented critical path solver is given in Figure 6. In $S_0$, the filter graph or matrix is given as input to the critical path solver module.

Figure 6: Critical path solver state diagram (states $S_0$ to $S_8$; transition labels: incidence and node matrix obtained, zero-weight path updated, superset of zero-weight path found, all zero path delay found, next greater path delay found, greatest path delay found, weight temp updated, loop indices updated).

Algorithm for computing the critical path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of the node and $d$ is the initial delay on edge $E$
Output: critical path $C$
Sort all the vertices topologically in the DFG $G$, with $v$ following $u$ if there is a zero delay edge from $u \rightarrow v$
for all vertices $v$ from the sorted list
    if every edge into $v$ in $G$ has nonzero delay then
        $\gamma_v = t_v$
    else
        $\gamma_v = t_v + \max(\gamma_u)$ over the edges $e: u \rightarrow v$ in $G$ with $d_e = 0$
    end if
end for
$\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_m\}$, where $m$ = number of entries in the topologically sorted list
compute $\gamma_{\max} = \max(\gamma)$

Algorithm 2

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say $-1$, can be used to denote an uninitialized matrix element. In state $S_1$, all the zero weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called the zero weight path matrix. The zero weight path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column has the destination node of the directed zero-weight edge. Simultaneously, a count of the number of zero-weight edges is kept.

The state $S_2$ is provided to enable looping action and for updating all the signals. In state $S_3$, in each row of the zero weight path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node in any zero-weight path is the same as the source node in another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then the destination node in $b$ becomes the destination node of $a$. The state $S_4$ is provided to enable looping action and for updating all the signals. At the end of this state, the zero weight path matrix will contain only those paths that are a superset of the remaining zero weight paths. In state $S_5$, the module calculates the sum of all the node weights through each of these paths. State $S_6$ is provided for looping action and for updating all signals.

In state $S_7$, the path with the highest sum of node weights is found, which is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical


Algorithm for computing the shortest path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of the node and $d$ is the initial delay on edge $E$
Output: all pair shortest path matrix $M$
for i = 1 to N
    for j = 1 to N
        if i = j then M[i, j] = 0
        else M[i, j] = inf
    end for
end for
for all the edges $e: u \rightarrow v$, M[u, v] = d for edge e
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
        end for
    end for
end for
Output shortest path matrix M

Algorithm 3

Figure 7: Zero path delays and critical path for 4th-order low pass elliptic filter (horizontal axis: vertex number in filter DFG; vertical axis: zero delay path for filter DFG).

path matrix. The signals in this matrix are output as the critical path. The state $S_8$ is provided to enable looping action and for updating all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve a system of inequalities. It is seen that the time needed to compute the critical path on FPGA is reasonably less when compared to computation on a general purpose processor. This also reduces the retiming computation time. The zero delay paths are computed for the 4th-order elliptic filter shown in Figure 7. The highlighted path delay is from $1 \rightarrow 9 \rightarrow 5 \rightarrow 14 \rightarrow 8$, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$ and let $T(u, v)$ be the total computation time of the zero delay path from $u$ to $v$. If the condition $T(u, v) - \min\{t(u), t(v)\} >$ derived clock period holds, then those paths are selected for retiming so that the computation time in the path can be reduced. We have to retime the edges by constructing a system of linear inequalities. This can be done using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden of the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$, and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight temp. In state $S_2$, the signal weight temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$, and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight temp. In the state $S_4$, the elements of the signal matrix weight temp are copied to the output matrix.

Figure 8: Shortest path solver state diagram.

The state $S_5$ enables looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find the ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small and whose $A$-operations are composed of the minimum number of addition, subtraction, and shift operations. After this target set is obtained, multiplierless multiple constant multiplication filters can be designed with this target set. Multiple constant multiplication (MCM) is an efficient way of implementing

Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature, and the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and $A$-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is considered as $G_r$. In the present work, the $H_{\text{cub}}$ algorithm is used for $G_i$ computation. The input to the $H_{\text{cub}}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The $R$ set computation requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{ s \mid \operatorname{dist}(R, s) = 1 \} = A_s(R, R). \quad (11)$$
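The successor set of equation (11) can be enumerated by applying one add or subtract with a shift to every ordered pair of ready constants. The Python sketch below does this under assumptions that are not stated in this paper: constants are normalized to odd positive values (a common convention for fundamentals), shifts are limited to a hypothetical `bit_width`, and the names `make_odd` and `successor_set` are introduced only for illustration.

```python
def make_odd(x):
    """Strip trailing zero bits (a right shift) so that x becomes odd; 0 stays 0."""
    while x and x % 2 == 0:
        x //= 2
    return x


def successor_set(ready, bit_width=8):
    """Constants reachable from `ready` with exactly one A-operation."""
    limit = 1 << bit_width
    successors = set()
    for u in ready:
        for v in ready:
            for shift in range(bit_width + 1):
                for sign in (1, -1):
                    value = make_odd(abs(u + sign * (v << shift)))
                    # Keep new, nonzero constants that fit in the bit width.
                    if 0 < value < limit and value not in ready:
                        successors.add(value)
    return successors


first = successor_set({1})
print(sorted(first))
# After 7 and 31 join the ready set, 45 = 31 + (7 << 1) becomes a successor.
print(45 in successor_set({1, 7, 31}))
```

Starting from the ready set {1}, the constants 7 and 31 of Figure 5 are at distance 1, while 45 is not; it only enters the successor set once 7 and 31 have been synthesized.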

For the target set of constants $T$ for the considered filter graph $G$, the $H_{\text{cub}}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subset R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$. This can happen when all the targets are more than one $A$-operation away. The optimal part is when $T \cap S \neq \emptyset$; then there is a target in the successor set and it can be synthesized. If the entire target set is synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit;
(ii) cumulative benefit.

To build the heuristic, we can define the benefit function as $B(R, s, t)$:

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \quad (12)$$

A successor $s \in S$ needs to be picked which is closest to the target set to minimize the cost. This is possible if we can compute or estimate the $A$-distance. It is useful to also take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we can use the weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left( \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t) \right), \quad (13)$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{\text{cub}}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \quad (14)$$

Here the cumulative benefit heuristic adds up the weighted benefit considering all the targets. With this particular method the target set is calculated. With this target set, a filter graph which is multiplierless and MCM based can be designed. It is found that multiplierless designs reduce the combinational path delays due to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combinational delay and critical path

Figure 10: Multiplierless MCM based 4th-order elliptic filter.

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{\text{cub}}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities with a computational complexity of $O(n^3)$ using the Floyd-Warshall algorithm, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces computation time when compared to previous methods [8, 16] to design multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also presents a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated using the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here the 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in the multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
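The constraint graph step described above can be sketched in Python as follows. This is a minimal sketch, assuming the usual textbook formulation: every inequality $r(i) - r(j) \leq k$ becomes an edge $j \rightarrow i$ of length $k$, an extra source node reaches every node at cost 0, and the retimed weight of an edge $u \rightarrow v$ is $w + r(v) - r(u)$. The function names, the 3-node ring, and the hand-written critical path constraints for a target clock period of 2 are hypothetical and only illustrate the flow.

```python
import math


def solve_retiming(num_nodes, constraints):
    """constraints: list of (i, j, bound) meaning r[i] - r[j] <= bound.

    Returns a list of retiming values, or None if the system is infeasible.
    """
    n = num_nodes + 1                       # last index is the extra source node
    dist = [[0 if a == b else math.inf for b in range(n)] for a in range(n)]
    for i, j, bound in constraints:
        dist[j][i] = min(dist[j][i], bound)  # edge j -> i of length bound
    for i in range(num_nodes):
        dist[num_nodes][i] = 0               # source reaches every node at cost 0
    for mid in range(n):                     # Floyd-Warshall on the constraint graph
        for a in range(n):
            for b in range(n):
                if dist[a][mid] + dist[mid][b] < dist[a][b]:
                    dist[a][b] = dist[a][mid] + dist[mid][b]
    if any(dist[a][a] < 0 for a in range(n)):
        return None                          # negative cycle: no solution
    return dist[num_nodes][:num_nodes]


def retime(edges, r):
    """Apply retiming values: new weight of u -> v is w + r[v] - r[u]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


# Hypothetical ring 0 -> 1 -> 2 -> 0 with two registers on the feedback edge.
ring = [(0, 1, 0), (1, 2, 0), (2, 0, 2)]
feasibility = [(u, v, w) for u, v, w in ring]          # r(u) - r(v) <= w(e)
critical = [(0, 2, -1), (1, 0, 1), (2, 1, 1)]          # hand-derived for period 2
r = solve_retiming(3, feasibility + critical)
print(r, retime(ring, r))
```

With unit node delays, the original ring has a zero-delay path through all three nodes; after retiming one register moves onto the edge from node 0 to node 1, so the longest zero-delay path covers two nodes while the total register count in the loop stays at 2.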

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as a part of the research work. Initially the design of filters is performed using a retimed architecture, where the user can choose either clock period

Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of 4th-order elliptic filter.

minimization or register minimization retiming as needed. The tool will retime the digital filter by optimizing the critical path and generate Verilog/VHDL based filter RTL for the same. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file which was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy to use GUI helps the designer to optimize and generate digital filter RTL with the chosen adder and multipliers. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are as below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \quad B = \sum_{j=0}^{n-1} B_j 2^j, \quad P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \quad (15)$$

In each stage the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, for an $m \ast n$ array multiplier we need $m \ast n$ AND gates, $n$ half adders, and $(m - 2) \ast n$ full adders.
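Equation (15) can be checked with a bit-level Python sketch in which each row of AND gates forms one partial product. This is a behavioral model for illustration only, not the synthesized RTL of the tool; the name `array_multiply` is hypothetical.

```python
def array_multiply(a, b, n):
    """Unsigned n-bit array multiplication as the double sum of equation (15)."""
    product = 0
    for j in range(n):
        b_j = (b >> j) & 1
        row = 0
        for i in range(n):
            a_i = (a >> i) & 1
            row |= (a_i & b_j) << i      # one AND gate per bit pair
        product += row << j              # add the partial product at weight 2^j
    return product


print(array_multiply(13, 11, 4))
```

For 4-bit operands the model forms 16 bit products, which matches the $m \ast n$ AND gate count quoted above.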

Radix 4 Booth Multiplier. It has the advantage of lesser area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. This Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, there are only sixteen partial products that need to be added instead of the 32 partial products generated using a conventional multiplier.
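The halving of the partial product count can be seen in a small Python model of radix-4 Booth recoding. This is a behavioral sketch using the standard modified Booth table, not the encoder hardware of this project; the function names are hypothetical.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode a two's-complement multiplier of even width `bits` into digits in {-2..2}."""
    if bits % 2:
        raise ValueError("bits must be even")
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    mask = (1 << bits) - 1
    extended = (multiplier & mask) << 1      # implicit 0 to the right of the LSB
    # Scan overlapping 3-bit groups, stepping two bits at a time.
    return [table[(extended >> i) & 0b111] for i in range(0, bits, 2)]


def booth_multiply(multiplicand, multiplier, bits):
    """Sum one partial product per radix-4 digit."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum((d * multiplicand) << (2 * k) for k, d in enumerate(digits))


print(len(booth_radix4_digits(12345, 32)))   # number of partial products for 32 bits
print(booth_multiply(-7, -3, 4))
```

Each digit selects 0, the multiplicand, or twice the multiplicand with a sign, so a 32-bit multiplier yields sixteen partial products.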

Vedic Multiplier. It is used for faster multiplication operations at higher bit widths. It has less combinational path delay [21] compared with the others when the bit size is higher. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam Sutra, which means "vertically and crosswise" and is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.
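The vertical and crosswise pattern can be illustrated with a bit-level Python sketch in which each output column sums all bit products of equal weight and passes a carry to the next column. This is a behavioral model written for illustration, not the hardware structure evaluated in this paper; the name `urdhva_multiply` is hypothetical.

```python
def urdhva_multiply(a, b, n):
    """Unsigned n-bit multiplication using vertical and crosswise column sums."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry = 0, 0
    for k in range(2 * n - 1):
        # Crosswise step: all products a_i * b_j with i + j = k.
        column = sum(a_bits[i] * b_bits[k - i]
                     for i in range(max(0, k - n + 1), min(k, n - 1) + 1))
        total = column + carry
        result |= (total & 1) << k       # bit k of the product
        carry = total >> 1               # carried into column k + 1
    return result | (carry << (2 * n - 1))


print(urdhva_multiply(13, 11, 4))
```

All column sums are independent of each other apart from the carry, which is why the partial products can be generated in parallel in hardware.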

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit adder is constructed by cascading full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage. An $n$-bit parallel adder requires $n$ full adders.
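A behavioral Python sketch of this cascade is given below; the function names `full_adder` and `ripple_carry_add` are hypothetical and the model is intended only to show how the carry ripples from stage to stage.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder returning (sum bit, carry out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def ripple_carry_add(a, b, n, carry_in=0):
    """Add two unsigned n-bit numbers with n cascaded full adders."""
    carry, total = carry_in, 0
    for i in range(n):
        # The carry-out of stage i is the carry-in of stage i + 1.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(11, 6, 4))
```

Because every stage waits for the carry of the previous one, the delay grows linearly with $n$, which motivates the parallel prefix adders discussed next.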

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using a Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


Table 1: Comparison of multipliers for delay, power, and area.

Type of multiplier   Delay in ns               Power in mW               Number of LUTs
                     32 bit  16 bit  8 bit     32 bit  16 bit  8 bit     32 bit  16 bit  8 bit
Array                761     399     21        21      11      7         1519    375     91
Booth                861     2799    149       25      15      12        1277    317     77
Vedic                707     3902    244       28      18      12        2378    565     126

The basic equations used in parallel prefix adders are given below. The equations of bit generate and propagate are

$$G_{0:0} = G_0 = c_{\text{in}}, \qquad P_{0:0} = P_0 = 0,$$

$$G_{i:j} = G_{i:k} + p_{i:k} \cdot g_{(k-1):j}, \qquad P_{i:j} = P_{i:k} \cdot p_{(k-1):j}. \tag{16}$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \tag{17}$$
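The recurrences in (16) and (17) can be checked with a short Python sketch. It assumes a Kogge-Stone style schedule (each step doubles the span of every group), which is one possible prefix network and not necessarily the one synthesized in this work; `prefix_add` is a hypothetical name. Index 0 holds the carry-in ($G_0 = c_{\text{in}}$, $P_0 = 0$), and operand bit $i-1$ sits at index $i$.

```python
def prefix_add(a: int, b: int, n: int = 8, cin: int = 0) -> tuple[int, int]:
    """Add two n-bit unsigned integers using group generate/propagate signals."""
    g = [cin] + [((a >> i) & (b >> i)) & 1 for i in range(n)]  # bit generate
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # bit propagate
    G, P = g[:], p[:]
    d = 1
    while d <= n:
        # Combine each group with the group ending d positions below it (Eq. 16).
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n + 1)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n + 1)]
        d *= 2
    total = 0
    for i in range(1, n + 1):
        total |= (p[i] ^ G[i - 1]) << (i - 1)  # S_i = P_i XOR G_{(i-1):0} (Eq. 17)
    return total, G[n]                          # G[n] is the carry-out
```

The number of combining steps grows with $\log_2 n$ rather than $n$, which is the source of the speed advantage over the ripple adder.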

Different adders are designed for different bit sizes, and their VLSI design metrics are compared as shown in Table 2. The delay reported is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one constraint among area, power, or speed for digital filter HDL generation. Along with this, an option is created for a multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{\text{cub}}$ based algorithm is used for implementing MCM blocks in multiplierless digital filters for the specific user-defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, they are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of different components, and this depth is assumed to be infinity in our work. The tool DiFiDOT automatically generates directly synthesizable HDL for the retimed digital filter under consideration. With this tool and automation, even if the design cycle must be reiterated due to a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block; the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the synthesis and timing results after synthesis are shown in Table 3.

Table 2: Comparison of adders for delay, power, and area.

Type of       Delay in ns              Power in mW              Number of LUTs
adder         32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Ling          8854    1524    2021     6       9       18       23      53      107
Brent-Kung    104     1839    2583     4       6       9        15      30      63
Ripple        1212    2063    376      2       7       14       9       18      36

Table 3: Device utilization and timing summary of path solvers.

Path solver name      Device utilization summary          Timing summary                                        Max frequency (Hz)
                      Logic utilization           Used    Min period in ns  Setup time in ns  Hold time in ns
Critical path solver  Number of slices            5804    9068 ns           1572 ns           6141 ns           110277
                      Number of LUTs              10462
                      Number of slice flip-flops  3664
Shortest path solver  Number of slices            4147    14089 ns          10477 ns          4114 ns           70978
                      Number of LUTs              7511
                      Number of slice flip-flops  1496

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in far less time than the general purpose processor based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the algorithms run on the general purpose processor in the MATLAB environment. The time overhead for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial compared with designs without FPGA based path solvers. These time gains speed up the results, which is crucial for retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the filter's iteration bound. Here three models are considered: Model 1 is the filter without retiming, with adder, subtractor, multiplier, and delay elements; Model 2 is the retimed filter based on the clock period minimization algorithm; Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced while compromising the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks. [Bar chart over IIR-2, FIR-2, IIR-4, FIR-4, IIR-6, FIR-6, IIR-8, FIR-8, IIR-10, FIR-10, IIR-12, and FIR-12; series: clock period and register count for Models 1, 2, and 3; vertical axis from 0 to 40.]

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, the user can choose any of the available components for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

         Critical path solver algorithm                    Shortest path solver algorithm
         IIR filter              FIR filter                IIR filter              FIR filter
Filter   FPGA based  GPP based   FPGA based  GPP based     FPGA based  GPP based   FPGA based  GPP based
order    (ns)        (ms)        (ns)        (ms)          (ns)        (ms)        (ns)        (ms)
2        460         138         906         1283          278         305         305         1280
4        1571        1578        1631        1446          368         1391        1391        1319
6        2998        1918        1923        1547          398         1542        1542        1731
8        3162        2190        2971        1642          452         2523        2523        3294
10       3981        2627        3653        1861          536         4293        4293        4534
12       4672        3142        4328        2352          671         5534        5534        5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

                                                Before retiming                         After retiming
Filter   Adder/multiplier combination           Number   Max operating  Power           Number   Max operating  Power
block                                           of LUTs  freq. in MHz   in mW           of LUTs  freq. in MHz   in mW
IIR-10   Brent-Kung adder/array multiplier      2222     62526          99              2411     76977          89
         Ling adder/Vedic multiplier            2214     69702          112             2193     95381          94
         Ripple carry adder/Booth multiplier    2146     50861          114             1809     65248          95
FIR-10   Brent-Kung adder/array multiplier      1736     62526          94              1811     9943           85
         Ling adder/Vedic multiplier            2162     72493          111             2271     10072          95
         Ripple carry adder/Booth multiplier    1637     52302          105             1615     71345          87

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. Here it is ensured that the functionality of the circuits before and after retiming is retained. The frequency improvement seen for different filters across the above models is given in Figure 14. It is seen that the operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. [Bar chart; horizontal axis: filter type (FIR-2 to FIR-12, IIR-2 to IIR-12); vertical axis: frequency improvement, 0 to 90; series: frequency improvement from model 1 to model 2 and from model 1 to model 3.]

Table 6: Comparison of area, delay, and power for different models of various digital filters.

Filter   Adder, multipliers, flip-flops    Delay/max. freq. in MHz         Power in watts
block    Model 1   Model 2   Model 3       Model 1   Model 2   Model 3     Model 1   Model 2   Model 3
FIR-2    523       503       504           5154      19214     34062       0056      0063      0065
FIR-4    1035      1105      1108          5941      10841     22204       0047      0057      0060
FIR-6    727       1707      17014         6291      6764      25947       0051      0062      0064
FIR-8    1559      2209      22016         5482      6592      11791       0054      0058      0065
FIR-10   18611     25011     25011         4822      5637      10072       0058      0061      0063
FIR-12   20713     29013     29013         4634      5486      19340       0060      0063      0067
IIR-2    943       1103      1103          5503      7553      8910        0047      0050      0050
IIR-4    1675      2005      1906          2278      11388     15165       0059      0062      0063
IIR-6    24117     3507      3508          3871      4254      53142       0051      0059      0058
IIR-8    301110    33010     3307          2946      7014      11021       0044      0064      0081
IIR-10   371613    54013     54014         3643      4885      95381       0051      0067      0085
IIR-12   422017    63017     63019         3973      5074      10152       0063      0071      0088

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded.
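The paper does not list the coefficients of the filter used for power-line noise removal. Purely as an illustration of the principle, the Python sketch below applies a generic second-order notch filter at 50 Hz. The sampling rate of 500 Hz, the pole radius `r`, and the name `notch_filter` are assumptions made for this example and are not taken from the paper; the filter is also not gain-normalized.

```python
import math


def notch_filter(x, f0=50.0, fs=500.0, r=0.95):
    """Second-order IIR notch: zeros on the unit circle at f0, poles at radius r."""
    w0 = 2.0 * math.pi * f0 / fs
    b = (1.0, -2.0 * math.cos(w0), 1.0)          # numerator (zeros at +/- w0)
    a = (1.0, -2.0 * r * math.cos(w0), r * r)    # denominator (poles near the zeros)
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


# A pure 50 Hz interference tone is removed once the start-up transient has decayed.
hum = [math.sin(2.0 * math.pi * 50.0 * n / 500.0) for n in range(1000)]
residual = max(abs(v) for v in notch_filter(hum)[-200:])
```

In a multiplierless realization, the constant multiplications by the coefficients in `a` and `b` are the operations that an MCM block would replace with shifts and additions.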

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves much of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and the shortest path in the retiming computation, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain compared with filter designs for which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that best fits an application.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient indian vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "Fir filter design based on retiming and automation using vlsi design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


Figure 1: (a) Block diagram of 4th-order low pass elliptic filter; (b) DFG of elliptic filter block.

optimization techniques to obtain a filter solution with reduced clock period, which in turn increases the filter speed. The critical path can be obtained by observing the DFG in Figure 1(b). The critical path is node 1 → 9 → 5 → 14 → 8.

Retiming Transformation. Retiming is a high level transformation technique in which the location of the registers is altered in such a way that the overall clock period is reduced, thereby increasing the clock frequency [13]. This happens due to a reduction in the critical path, which bounds the speed of the design. Through intelligent placement of registers, the clock period is minimised without altering the filter functionality. The critical path is the longest computation path between computational elements [14] or delay elements. The critical path can also be minimized by inserting delay elements on the primary inputs of the filter circuit and retiming the circuit. This is called the automatic pipelining technique. Both methods are used to find the best solution in the present work. Retiming for filter optimization is found to be an NP-complete problem, and the time to find the solution increases as the problem size increases. There are two ways of applying the retiming transformation:

(i) retiming using clock period minimization method;

(ii) retiming using register minimization method.

The retiming algorithm for clock period minimization is efficient in terms of clock frequency improvement. Its computational complexity is $O(n^3 \log n)$, where $n$ is the number of nodes, that is, computation elements such as adders and multipliers. The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also presents a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge.

(i) Calculate $M = t_{\max}\, n$, where $n$ is the number of nodes in the original DFG $G$ and $t_{\max}$ is the maximum computation time of all the nodes in the DFG. Also compute the critical path, which defines the required clock period of the original graph.

(ii) Create a new DFG $G^*$ from $G$. $G^*$ has the same nodes and edges as $G$. For each edge in $G^*$, the edge weight is $W^*(e) = M\,W(e) - t(U)$, where $W(e)$ is the weight of the same edge in $G$ and $t(U)$ is the computation time of the node initiating this edge.

(iii) Apply the Floyd-Warshall shortest path algorithm to compute $S^*_{UV}$, which represents the shortest path from node $U$ to node $V$.

(iv) From $S^*_{UV}$, $W_{UV}$ and $D_{UV}$ are calculated. If $U \neq V$, then $W_{UV} = \lceil S^*_{UV}/M \rceil$ and $D_{UV} = M\,W_{UV} - S^*_{UV} + t(V)$. If $U = V$, then $W_{UV} = 0$ and $D_{UV} = t(U)$. Here $t(U)$ and $t(V)$ represent the computation times of node $U$ and node $V$, respectively.

(v) Find the maximum and the minimum values of $D_{UV}$. Check all the possible clock periods one by one, starting from the maximum value of $D_{UV}$ down to the minimum value of $D_{UV}$. If a clock period gives a feasible solution, we stop and find the minimal clock period by solving for the solution's critical path. The solution contains the retiming values for all nodes.

Here a 4th-order low pass IIR elliptic filter is designed and retimed using the above mentioned clock period minimization algorithm. The retimed data flow graph with reduced clock period is shown in Figure 2.
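Steps (i) to (iv) can be sketched in Python as follows. This is a minimal illustration, not the MATLAB or FPGA implementation used in this work: the function name `wd_matrices` and the four-node example graph are assumptions chosen for readability, and the example is not the elliptic filter of Figure 1. The feasibility test of step (v) is omitted; the sketch only lists the candidate clock periods.

```python
import math


def wd_matrices(t, edges):
    """t[u]: computation time of node u; edges: (u, v, w) with w delays on edge u -> v."""
    n = len(t)
    M = max(t) * n                                   # step (i): M = t_max * n
    S = [[0 if u == v else math.inf for v in range(n)] for u in range(n)]
    for u, v, w in edges:                            # step (ii): W*(e) = M*W(e) - t(U)
        S[u][v] = min(S[u][v], M * w - t[u])
    for k in range(n):                               # step (iii): Floyd-Warshall
        for i in range(n):
            for j in range(n):
                if S[i][k] + S[k][j] < S[i][j]:
                    S[i][j] = S[i][k] + S[k][j]
    W = [[None] * n for _ in range(n)]
    D = [[None] * n for _ in range(n)]
    for u in range(n):                               # step (iv)
        for v in range(n):
            if u == v:
                W[u][v], D[u][v] = 0, t[u]
            elif S[u][v] != math.inf:
                W[u][v] = -(-S[u][v] // M)           # integer ceiling of S*/M
                D[u][v] = M * W[u][v] - S[u][v] + t[v]
    return W, D


# Hypothetical DFG: nodes 0 and 1 are adders (1 time unit), nodes 2 and 3 multipliers (2 units).
t = [1, 1, 2, 2]
edges = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]
W, D = wd_matrices(t, edges)
candidate_periods = sorted({d for row in D for d in row if d is not None})
```

The unique entries of `D` are the candidate clock periods examined in step (v). Each shortest path computation costs $O(n^3)$ operations, which is why the path solver dominates the run time of retiming.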

After applying the retiming transformation to the filter, the critical path changes to 2 → 1 → 9.

Since each delay element occupies about one-third of a binary adder, it is important to reduce the number of delay elements [11]. In retiming using register minimization, we can obtain the digital filter that uses the minimum number of registers and satisfies the clock period constraints [15]. Here forward splitting or register sharing [12] is used. If a node has several output edges carrying the same signal, the number of registers required to implement these edges is the maximum number of registers on any one of the edges. Consider Figure 3. The maximum number of registers required in Figure 3(a) is 6, whereas after register sharing this is reduced to 3, as shown in Figure 3(b).
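The effect of register sharing can be expressed in a few lines of Python. The edge weights 2, 3, and 1 are assumed here because they reproduce the counts of 6 and 3 quoted for Figure 3; the function name `register_cost` is hypothetical.

```python
def register_cost(fanout_delays):
    """fanout_delays maps each node to the register counts on its output edges."""
    # Without sharing, every edge carries its own registers.
    unshared = sum(sum(ws) for ws in fanout_delays.values())
    # With sharing, a node needs only the maximum count over its output edges.
    shared = sum(max(ws, default=0) for ws in fanout_delays.values())
    return unshared, shared


unshared, shared = register_cost({"V": [2, 3, 1]})
```

With these assumed weights, `unshared` is 6 and `shared` is 3: the three output edges tap a single chain of three registers at different depths.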

The number of registers needed to implement the output edges $e$ of node $V$ in the retimed graph, with retimed edge weights $W_r$, and the total cost are

$$R_V = \max_{e} \left( W_r(e) \right), \qquad \text{Cost} = \sum R_V. \tag{4}$$

The cost is minimized subject to

(i) fan-out constraints: $R_V \ge W_r(e)$ for all $V$ and all edges $V \xrightarrow{e}$ any other vertex;

(ii) feasibility constraints: $r(U) - r(V) \ge W(e)$ for every edge $U \xrightarrow{e} V$;

(iii) clock period constraints: $r(U) - r(V) \ge W(U,V) - 1$ for all vertices such that $D(U,V) \ge c$, where $c$ is the clock period.

This method makes use of gadgets to represent the nodes with multiple output edges. The register minimization retiming can be modeled as a linear programming problem. A dummy node with zero computation time is introduced in the gadget. The weight of the gadget edge associated with $e_i$ is defined to be $W_{\max} - W(e_i)$, where $W_{\max} = \max_{1 \le i \le k} W(e_i)$ and $k$ is the number of output edges. A parameter $\beta$, the breadth, is also used to model the memory required by edge $e_i$. The breadth of each edge is $1/k$. A binary search is performed over the clock period. The register minimization retiming values can be obtained as follows.

(i) Use the gadget model of the graph to compute the cost function.

(ii) Calculate $S'$ by using the Floyd-Warshall shortest path algorithm.

(iii) Compute the $D(U,V)$ and $W(U,V)$ matrices from the original graph and the $S'$ matrix.

(iv) Perform the LP formulation such that the cost function is minimized subject to the feasibility and clock period constraints.

This LP problem is solved to obtain the retiming solution that minimizes the number of registers while satisfying the clock period. Figure 4 shows the DFG of the 4th-order low pass IIR elliptic filter. It is observed that register minimization retiming provides a filter solution with reduced register count for a reduced clock period. However, in some cases the clock period reduction is smaller than that achieved by the clock period minimization retiming technique, as priority is given to the register count. For the considered elliptic filter, for a clock period of 4 units, the register count is minimized to 9. After applying the register minimization retiming transformation to the filter, the critical path changes to 1 → 9 → 5.

Problem Formulation. Critical path and shortest path solving contribute most of the computation time in retiming.

Definition 1 (the path solver problem). Let $S = \{s_0, s_1, s_2, s_3, \ldots, s_k\}$, where $k$ is the maximum number of feasible solutions available for retiming of a considered filter DFG. During retiming of digital filters in high level synthesis, the shortest path between the nodes must be computed $(k + 1)$ times, where $k$ is the number of feasible solutions available for the DFG, that is, the number of unique entries in the path delay matrix $D$. Similarly, the critical path must be computed $(k + 1)$ times. General purpose processors (GPPs), where the retiming algorithm is implemented, are fully programmable but are less efficient in terms of power and performance. Hence the problem is to improve the performance and power of retiming using FPGA based path solvers. Further, along with retiming, the high level transformation technique called automatic pipelining is applied to improve the filter speed.

Definition 2 (multiple constant multiplication in digital filters). For the considered filter coefficient constant $T$ in the retimed filters, find the set of multiplierless operations $\{O_1, O_2, O_3, \ldots, O_n\}$ with the minimum number of addition, subtraction, and shift operations using the multiple constant multiplier architecture to optimize the filter architecture further.
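To make the idea of multiplierless operations concrete, the Python sketch below rewrites a single constant in canonical signed-digit (CSD) form and multiplies by it using only shifts, additions, and subtractions. This is a deliberately simpler technique than the $H_{\text{cub}}$ based MCM algorithm used in this work, since it treats each constant separately and shares no intermediate terms between constants; the function names are hypothetical.

```python
def csd_digits(c: int) -> list[int]:
    """Canonical signed-digit form of c >= 0, least significant digit first."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)   # +1 if c % 4 == 1, -1 if c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x: int, c: int) -> int:
    """Compute c * x using one shifted copy of x per nonzero CSD digit."""
    return sum(d * (x << i) for i, d in enumerate(csd_digits(c)))


def adder_count(c: int) -> int:
    """Adders/subtractors needed for one constant: nonzero digits minus one."""
    return max(sum(1 for d in csd_digits(c) if d) - 1, 0)
```

For example, the constant 7 becomes $8 - 1$, so `7 * x` needs one subtractor and one shift instead of a multiplier. An MCM algorithm reduces the adder count further by reusing intermediate sums across all coefficients of the filter.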

Definition 3 (optimization and automation of filter HDL). An environment needs to be developed to obtain the HDL of retimed filters, in which the user can choose different data path element architectures depending on the specifications. This reduces time to market and helps to evaluate many hardware implementation trade-offs. Filter equivalence checking after applying the high level transformation also needs to be performed and must be developed as part of the optimization environment.

Principle of Shortest Path and MCM Algorithm. Several FPGA synthesis algorithms have been proposed specifically for sequential circuits. In [16], the authors have proposed how to map retimed circuits onto FPGAs efficiently. However, in this paper the authors suggest a method for an efficient retiming process using FPGA based path solvers. This can be applied to any retiming technique available in the literature.

Figure 2: 4th-order elliptic filter after clock period minimization retiming. (a) DFG after retiming; (b) clock period and register count before and after retiming.

Figure 3: (a) Graph before register sharing. (b) Graph after register sharing.

The shortest path is solved in the filter DFG using the Floyd-Warshall algorithm. The Floyd-Warshall algorithm uses a dynamic programming approach to solve the shortest-paths problem on a DFG. It can solve the shortest path problem in $O(n^3)$ time, where $n$ is the number of nodes in the DFG. Let the vertices in the graph be numbered $1, 2, \ldots, n$, and consider the subset $\{1, 2, \ldots, k\}$ of these $n$ vertices. Let $d_{ij}^{(k)}$ denote the weight of the shortest path $p$ from vertex $i$ to vertex $j$ such that all intermediate vertices are contained in the set $\{1, 2, \ldots, k\}$. If $k$ lies on $p$, the path is decomposed into $i \rightarrow k \rightarrow j$. Then there are two situations possible:

(i) $k$ is an intermediate vertex on the shortest path;

(ii) $k$ is not an intermediate vertex on the shortest path.

If the vertex $k$ is not an intermediate vertex on $p$, then

$$d_{ij}^{(k)} = d_{ij}^{(k-1)}, \quad \text{else} \quad d_{ij}^{(k)} = d_{ik}^{(k-1)} + d_{kj}^{(k-1)}. \tag{5}$$

In either case the subpaths contain only nodes from $\{1, 2, \ldots, (k-1)\}$. Therefore

$$d_{ij}^{(k)} = \min\left\{ d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right\}. \tag{6}$$

When $k = 0$, then

$$d_{ij}^{(0)} = W_{ij}, \quad \text{and if } k \geq 1 \text{ then } d_{ij}^{(k)} = \min\left\{ d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right\}. \tag{7}$$

Let $D$ be the incidence matrix, initialized with the graph edge weight information $W$. $D$ is then updated with the calculated shortest paths; see Algorithm 1.

The final $D$ matrix stores all the shortest paths. This algorithm is extended for retiming of digital filters.
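The recurrence for $d_{ij}^{(k)}$ can be illustrated with a short software sketch. This is only a reference model in Python, not the FPGA implementation of this paper; the function name `floyd_warshall`, the use of `math.inf` for missing edges, and the 4-node weight matrix are assumptions made for the example.

```python
import math

INF = math.inf  # stands for "no edge"; an assumption of this sketch


def floyd_warshall(weights):
    """Return the all-pairs shortest path matrix for a square weight matrix."""
    n = len(weights)
    dist = [row[:] for row in weights]  # D(0) = W, copied so W is not modified
    for k in range(n):                  # allow vertex k as an intermediate vertex
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == INF:
                continue                # no path i -> k, nothing to relax
            for j in range(n):
                candidate = d_ik + dist[k][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


# Hypothetical 4-node example graph (not taken from the filter DFG).
W = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
D = floyd_warshall(W)
```

The three nested loops make the $O(n^3)$ cost explicit, which is the part of retiming that the path solver hardware is meant to accelerate.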

n = number of rows in $W$; $D^{(0)} = W$
for k = 1 to n
    for i = 1 to n
        for j = 1 to n
            $d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)$
        end for
    end for
end for
return $D^{(n)}$

Algorithm 1

Figure 4: 4th-order elliptic filter after register minimization retiming. (a) DFG after retiming; (b) clock period and register count before and after retiming.

The multiple constant multiplication (MCM) problem is addressed in the literature [14] using either graph based methods or the common subexpression elimination method. In the common subexpression elimination algorithm, all possible subexpressions are extracted for a variable. This is possible only if the constant is represented in minimum signed digit or canonical signed digit form. A subexpression is then found such that it can be shared by multiple constant multiplication values. In this paper the above two concepts are extended for automatic pipelining and retiming of digital filters in high level synthesis. In all digital filters the filter coefficients are known beforehand. Hence the full flexibility of a multiplier is not necessary and we can make use of MCM designs. This method is more efficient than separate shift-and-add multiplications because intermediate results can be shared, which reduces the area of the multiplierless implementation of digital filters. The sharing of intermediate results provides a growing potential area saving as the filter order increases (Figure 5).

Consider the filter coefficient set to be used for the filter design, given by $T = \{c_1, c_2, c_3, \ldots, c_n\}$. We need to find the smallest set $S$ given by $\{a_1, a_2, a_3, \ldots, s_1, s_2, s_3, \ldots\}$, where $a$ (adders/subtractors) and $s$ (shifts) belong to $S$, such that the set is made of adders/subtractors, shifters, and $A$-operations. Here shift operations can also be shared across multiple points so that the output set is optimum. The $H_{cub}$ algorithm [8] is used to generate the corresponding DFG for the multiplier block implementing the parallel multiplications $c_1 * x, c_2 * x, \ldots, c_n * x$. The only operations used in the generated DAG and input design matrices are additions, subtractions, shifts, and negations. In this paper the performance of MCM based filter designs is further improved by combining this approach with retiming. The multiplierless filter circuit is further retimed to reduce the overall clock period, which increases the clock frequency.

Consider $l_1$ and $l_2$ as two integers which specify left shifts, let $r \geq 0$ specify a right shift, and let $s \in \{0, 1\}$ be the sign bit. An $A$-operation is an operation with two integer inputs $u$ and $v$ and one fundamental output, which is defined as

$$A_p(u, v) = \left| (u \ll l_1) + (-1)^s (v \ll l_2) \right| \gg r = \left| 2^{l_1} u + (-1)^s 2^{l_2} v \right| 2^{-r}, \tag{8}$$

where $\ll$ is a left binary shift, $\gg$ is a right binary shift, and $p = (l_1, l_2, r, s)$ is the parameter set, or the $A$-configuration, of $A_p$. To preserve all significant bits of the output, $2^r$ must divide $2^{l_1} u + (-1)^s 2^{l_2} v$. The left shifts are limited to the bit width of the target. All $A$-operations are used to build the $A$-graph. For a given set of target filter coefficients $C$ we can find the set $S$ such that a multiplierless digital filter is designed.
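As a small illustration of equation (8), the $A$-operation can be modeled in Python and used to reproduce the decomposition of Figure 5 ($7Y = 8Y - 1Y$, $31Y = 32Y - 1Y$, $45Y = 31Y + 14Y$). The function name `a_operation` and its keyword arguments are assumptions of this sketch; the input $Y$ is represented by the constant 1 so that each result is the realized coefficient.

```python
def a_operation(u, v, l1=0, l2=0, r=0, s=0):
    """A_p(u, v) = |(u << l1) + (-1)**s * (v << l2)| >> r, as in equation (8)."""
    total = (u << l1) + (-1) ** s * (v << l2)
    magnitude = abs(total)
    if magnitude % (1 << r) != 0:
        # Equation (8) requires 2**r to divide the unshifted result.
        raise ValueError("2**r must divide 2**l1 * u + (-1)**s * 2**l2 * v")
    return magnitude >> r


# Coefficients of Figure 5, built from the input Y (represented here by 1).
y = 1
o1 = a_operation(y, y, l1=3, s=1)   # 8Y - 1Y  = 7Y
o2 = a_operation(y, y, l1=5, s=1)   # 32Y - 1Y = 31Y
o3 = a_operation(o2, o1, l2=1)      # 31Y + (7Y << 1) = 31Y + 14Y = 45Y
```

Three adder/subtractor operations are enough for the three coefficients because $O_3$ reuses both $O_1$ and $O_2$; this sharing of intermediate results is where the area saving of MCM comes from.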

Figure 5: Example for addressing the MCM problem in digital filters. (a) MCM block with input $Y$ and outputs $k_1 Y, k_2 Y, \ldots, k_5 Y$. (b) Shift-and-add realization with shared intermediate results: $O_1 = 7Y = 8Y - 1Y$ (using $Y \ll 3$), $O_2 = 31Y = 32Y - 1Y$ (using $Y \ll 5$), and $O_3 = 45Y = 31Y + 14Y$ (using $7Y \ll 1$).

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the maximum iteration period in the filter design [12]. The critical path can be shortened by retiming, which reduces the clock period and so increases the clock speed. To reduce the critical path, we need to find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. While retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects. The designer can program an FPGA to work like specific hardware. FPGAs give a large speedup over general purpose processors for many long running algorithms. Hence, for high performance systems, FPGAs become a better choice. In the present work, path solvers are implemented on FPGA to increase the performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum delay path between the output node and the node causing the state change of the output node with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one of the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. The use of dedicated FPGA hardware can speed up this process with low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$ is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subset N$ and $O \subset N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. The procedure for computing the critical path on FPGA sorts the vertices such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to the previous one, then its path length is the sum of its own computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m_i}, \tag{9}$$

where $N$ is the number of adder and multiplier elements in the topologically sorted vertices connected with zero delay edges. The delay of the circuit is given by $t_d = \max\{\gamma_m\}$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment the steps below are used for critical path computation:

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows hold the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path and its computational time is the critical path delay.

A critical path solver algorithm is designed in the present work on FPGA. The state diagram for the implemented critical path solver is given in Figure 6. In state $S_0$, the filter graph or matrix is given as input to the critical path solver module.

Figure 6: Critical path solver state diagram.

Algorithm for computing the critical path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of a node and $d$ is the initial delay on an edge in $E$
Output: critical path $C$
Sort all the vertices of the DFG $G$ topologically, with $v$ following $u$ if there is a zero delay edge $u \rightarrow v$
for all vertices $v$ from the sorted list
    if every edge entering $v$ in $G$ has nonzero delay then
        $\gamma_v = t_v$
    else
        $\gamma_v = t_v + \max\{\gamma_u : \text{edge } e: u \rightarrow v \text{ in } G \text{ with } d_e = 0\}$
    end if
end for
$\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_m\}$, where $m$ = number of entries in the topologically sorted list
compute the critical path delay $\max(\gamma)$

Algorithm 2
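A software reference model of Algorithm 2 can be sketched in Python as below. It is not the FPGA state machine of Figure 6; the function name `critical_path`, the edge-list format `(u, v, delay)`, and the toy DFG with adder time 1 and multiplier time 2 are assumptions made for illustration.

```python
from collections import deque


def critical_path(times, edges):
    """times: {node: computation time}; edges: list of (u, v, delay).

    Returns (critical path delay, list of nodes on the critical path).
    """
    zero_succ = {v: [] for v in times}   # successors through zero-delay edges
    indegree = {v: 0 for v in times}
    for u, v, delay in edges:
        if delay == 0:
            zero_succ[u].append(v)
            indegree[v] += 1

    # Topological sort of the zero-delay subgraph (Kahn's algorithm).
    order = []
    queue = deque(v for v in times if indegree[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in zero_succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(times):
        raise ValueError("the DFG contains a zero-delay cycle")

    gamma = dict(times)                  # gamma[v] starts at t_v
    pred = {v: None for v in times}
    for u in order:                      # gamma[u] is final when u is visited
        for v in zero_succ[u]:
            if gamma[u] + times[v] > gamma[v]:
                gamma[v] = gamma[u] + times[v]
                pred[v] = u

    end = max(gamma, key=gamma.get)
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = pred[node]
    return gamma[end], path[::-1]


# Hypothetical DFG: nodes 1 and 2 are adders (time 1), 3 and 4 multipliers (time 2).
times = {1: 1, 2: 1, 3: 2, 4: 2}
edges = [(1, 3, 0), (3, 2, 0), (2, 4, 1), (4, 1, 1)]
delay, path = critical_path(times, edges)
```

For this toy graph the only zero-delay chain is $1 \rightarrow 3 \rightarrow 2$, so the critical path delay is $1 + 2 + 1 = 4$ time units.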

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say $-1$, can be used to denote an uninitialized matrix element. In state $S_1$, all the zero-weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called zero weight path. The zero weight path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column holds the destination node of the directed zero-weight edge. Simultaneously, a count of the number of zero-weight edges is kept.

The state $S_2$ is provided to enable looping action and for updating all the signals. In state $S_3$, in each row of the zero weight path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node in any zero-weight path is the same as the source node in another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then we make the destination node in $b$ the destination node of $a$. The state $S_4$ is provided to enable looping action and for updating all the signals. At the end of this state, the zero weight path matrix contains only those superset paths that are a superset of the remaining zero-weight paths. In state $S_5$, the module calculates the sum of all the node weights along each of these paths. State $S_6$ is provided for looping action and for updating all signals.

In state $S_7$, the path with the highest sum of node weights is found, which is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical path matrix. The signals in this matrix are output as the critical path. The state $S_8$ is provided to enable looping action and for updating all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve a system of inequalities. It is seen that the time needed to compute the critical path on FPGA is reasonably less than the computation time on a general purpose processor. This also reduces the retiming computation time. The zero delay paths computed for the 4th-order elliptic filter are shown in Figure 7. The highlighted path delay is for $1 \rightarrow 9 \rightarrow 5 \rightarrow 14 \rightarrow 8$, where nodes 1, 5, and 8 are adders and 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

Figure 7: Zero path delays and critical path for the 4th-order low pass elliptic filter (horizontal axis: vertex number in filter DFG; vertical axis: zero delay path for filter DFG).

Algorithm for computing the shortest path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of a node and $d$ is the initial delay on an edge in $E$
Output: all-pairs shortest path matrix $M$
for i = 1 to N
    for j = 1 to N
        if i = j then M[i, j] = 0
        else M[i, j] = inf
    end for
end for
for all the edges e: u → v
    M[u, v] = d for edge e
end for
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
            end if
        end for
    end for
end for
Output the shortest path matrix M

Algorithm 3

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$ and let $T(u, v)$ be the total computation time of the zero delay path from $u$ to $v$. We check the condition $T(u, v) - \min\{t(u), t(v)\} >$ derived clock period and select the paths that satisfy it for retiming, so that the computation time in these paths can be reduced. We have to retime the edges by constructing a system of linear inequalities. This can be done using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all-pairs shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden of the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$ and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight temp. In state $S_2$, the signal weight temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$ and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight temp. In state $S_4$, the elements of the signal matrix weight temp are copied to the output matrix.

Figure 8: Shortest path solver state diagram.

The state $S_5$ enables looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

Figure 9: General structure of the MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find the ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small and whose $A$-operations are composed of the minimum number of addition, subtraction, and shift operations. After this target set is obtained, multiplierless multiple constant multiplication filters can be designed with this target set. Multiple constant multiplication (MCM) is an efficient way of implementing several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtractors. By removing the redundancy between the coefficients, the number of adders and subtractors is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature, and the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and $A$-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work the $H_{cub}$ algorithm is used for $G_i$ computation. The input to the $H_{cub}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{ s \mid \operatorname{dist}(R, s) = 1 \} = A_s(R, R). \tag{11}$$

For the target set of constants $T$ of the considered filter graph $G$, the $H_{cub}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$. This can happen when all the targets are more than one $A$-operation away. The optimal part applies when $T \cap S \neq \emptyset$: there is then a target in the successor set and it can be synthesized. If all the targets are synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit;

(ii) cumulative benefit.

To build the heuristic we define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \tag{12}$$

A successor $s \in S$ needs to be picked which is closest to the target set to minimize the cost. This is possible if we can compute or estimate the $A$-distance. It is useful to also take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we use the weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left( \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t) \right), \tag{13}$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{cub}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \tag{14}$$
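The weighted benefit and the cumulative benefit selection can be sketched in Python as follows. This is only an illustration of the benefit and selection formulas: the real $H_{cub}$ algorithm relies on a proper $A$-distance estimator, whereas `toy_dist` below is a deliberately crude stand-in (0 if the target is already in the ready set, 1 if one shift-and-add/subtract of ready constants reaches it, 2 otherwise). All function names and the example constants are assumptions.

```python
def toy_dist(ready, t, max_shift=6):
    """Crude distance estimate, capped at 2 (an assumption of this sketch)."""
    if t in ready:
        return 0
    for u in ready:
        for v in ready:
            for shift in range(max_shift + 1):
                if t in ((u << shift) + v, abs((u << shift) - v)):
                    return 1
    return 2


def weighted_benefit(dist, ready, s, t):
    """Weighted benefit of adding successor s to the ready set for target t."""
    before = dist(ready, t)
    after = dist(ready | {s}, t)
    return 10.0 ** (-after) * (before - after)


def pick_successor(dist, ready, successors, targets):
    """Cumulative benefit heuristic: argmax over s of the summed benefits."""
    return max(
        sorted(successors),
        key=lambda s: sum(weighted_benefit(dist, ready, s, t) for t in targets),
    )


ready = {1}
targets = {35, 45}
successors = {5, 7, 31}
best = pick_successor(toy_dist, ready, successors, targets)
```

With this toy estimator, adding 5 brings both targets within one operation ($35 = 5 \cdot 8 - 5$ and $45 = 5 \cdot 8 + 5$), while 7 and 31 each help only one target, so the cumulative benefit favors 5.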

Here the cumulative benefit heuristic adds up the weighted benefit considering all the targets. With this method the target set is calculated. With this target set, a filter graph which is multiplierless and MCM based can be designed. It is found that multiplierless designs reduce the combinational path delays, owing to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combinational delay and the critical path without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{cub}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces the computation time when compared to previous methods [8, 16] to design multiplierless digital filters.

Figure 10: Multiplierless MCM based 4th-order elliptic filter.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also presents a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated using the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in the multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.
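The step in which the retimed edge weights are obtained from the original weights and the node retiming values can be sketched as below. The sketch assumes the standard retiming convention $w_r(e) = w(e) + r(v) - r(u)$ for an edge $e: u \rightarrow v$, which is not spelled out in the text above; the function name `apply_retiming`, the edge-list format, and the three-node loop are hypothetical.

```python
def apply_retiming(edges, r):
    """edges: list of (u, v, w) with w registers on edge u -> v; r: {node: value}.

    Returns the retimed edge list using w_r(e) = w(e) + r(v) - r(u).
    """
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            # A feasible retiming never produces a negative register count.
            raise ValueError(f"infeasible retiming on edge {u} -> {v}")
        retimed.append((u, v, w_r))
    return retimed


# Hypothetical 3-node loop with two registers in total.
edges = [(1, 2, 1), (2, 3, 0), (3, 1, 1)]
r = {1: 0, 2: 0, 3: 1}
retimed_edges = apply_retiming(edges, r)
```

In this example one register moves from edge $3 \rightarrow 1$ to edge $2 \rightarrow 3$, while the total number of registers around the loop stays at two, as retiming requires.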

Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of the 4th-order elliptic filter.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as a part of this research work. Initially the design of filters is performed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file which was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop-down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of choice. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are as below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage the partial products $P_i$ are generated, and they are added to obtain the final product $P$. In general, for an $m * n$ array multiplier we need $m * n$ AND gates, $n$ half adders, and $(m - 2) * n$ full adders.
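Equation (15) can be checked with a small behavioral model in Python. This is a bit-level sketch for unsigned operands, not a gate-level array multiplier; the function name `array_multiply` and the returned AND-gate count are assumptions added for illustration.

```python
def array_multiply(a, b, n):
    """Multiply two unsigned n-bit numbers by summing the terms A_i * B_j * 2**(i + j).

    Returns (product, number of AND operations used for the partial products).
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit numbers")
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    product = 0
    and_ops = 0
    for j in range(n):                   # one partial product row per bit of B
        partial = 0
        for i in range(n):
            partial |= (a_bits[i] & b_bits[j]) << (i + j)
            and_ops += 1
        product += partial               # rows are summed to form P
    return product, and_ops


result, and_ops = array_multiply(13, 11, 4)
```

For two 4-bit operands the model uses $4 * 4 = 16$ AND operations, in line with the $m * n$ AND gate count stated above.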

Radix 4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated using a conventional multiplier.
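The halving of the number of partial products can be illustrated with a behavioral Python sketch of radix-4 (modified) Booth recoding. It models only the arithmetic, not the encoder, sign extension, or multiplexer hardware of the design described above; the function names and the two's complement operand convention are assumptions.

```python
def booth_radix4_digits(y, n):
    """Recode a signed n-bit multiplier y (n even) into n/2 digits in {-2, ..., 2}.

    Digits are returned least significant first; digit k has weight 4**k.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    pattern = y & ((1 << n) - 1)         # two's complement bit pattern of y
    digits = []
    prev = 0                             # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        low = (pattern >> i) & 1
        high = (pattern >> (i + 1)) & 1
        digits.append(-2 * high + low + prev)   # scan the bit triple (high, low, prev)
        prev = high
    return digits


def booth_multiply(x, y, n):
    """Multiply x by the signed n-bit value y using n/2 partial products."""
    digits = booth_radix4_digits(y, n)
    return sum(d * x * (1 << (2 * k)) for k, d in enumerate(digits))


digits_11 = booth_radix4_digits(11, 8)      # 11 = -1 - 4 + 16
partial_products_32 = len(booth_radix4_digits(12345, 32))
```

Each digit selects one of $0, \pm x, \pm 2x$, so a 32-bit multiplier needs sixteen partial products, as stated above.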

VedicMultiplier It is used for faster multiplication operationsin higher order bits It has less combinational path delay[21] compared with others when the bit size is higherHowever it consumes more area than Booth multiplier andarray multiplier The multiplier is based on an algorithmUrdhva Tiryakbhyam (vertical amp crosswise) Sutra which isa general multiplication formula applicable to all cases ofmultiplication It means vertically and crosswise It is basedon a novel concept throughwhich the generation of all partialproducts can be done with the concurrent addition of thesepartial products The speed advantage is compromised withincreased power dissipation and area Due to its regularstructure layout of this can be easily generated

The multipliers are designed for different bit sizes and the results are compared in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, the Brent-Kung adder, and the Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and the area.

Ripple Adder. This is the basic adder type. An $n$-bit ripple adder is constructed by cascading $n$ full adder blocks in series: the carry-out of one stage is fed directly to the carry-in of the next stage.
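The cascade of full adders can be modelled bit by bit as below. This is a minimal behavioural sketch; the function names and the integer-based bit slicing are assumptions for illustration and do not reflect the generated RTL.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit operands by chaining n full adders; returns (sum, carry_out)."""
    carry = cin
    total = 0
    for i in range(n):                  # the carry ripples from bit 0 to bit n-1
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(ripple_carry_add(200, 100, 8))
```

Because each stage waits for the carry of the previous one, the delay of this structure grows linearly with the word length, which is the motivation for the parallel-prefix adders discussed next.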

Parallel-Prefix Adders. Parallel-prefix adders [22] offer a highly efficient solution to the binary addition problem. Among the parallel-prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. The Ling adder using a Kogge-Stone parallel-prefix structure also offers faster addition [22], but it consumes more power than the Brent-Kung adder.


Table 1: Comparison of multipliers for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
multiplier   32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Array        761     399     21       21      11      7        1519    375     91
Booth        861     2799    149      25      15      12       1277    317     77
Vedic        707     3902    244      28      18      12       2378    565     126

The basic equations used in parallel-prefix adders are given below. The equations for bit generate and propagate are

$$G_{0:0} = G_0 = c_{in}, \qquad P_{0:0} = P_0 = 0,$$
$$G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \qquad P_{i:j} = P_{i:k} \cdot P_{(k-1):j}. \quad (16)$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \quad (17)$$
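The group generate and propagate recurrences in (16) and the sum rule in (17) can be exercised with a small Kogge-Stone style prefix computation. The sketch below is an assumption-laden illustration: it indexes operand bits from 0 and folds the carry-in into the generate signal of bit 0, which differs slightly from the indexing in (16), and the function name is hypothetical. It is not the Brent-Kung or Ling structure synthesized in this work.

```python
def kogge_stone_add(a, b, n, cin=0):
    """n-bit parallel-prefix addition; returns (sum, carry_out)."""
    # Bit-level generate and propagate signals.
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)          # fold the carry-in into position 0
    dist = 1
    while dist < n:                     # log2(n) prefix levels
        G_new, P_new = G[:], P[:]
        for i in range(dist, n):
            # (G, P)_{i:j} = (G_{i:k} + P_{i:k} G_{k-1:j}, P_{i:k} P_{k-1:j})
            G_new[i] = G[i] | (P[i] & G[i - dist])
            P_new[i] = P[i] & P[i - dist]
        G, P = G_new, P_new
        dist *= 2
    carries = [cin] + G[:-1]            # carry into bit i is G_{(i-1):0}
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1]


if __name__ == "__main__":
    print(kogge_stone_add(200, 100, 8))
```

The prefix levels replace the linear carry chain of the ripple adder with a logarithmic number of combining stages, at the cost of more wiring and logic.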

Different adders are designed for different bit sizes and their VLSI design metrics are compared in Table 2. The reported delay is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area and on the bit size. For example, if the user chooses power as the design constraint, the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is provided for multiplierless filter design based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI is shown in Figure 12.

An Hcub based algorithm is considered for implementing the MCM blocks in multiplierless digital filters for the corresponding user-defined option in DiFiDOT. Since all the multipliers can be realised as a single block in transposed IIR and FIR filters, these filters are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders, subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of these components, and in our work this depth is assumed to be infinite. The tool DiFiDOT automatically generates directly synthesizable HDL for the retimed digital filter under consideration. With this tool and automation, even if the design cycle must be reiterated because of a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first presents the results of retiming with FPGA based path solvers, the second compares various retiming techniques, and the third presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on an FPGA is to speed up the computations needed for retiming. The inputs are passed to the FPGA based path solver block by a processor on which the retiming algorithm is implemented. The computations are performed in the FPGA based block; the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, including recursive and nonrecursive filter structures. The target device for the path solver implementation is the Spartan-6 family XC6SLX16. Simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.


Table 2: Comparison of adders for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
adder        32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Ling         8854    1524    2021     6       9       18       23      53      107
Brent-Kung   104     1839    2583     4       6       9        15      30      63
Ripple       1212    2063    376      2       7       14       9       18      36

Table 3: Device utilization and timing summary of path solvers.

Path solver            Device utilization summary            Timing summary                             Max frequency
name                   Logic utilization            Used     Min period   Setup time   Hold time       (Hz)
Critical path solver   Number of slices             5804     9068 ns      1572 ns      6141 ns         110277
                       Number of LUTs               10462
                       Number of slice flip-flops   3664
Shortest path solver   Number of slices             4147     14089 ns     10477 ns     4114 ns         70978
                       Number of LUTs               7511
                       Number of slice flip-flops   1496

Here various IIR and FIR filters are considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in much less time than the GPP based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the same algorithms run on a general purpose processor in the MATLAB environment. The time overhead for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain is substantial when compared with designs without FPGA based path solvers, and it speeds up the path computations that dominate retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Techniques. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the filter's iteration bound. Here three models are considered: Model 1 is the filter without retiming, built with adder, subtractor, multiplier, and delay elements; Model 2 is the retimed filter based on the clock period minimization algorithm; Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of orders 2 to 12; series: clock period and register count for Models 1, 2, and 3).

4.3. Area, Power, and Timing Results for Digital Filters before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results for the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is selected depending on whether the performance parameter is delay, power, or area and on the bit size. If the user does not want to use these built-in combinations, any of the available adders and multipliers can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

          Critical path solver algorithm            Shortest path solver algorithm
          IIR filter          FIR filter            IIR filter          FIR filter
Filter    FPGA      GPP       FPGA      GPP         FPGA      GPP       FPGA      GPP
order     (ns)      (ms)      (ns)      (ms)        (ns)      (ms)      (ns)      (ms)
2         460       138       906       1283        278       305       305       1280
4         1571      1578      1631      1446        368       1391      1391      1319
6         2998      1918      1923      1547        398       1542      1542      1731
8         3162      2190      2971      1642        452       2523      2523      3294
10        3981      2627      3653      1861        536       4293      4293      4534
12        4672      3142      4328      2352        671       5534      5534      5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

                                                 Before retiming                        After retiming
Filter   Adder/multiplier combination            LUTs   Max freq (MHz)   Power (mW)     LUTs   Max freq (MHz)   Power (mW)
IIR-10   Brent-Kung adder/array multiplier       2222   62526            99             2411   76977            89
         Ling adder/Vedic multiplier             2214   69702            112            2193   95381            94
         Ripple-carry adder/Booth multiplier     2146   50861            114            1809   65248            95
FIR-10   Brent-Kung adder/array multiplier       1736   62526            94             1811   9943             85
         Ling adder/Vedic multiplier             2162   72493            111            2271   10072            95
         Ripple-carry adder/Booth multiplier     1637   52302            105            1615   71345            87

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results for the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is the same before and after retiming. The frequency improvement seen for different filters with the above models is given in Figure 14. It is seen that the operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor (x-axis: filter type, FIR and IIR filters of orders 2 to 12; y-axis: frequency improvement; series: improvement from Model 1 to Model 2 and from Model 1 to Model 3).

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. Good quality ECG is utilized by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on an FPGA and tested with an ECG signal corrupted by power-line noise. The filter efficiently removes the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Table 6: Comparison of area, delay, and power for different models of various digital filters.

Filter   Adders, multipliers, flip-flops     Delay: max freq in MHz          Power in watts
block    Model 1    Model 2    Model 3       Model 1   Model 2   Model 3     Model 1   Model 2   Model 3
FIR-2    5,2,3      5,0,3      5,0,4         5154      19214     34062       0.056     0.063     0.065
FIR-4    10,3,5     11,0,5     11,0,8        5941      10841     22204       0.047     0.057     0.060
FIR-6    7,2,7      17,0,7     17,0,14       6291      6764      25947       0.051     0.062     0.064
FIR-8    15,5,9     22,0,9     22,0,16       5482      6592      11791       0.054     0.058     0.065
FIR-10   18,6,11    25,0,11    25,0,11       4822      5637      10072       0.058     0.061     0.063
FIR-12   20,7,13    29,0,13    29,0,13       4634      5486      19340       0.060     0.063     0.067
IIR-2    9,4,3      11,0,3     11,0,3        5503      7553      8910        0.047     0.050     0.050
IIR-4    16,7,5     20,0,5     19,0,6        2278      11388     15165       0.059     0.062     0.063
IIR-6    24,11,7    35,0,7     35,0,8        3871      4254      53142       0.051     0.059     0.058
IIR-8    30,11,10   33,0,10    33,0,7        2946      7014      11021       0.044     0.064     0.081
IIR-10   37,16,13   54,0,13    54,0,14       3643      4885      95381       0.051     0.067     0.085
IIR-12   42,20,17   63,0,17    63,0,19       3973      5074      10152       0.063     0.071     0.088

Figure 15: Structure of ECG block for power noise removal. (a) Block diagram. (b) Filter block expanded.
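As a rough illustration of how a digital filter removes power-line interference, the sketch below applies a second-order FIR notch whose zeros sit at the interference frequency. This is not the retimed MCM filter used in this work: the filter structure, the function name, and the 500 Hz sampling rate in the example are assumptions chosen only to show the principle, and the single coefficient would still have to be mapped to shifts and adds in a multiplierless design.

```python
import math


def notch_fir(samples, f_noise, f_sample):
    """y[n] = x[n] - 2*cos(w0)*x[n-1] + x[n-2], with zeros at +/- f_noise."""
    w0 = 2.0 * math.pi * f_noise / f_sample
    c = 2.0 * math.cos(w0)
    out = []
    x1 = x2 = 0.0                       # delayed input samples
    for x in samples:
        out.append(x - c * x1 + x2)
        x2, x1 = x1, x
    return out


if __name__ == "__main__":
    fs = 500.0                          # assumed sampling rate in Hz
    hum = [math.sin(2.0 * math.pi * 50.0 * n / fs) for n in range(200)]
    filtered = notch_fir(hum, 50.0, fs)
    # After the two-sample start-up transient the 50 Hz tone is cancelled.
    print(max(abs(v) for v in filtered[2:]))
```

A practical ECG filter would also need to preserve the gain of the signal band, which this minimal notch does not attempt to do.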

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing with the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder, subtractor, multiplier, and delay elements. The tool directly produces synthesizable filter RTL, which saves much of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and shortest path computations in retiming, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain compared with filter designs in which the path solver algorithms run as part of the retiming algorithm on the processor. With this flow, a designer can find the synthesizable digital filter RTL that best fits an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


period that can give us a feasible solution, we stop and find the minimal clock period by solving for the solution's critical path. The solution contains the retiming values for all nodes. Here a 4th-order low-pass IIR elliptic filter is designed and retimed using the above-mentioned clock period minimization algorithm. The retimed data flow graph with reduced clock period is shown in Figure 2.

After applying the retiming transformation to the filter, the critical path changes to 2 → 1 → 9.

Since each delay element occupies about one-third of the area of a binary adder, it is important to reduce the number of delay elements [11]. With register minimization retiming, we obtain the digital filter that uses the minimum number of registers while satisfying the clock period constraints [15]. Here forward splitting, or register sharing [12], is used: if a node has several output edges carrying the same signal, the number of registers required to implement these edges is the maximum number of registers on any one of the edges. Consider Figure 3. The number of registers required in Figure 3(a) is 6, whereas after register sharing this is reduced to 3, as shown in Figure 3(b).
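The saving from register sharing can be checked with a two-line calculation. In the sketch below the fan-out edge delays 2, 3, and 1 are an assumption chosen to reproduce the totals of 6 and 3 quoted for Figure 3, and the function names are hypothetical.

```python
def registers_without_sharing(fanout_delays):
    """Each output edge carries its own registers."""
    return sum(fanout_delays)


def registers_with_sharing(fanout_delays):
    """Edges carrying the same signal share one register chain."""
    return max(fanout_delays, default=0)


if __name__ == "__main__":
    delays = [2, 3, 1]                  # assumed delays on the three output edges
    print(registers_without_sharing(delays), registers_with_sharing(delays))
```

The shared chain is as long as the most delayed edge, and the shorter edges simply tap it at an earlier stage.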

The number of registers $R_V$ needed to implement the output edges $e$ of a node $V$ in the retimed graph, with retimed edge weights $W_r$, and the total cost are

$$R_V = \max_e \big(W_r(e)\big), \qquad \text{Cost} = \sum R_V. \quad (4)$$

The cost is minimized subject to

(i) fan-out constraints: $R_V \ge W_r(e)$ for all $V$ and all edges $V \xrightarrow{e}$ any other vertex;

(ii) feasibility constraints: $r(U) - r(V) \ge W(e)$ for every edge $U \xrightarrow{e} V$;

(iii) clock period constraints: $r(U) - r(V) \ge W(U,V) - 1$ for all vertices such that $D(U,V) \ge c$, where $c$ is the clock period.

This method makes use of gadgets to represent nodes with multiple output edges. The register minimization retiming can be modeled as a linear programming problem. A dummy node with zero computation time is introduced in the gadget. The weight of the gadget edge $\hat{e}_i$ is defined to be $W(\hat{e}_i) = W_{max} - W(e_i)$, where $W_{max} = \max(W(e_i))$ for $1 \le i \le k$ and $k$ is the number of edges available. A parameter $\beta$, the breadth, is also associated with each edge to model the memory required by edge $e_i$; the breadth of each edge is the inverse of $k$. A binary search is performed over the clock period. The register minimization retiming values can be obtained as follows:

(i) Use the gadget model of the graph to compute the cost function.

(ii) Calculate $S'$ using the Floyd-Warshall shortest path algorithm.

(iii) Compute the $D(U,V)$ and $W(U,V)$ matrices from the original graph and the $S'$ matrix.

(iv) Perform the LP formulation such that the cost function is minimized subject to the feasibility and clock period constraints.

This LP problem is solved to obtain the retiming solution that minimizes the number of registers while satisfying the clock period. Figure 4 shows the DFG of the 4th-order low-pass IIR elliptic filter. It is observed that register minimization retiming yields a filter with reduced register count and reduced clock period. However, in some cases the clock period reduction is smaller than that achieved by clock period minimization retiming, because priority is given to the register count. For the considered elliptic filter, with a clock period of 4 units the register count is minimized to 9. After applying the register minimization retiming transformation to the filter, the critical path changes to 1 → 9 → 5.

Problem Formulation. Critical path and shortest path solving contribute most of the computation time in retiming.

Definition 1 (the path solver problem). Let $S = \{s_0, s_1, s_2, s_3, \ldots, s_k\}$, where $k$ is the maximum number of feasible solutions available for retiming of a considered filter DFG. During retiming of digital filters in high-level synthesis, the shortest path between the nodes must be computed $(k+1)$ times, where $k$, the number of feasible solutions available for the DFG, is the number of unique entries in the path delay matrix $D$. Similarly, the critical path must be computed $(k+1)$ times. General purpose processors (GPPs), on which the retiming algorithm is implemented, are fully programmable but are less efficient in terms of power and performance. Hence the problem is to improve the performance and power of retiming using FPGA based path solvers. Further, along with retiming, a high-level transformation technique called automatic pipelining is applied to improve the filter speed.
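The role of the unique entries of $D$ as candidate clock periods can be sketched as a binary search. The code below is an illustration under assumptions: `is_feasible` is a hypothetical callback standing in for the path solver based feasibility test, it is assumed to be monotone (if a period is feasible, every larger candidate is too), and the example matrix is made up.

```python
def min_feasible_period(D, is_feasible):
    """Binary search over the sorted unique entries of the path delay matrix D."""
    candidates = sorted({d for row in D for d in row})
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_feasible(candidates[mid]):    # each test triggers one path solver run
            best = candidates[mid]
            hi = mid - 1                    # try a smaller clock period
        else:
            lo = mid + 1
    return best


if __name__ == "__main__":
    D = [[1, 3, 4], [3, 2, 6], [4, 6, 1]]   # made-up path delay matrix
    print(min_feasible_period(D, lambda c: c >= 3))
```

Each feasibility test requires a shortest path computation, which is why the speed of the path solver dominates the overall retiming time.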

Definition 2 (multiple constant multiplication in digital filters). For the considered set of filter coefficient constants $T$ in the retimed filters, find the set of multiplierless operations $\{O_1, O_2, O_3, \ldots, O_n\}$ with the minimum number of addition, subtraction, and shift operations using a multiple constant multiplier architecture, to optimize the filter architecture further.

Definition 3 (optimization and automation of filter HDL). An environment needs to be developed to obtain the HDL of retimed filters, in which the user can choose different data path element architectures depending on the specifications. This reduces time to market and helps to evaluate many hardware implementation trade-offs. Filter equivalence checking after applying the high-level transformation also needs to be performed and must be developed as part of the optimization environment.

Figure 2: 4th-order elliptic filter after clock period minimization retiming. (a) DFG after retiming. (b) Clock period and register count before and after retiming.

Figure 3: (a) Graph before register sharing. (b) Graph after register sharing.

Principle of Shortest Path and MCM Algorithm. Several FPGA synthesis algorithms have been proposed specifically for sequential circuits. In [16] the authors proposed how to map retimed circuits onto FPGAs efficiently. In this paper, however, we suggest a method for an efficient retiming process using FPGA based path solvers. This can be applied to any retiming technique available in the literature. The shortest path in the filter DFG is solved using the Floyd-Warshall algorithm, which uses dynamic programming to solve the shortest-paths problem on a DFG in $O(n^3)$ time, where $n$ is the number of nodes in the DFG. Let the vertices in the graph be numbered $1, 2, \ldots, n$, and let $d_{ij}(k)$ denote the weight of the shortest path $p$ from vertex $i$ to vertex $j$ such that all intermediate vertices are contained in the subset $\{1, 2, \ldots, k\}$ of these $n$ vertices. If $k$ lies on $p$, the path is decomposed into $i \to k \to j$. Then there are two situations possible:

(i) $k$ is an intermediate vertex on the shortest path;

(ii) $k$ is not an intermediate vertex on the shortest path.

If the vertex $k$ is not an intermediate vertex on $p$, then

$$d_{ij}(k) = d_{ij}(k-1); \quad \text{else} \quad d_{ij}(k) = d_{ik}(k-1) + d_{kj}(k-1). \quad (5)$$

In either case the subpaths contain only nodes from $\{1, 2, \ldots, k-1\}$. Therefore

$$d_{ij}(k) = \min\{d_{ij}(k-1),\; d_{ik}(k-1) + d_{kj}(k-1)\}. \quad (6)$$

When $k = 0$,

$$d_{ij}(0) = W_{ij}, \quad \text{and if } k > 0 \text{ then } d_{ij}(k) = \min\{d_{ij}(k-1),\; d_{ik}(k-1) + d_{kj}(k-1)\}. \quad (7)$$

Let $D$ be the incidence matrix, initialized with the graph edge weight information $W$. $D$ is then updated with the calculated shortest paths; see Algorithm 1.

The final $D$ matrix stores all the shortest paths. This algorithm is extended for retiming of digital filters.
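The Floyd-Warshall recurrence described above can be written as a short software reference model. This is a plain Python sketch with zero-based vertex indices and `math.inf` marking missing edges, both of which are conventions assumed for the example; it is not the FPGA path solver architecture.

```python
import math


def floyd_warshall(W):
    """All-pairs shortest path weights for an n x n edge-weight matrix W."""
    n = len(W)
    D = [row[:] for row in W]           # D starts as the edge weights
    for k in range(n):                  # allow vertex k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                via_k = D[i][k] + D[k][j]
                if via_k < D[i][j]:
                    D[i][j] = via_k
    return D


if __name__ == "__main__":
    INF = math.inf
    W = [[0, 3, INF],
         [INF, 0, 1],
         [2, INF, 0]]
    print(floyd_warshall(W))
```

The three nested loops give the cubic running time, and the inner comparison is the minimum taken between the path that avoids vertex k and the path that passes through it.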

The multiple constant multiplication (MCM) problem is addressed in the literature [14] using either graph based methods or the common subexpression elimination method. In the common subexpression elimination algorithm, all possible subexpressions are extracted for a variable. But this is possible only if the constant is represented in minimum signed digit or canonical signed digit form. A subexpression is then found such that it can be shared by multiple constant multiplication values. In this paper the above two concepts are extended for automatic pipelining and retiming of digital filters in high


Figure 4: 4th-order elliptic filter after register minimization retiming. (a) DFG after retiming (Adders 1 to 8, Multipliers 9 to 17); (b) clock period in time units and register count before and after retiming.

n = number of rows in W; $D^{(0)} = W$
for (k = 1 to n)
    for (i = 1 to n)
        for (j = 1 to n)
            $d^{(k)}_{ij} = \min\{d^{(k-1)}_{ij},\; d^{(k-1)}_{ik} + d^{(k-1)}_{kj}\}$
        end for
    end for
end for
return $D^{(n)}$

Algorithm 1

level synthesis. In all digital filters, the filter coefficients are known beforehand. Hence the full flexibility of a multiplier is not necessary and we can make use of MCM designs. This method is more efficient than shift-and-add multiplication because intermediate results can be shared, which reduces the area of the multiplierless implementation of digital filters. The sharing of intermediate results provides greater potential area savings as the filter order increases (Figure 5).
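As a small illustration of intermediate-result sharing, the constants of Figure 5 (7, 31, and 45) can be produced from one input with three adders/subtractors and a few shifts. The sketch below is written in Python only to check the arithmetic; the function name `mcm_block` is an assumption for this example.

```python
def mcm_block(y):
    """Shift-and-add multiplier block for the constants 7, 31 and 45."""
    o1 = (y << 3) - y      # 7y  = 8y - y
    o2 = (y << 5) - y      # 31y = 32y - y
    t14 = o1 << 1          # 14y reuses 7y and costs only a shift
    o3 = o2 + t14          # 45y = 31y + 14y
    return o1, o2, o3
```

Computing 45y on its own would need its own shift-and-add chain; here it reuses both 7y and 31y, so the whole block needs only three adders/subtractors.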

Consider the filter coefficient set to be used for the filter design, given by $T = \{c_1, c_2, c_3, \ldots, c_n\}$. We need to find the smallest set $S$ given by $\{a_1, a_2, a_3, \ldots, s_1, s_2, s_3, \ldots\}$, where $a$ (adders/subtractors) and $s$ (shifts) $\in S$, such that the set is made of adders/subtracters, shifters, and $A$-operations. Here shift operations can also be shared across multiple points so that the output set is optimum. The $H_{cub}$ algorithm [8] is used to generate the corresponding DFG for the multiplier block implementing the parallel multiplications $c_1 * x, c_2 * x, \ldots, c_n * x$. The only operations used in the generated DAG and input design matrices are additions, subtractions, shifts, and negations. In this paper the performance of MCM based filter designs is further improved by combining this approach with retiming. The multiplierless filter circuit is further retimed to reduce the overall clock period, which increases the clock frequency.

Consider $l_1$ and $l_2$ as two integers which specify left shifts, let $r \ge 0$ specify a right shift, and let $s \in \{0, 1\}$ be the sign bit. An $A$-operation is an operation with two integer inputs $u$ and $v$ and one fundamental output, which is defined as

$$A_p(u, v) = \left| (u \ll l_1) + (-1)^s (v \ll l_2) \right| \gg r = \left| 2^{l_1} u + (-1)^s 2^{l_2} v \right| 2^{-r}, \tag{8}$$

where $\ll$ is a left binary shift, $\gg$ is a right binary shift, and $p = (l_1, l_2, r, s)$ is the parameter set or the $A$-configuration of $A_p$. To preserve all significant bits of the output, $2^r$ must divide $2^{l_1} u + (-1)^s 2^{l_2} v$. The left shifts are limited to the bit width of the target. All $A$-operations are used to build the $A$-graph. For a given set of target filter coefficients $C$, we can find a set $S$ such that a multiplierless digital filter is designed.
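Equation (8) translates directly into code. The following Python sketch is illustrative only; the function name `a_operation` and the choice to raise an error when $2^r$ does not divide the sum are assumptions of this example.

```python
def a_operation(u, v, l1, l2, r, s):
    """Evaluate A_p(u, v) = |(u << l1) + (-1)^s (v << l2)| >> r."""
    if l1 < 0 or l2 < 0 or r < 0 or s not in (0, 1):
        raise ValueError("shifts must be non-negative and s must be 0 or 1")
    total = (u << l1) + (-1) ** s * (v << l2)
    if total % (1 << r) != 0:
        # Shifting right would discard significant bits of the output.
        raise ValueError("2**r must divide 2**l1 * u + (-1)**s * 2**l2 * v")
    return abs(total) >> r
```

For instance, `a_operation(1, 1, 3, 0, 0, 1)` forms $8 - 1 = 7$, and `a_operation(7, 31, 1, 0, 0, 0)` forms $14 + 31 = 45$.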


Figure 5: Example for addressing MCM problem in digital filters. (a) MCM block producing $k_1 Y, k_2 Y, \ldots, k_5 Y$ from the input $Y$; (b) shared shift-and-add realization with $O_1 = 7Y = 8Y - 1Y$ (using $\ll 3$), $O_2 = 31Y = 32Y - 1Y$ (using $\ll 5$), $14Y$ (using $\ll 1$), and $O_3 = 45Y = 31Y + 14Y$.

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the maximum iteration period in the filter design [12]. This can be reduced by retiming, where the clock period is reduced and the clock speed increases. To reduce the critical path, we need to find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. While retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects. A designer can program an FPGA to work like specific hardware. FPGAs give great speedup over general purpose processors for many long running algorithms. Hence for high performance systems FPGAs become a better choice. In the present work, path solvers are implemented on FPGA to increase the performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum delay path between the output node and the node causing the state change of the output node with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one of the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. Dedicated FPGA hardware can speed up this process with low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$ is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subseteq N$ and $O \subseteq N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. In this procedure of computing the critical path on FPGA, the vertices are sorted such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to the previous one, then its path length is the sum of its own computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m_i}, \tag{9}$$

where $N$ is the number of adder and multiplier elements in the topologically sorted vertices connected with zero delay edges. The delay of the circuit is given by $t_d = \max\{\gamma_m\}$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment, the steps below are used for critical path computation.

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows hold the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path and its computational time is the critical path delay.
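Steps (ii) to (v) can be sketched in software as follows. This is a minimal Python illustration rather than the HDL state machine of this work: the function name `critical_path`, the edge-list input format, and the 4-node example graph are assumptions, and the sketch assumes that the zero-weight edges contain no cycle.

```python
def critical_path(edges, t):
    """Find the zero-weight path with the greatest total computation time.

    edges: list of (source, destination, delay) tuples.
    t:     dict mapping each node to its computation time.
    """
    # Step (ii): collect the zero-weight edges as a successor table.
    zero_succ = {}
    for u, v, w in edges:
        if w == 0:
            zero_succ.setdefault(u, []).append(v)

    # Step (iii): join zero-weight edges into complete zero-weight paths.
    paths = []

    def extend(path):
        successors = zero_succ.get(path[-1], [])
        if not successors:
            paths.append(path)           # path cannot be extended further
            return
        for v in successors:
            extend(path + [v])

    for node in t:                       # a single node is also a valid path
        extend([node])

    # Steps (iv) and (v): compute each path time and keep the greatest.
    best = max(paths, key=lambda p: sum(t[n] for n in p))
    return best, sum(t[n] for n in best)

# Hypothetical DFG: adders 1 and 2 take 1 time unit, multipliers 3 and 4 take 2.
edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
t = {1: 1, 2: 1, 3: 2, 4: 2}
path, delay = critical_path(edges, t)
```

In this example two zero-weight paths tie at 3 time units (3 to 2 and 4 to 2), and the sketch reports the first one it finds.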

A critical path solver algorithm is designed in the present work on FPGA. The state diagram for the implemented critical path solver is given in Figure 6. In $S_0$ the filter graph or matrix is given as input to the critical path solver module.


Figure 6: Critical path solver state diagram (states $S_0$ to $S_8$; transition labels: incidence and node matrix obtained, zero-weight path updated, superset of zero-weight path found, weight temp updated, all zero path delay found, next greater path delay found, greatest path delay found, all path delay found, loop indices updated).

Algorithm for computing the critical path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of a node and $d$ is the initial delay on an edge of $E$
Output: critical path $C$
Sort all the vertices topologically in the DFG $G$, with $v$ following $u$ if there is a zero delay edge $u \to v$
for all vertices $v$ from the sorted list
    if every edge entering $v$ in $G$ has nonzero delay then
        $\gamma_v = t_v$
    else
        $\gamma_v = t_v + \max\{\gamma_u\}$ over the edges $e: u \to v$ in $G$ with $d_e = 0$
    end if
end for
$\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_m\}$, where $m$ = number of entries in the topologically sorted list
compute $t_d = \max\{\gamma_1, \gamma_2, \ldots, \gamma_m\}$

Algorithm 2
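A compact software rendering of Algorithm 2 is given below as a sketch. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) for the topological sort; the function name `critical_path_delay` and the example graph are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

def critical_path_delay(edges, t):
    """Return (t_d, gamma), where gamma[v] is the longest zero-delay path ending at v.

    edges: list of (source, destination, delay) tuples.
    t:     dict mapping each node to its computation time.
    """
    # Predecessors of each node through zero-delay edges only.
    zero_pred = {v: set() for v in t}
    for u, v, d in edges:
        if d == 0:
            zero_pred[v].add(u)

    gamma = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(zero_pred).static_order():
        gamma[v] = t[v] + max((gamma[u] for u in zero_pred[v]), default=0)
    return max(gamma.values()), gamma

# Hypothetical DFG: adders 1 and 2 take 1 time unit, multipliers 3 and 4 take 2.
edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
t = {1: 1, 2: 1, 3: 2, 4: 2}
t_d, gamma = critical_path_delay(edges, t)
```

Because each node is visited once and each zero-delay edge is examined once, this form runs in time linear in the size of the graph, in contrast to explicit path enumeration.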

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say $-1$, can be used to denote an uninitialized matrix element. In state $S_1$ all the zero weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called zero_weight_path. The zero_weight_path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column has the destination node of the directed zero-weight edge. Simultaneously, a count of the number of zero-weight edges is kept.

The state $S_2$ is provided to enable looping action and for updating all the signals. In state $S_3$, in each row of the zero_weight_path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node in any zero-weight path is the same as the source node in another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then we make the destination node in $b$ the destination node of $a$. The state $S_4$ is provided to enable looping action and for updating all the signals. At the end of this state the zero_weight_path matrix will contain only those paths that are supersets of the remaining zero weight paths. In state $S_5$ the module calculates the sum of all the node weights along each of these paths. State $S_6$ is provided for looping action and for updating all signals.

In state $S_7$ the path with the highest sum of node weights is found, which is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical


Algorithm for computing the shortest path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of a node and $d$ is the initial delay on an edge of $E$
Output: all pair shortest path matrix $M$
for i = 1 to N
    for j = 1 to N
        if i = j then
            M[i, j] = (0, 0)
        else M[i, j] = inf
    end for
end for
for all the edges $e: u \to v$, $M[u, v] = d$ for edge $e$
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j]
                M[i, j] = M[i, k] + M[k, j]
        end for
    end for
end for
Output shortest path matrix $M$

Algorithm 3

Figure 7: Zero path delays and critical path for 4th-order low pass elliptic filter (horizontal axis: vertex number in filter DFG; vertical axis: zero delay path for filter DFG).

path matrix. The signals in this matrix are output as the critical path. The state $S_8$ is provided to enable looping action and for updating all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis we need the shortest path to solve the system of inequalities. It is seen that the time needed to compute the critical path on FPGA is reasonably less when compared to computation on a general purpose processor. This also reduces the retiming computation time. The zero delay paths computed for the 4th-order elliptic filter are shown in Figure 7. The highlighted path delay is from $1 \to 9 \to 5 \to 14 \to 8$, where nodes 1, 5, 8 are adders and 9, 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$ and let $T(u, v)$ be the total computation time of the zero delay path from $u$ to $v$. We check the condition $T(u, v) - \min\{t(u), t(v)\} >$ derived clock period and select the paths that satisfy it for retiming, so that the computation time in these paths can be reduced. We have to retime the edges by constructing a system of linear inequalities. This can be solved using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).
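How a shortest path solver settles a system of linear inequalities can be illustrated with a short Python sketch. It follows the standard constraint-graph construction, which is assumed here rather than taken from this paper: each inequality $r_i - r_j \le k$ becomes an edge $j \to i$ of weight $k$, an extra source node is connected to every node with weight 0, and the shortest distances from the source give one feasible solution. The function name and the example constraints are hypothetical.

```python
import math

def solve_difference_constraints(n, constraints):
    """Solve r[i] - r[j] <= k for nodes 0..n-1 using Floyd-Warshall.

    constraints: list of (i, j, k) tuples, each meaning r[i] - r[j] <= k.
    Returns a list r, or None if the system has no solution.
    """
    size = n + 1
    src = n                                   # extra source node
    D = [[math.inf] * size for _ in range(size)]
    for i in range(size):
        D[i][i] = 0
    for i, j, k in constraints:
        D[j][i] = min(D[j][i], k)             # edge j -> i with weight k
    for i in range(n):
        D[src][i] = 0                         # source reaches every node at cost 0

    for k in range(size):
        for i in range(size):
            for j in range(size):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]

    if any(D[i][i] < 0 for i in range(size)):
        return None                           # negative cycle: infeasible system
    return [D[src][i] for i in range(n)]

# Hypothetical system of five inequalities on four retiming values.
constraints = [(0, 1, 0), (2, 0, 5), (3, 0, 4), (3, 2, -1), (2, 1, 2)]
r = solve_difference_constraints(4, constraints)
```

A negative cycle in the constraint graph means that the inequalities contradict each other, which in retiming terms means that the requested clock period cannot be achieved.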

The Floyd-Warshall all pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden of the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$ and then transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight_temp. In state $S_2$ the signal weight_temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. Transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$ and then transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight_temp. In the state $S_4$ the elements of the signal matrix weight_temp are copied to the output matrix.


Figure 8: Shortest path solver state diagram (transition labels: incidence matrix copied to weight temp, weight temp updated, pairwise shortest path found, all pair shortest paths found, weight temp copied to output matrix, loop indices updated).

The state $S_5$ enables looping action for $S_4$. Transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find the ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small and whose $A$-operations are composed of a minimum number of addition, subtraction, and shift operations. After this set is obtained, multiplierless multiple constant multiplication filters can be designed with it. Multiple constant multiplication (MCM) is an efficient way of implementing


Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature, and the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and $A$-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work the $H_{cub}$ algorithm is used for $G_i$ computation. The input to the $H_{cub}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants of distance 1 from $R$, is given as

$$S = \{s \mid \operatorname{dist}(R, s) = 1\} = A_s(R, R). \tag{11}$$

For the target set of constants $T$ for the considered filter graph $G$, the $H_{cub}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$. This can happen when all the targets are more than one $A$-operation away. The optimal part applies when $T \cap S \neq \emptyset$: there is then a target in the successor set and it can be synthesized directly. If every target is synthesized in this way, the solution is optimal. In the heuristic part the computation can be done in two ways:

(i) maximum benefit,
(ii) cumulative benefit.

To build the heuristic we can define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \tag{12}$$

A successor $s \in S$ needs to be picked which is closest to the target set to minimize the cost. This is possible if we can compute or estimate the $A$-distance. It is useful to also take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we can use a weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left(\operatorname{dist}(R, t) - \operatorname{dist}(R + s, t)\right), \tag{13}$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{cub}(R, S, T) = \arg\max_{s \in S} \left[\sum_{t \in T} B_b(R, s, t)\right]. \tag{14}$$
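The selection rule of (14) can be illustrated with a small Python sketch. It is not the $H_{cub}$ implementation of [8]: the successor set is built with bounded shifts and odd normalization, and the distance function is a deliberately crude stand-in (0 if the target is ready, 1 if it is a successor, otherwise 2), both of which are assumptions made only so that the example is self-contained. The weight is taken as $10^{-\operatorname{dist}(R + s, t)}$, so that it decays with distance.

```python
def odd_part(x):
    """Strip factors of two, since shifts are free in a multiplier block."""
    while x and x % 2 == 0:
        x //= 2
    return x

def successors(R, max_shift=6, limit=256):
    """Odd constants reachable from R with one A-operation (bounded shifts)."""
    S = set()
    for u in R:
        for v in R:
            for l in range(max_shift + 1):
                for sign in (1, -1):
                    for c in (abs((u << l) + sign * v), abs(u + sign * (v << l))):
                        c = odd_part(c)
                        if 0 < c <= limit and c not in R:
                            S.add(c)
    return S

def dist(R, t):
    """Crude distance estimate (an assumption of this sketch)."""
    if t in R:
        return 0
    if t in successors(R):
        return 1
    return 2

def weighted_benefit(R, s, t):
    d_new = dist(R | {s}, t)
    return 10.0 ** (-d_new) * (dist(R, t) - d_new)

def hcub_pick(R, T):
    """Pick the successor with the greatest cumulative weighted benefit."""
    S = successors(R)
    # sorted() makes ties resolve to the smallest successor.
    return max(sorted(S), key=lambda s: sum(weighted_benefit(R, s, t) for t in T))

choice = hcub_pick({1}, {45})
```

With the ready set $\{1\}$ and the single target 45, no successor equals the target, so the heuristic picks a successor that brings 45 within one operation; 3 qualifies because $45 = (3 \ll 4) - 3$.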

Here the cumulative benefit heuristic adds up the weighted benefit considering all the targets. With this particular method the target set is calculated. With this target set, a filter graph which is multiplierless and MCM based can be designed. It is found that multiplierless designs reduce the combinational path delays due to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combinational delay and critical path


Figure 10: Multiplierless MCM based 4th-order elliptic filter.

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{cub}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces computation time when compared to previous methods [8, 16] to design multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph can give us a set of inequalities called the critical path constraints. The original DFG also presents a set of equalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying a Floyd-Warshall shortest path algorithm to the constraint graph. The weight for each edge in the retimed DFG can be calculated using the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354 which is a significant increase. However, with this technique the number of registers increases from 9 to 11.

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.
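The last step of the retiming procedure, computing the retimed edge weights from the retiming values, can be sketched as follows. The sketch assumes the commonly used convention $w_r(e) = w(e) + r(v) - r(u)$ for an edge $e: u \to v$; the function name, the example graph, and the retiming values are hypothetical.

```python
def retime(edges, r):
    """Apply retiming values r to a DFG given as (u, v, delay) tuples."""
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]          # assumed convention: w_r = w + r(v) - r(u)
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would get a negative delay count")
        retimed.append((u, v, w_r))
    return retimed

# Hypothetical DFG and retiming values.
edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
r = {1: 0, 2: 1, 3: 0, 4: 0}
retimed = retime(edges, r)
```

In this example the zero-delay edges into node 2 each receive a register, so node 2 no longer sits on a combinational path behind nodes 3 and 4, while the delay count around every loop is unchanged. The total register count grows from 4 to 5, which mirrors the trade-off between clock period and register count noted above.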

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as part of this research work. Initially the design of filters is performed using a retimed architecture, where the user can choose either clock period


Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of 4th-order elliptic filter.

minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for the same. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file which was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy to use GUI helps the designer to optimize and generate digital filter RTL with the chosen adders and multipliers. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are as below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage the partial products $P_i$ are generated, which are added to obtain the final product $P$. In general, for an $m \times n$ array multiplier we need $m \cdot n$ AND gates, $n$ half adders, and $(m - 2) \cdot n$ full adders.
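The double sum in (15) can be checked with a short bit-level Python sketch. It models only the arithmetic (one AND per bit pair, weighted by $2^{i+j}$), not the adder array itself; the function name `array_multiply` is an assumption of this example.

```python
def array_multiply(a, b, n):
    """Multiply two unsigned n-bit numbers by summing A_i * B_j * 2^(i+j)."""
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            product += (a_i & b_j) << (i + j)   # one AND gate per (i, j) pair
    return product
```

The inner body runs $n \cdot n$ times, matching the AND gate count stated above for $m = n$.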

Radix 4 Booth Multiplier. It has the advantage of lesser area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated using a conventional multiplier.
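The halving of the partial product count follows from the radix-4 recoding itself, which the Python sketch below illustrates. It models the recoding arithmetic only, not the encoder, multiplexer, and sign extension hardware; the function names are assumptions of this example, and the multiplier is treated as an $n$-bit two's complement number with $n$ even.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's complement multiplier into n/2 digits in {-2, ..., 2}."""
    b &= (1 << n) - 1                 # work on the raw n-bit pattern
    digits = []
    prev = 0                          # implicit bit b[-1] = 0
    for i in range(0, n, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)   # overlapping 3-bit window
        prev = b1
    return digits

def booth_multiply(a, b, n):
    """Sum one partial product per radix-4 digit."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_radix4_digits(b, n)))
```

For $n = 32$ the recoding yields 16 digits, hence 16 partial products, each of which is $0$, $\pm a$, or $\pm 2a$ and therefore needs only a shift and an optional negation.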

Vedic Multiplier. It is used for faster multiplication operations with higher order bits. It has less combinational path delay [21] compared with the others when the bit size is higher. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertically and crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done with the concurrent addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage. An $n$-bit parallel adder requires $n$ full adders.
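The cascade can be modelled bit by bit, as in the Python sketch below. The function names are assumptions of this example, and the model captures the logic only, not timing.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, n, cin=0):
    """Add two unsigned n-bit numbers with n cascaded full adders."""
    carry = cin
    total = 0
    for i in range(n):                  # the carry ripples from bit 0 upward
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry
```

Because each stage waits for the carry of the previous one, the delay of this structure grows linearly with $n$, which is the motivation for the parallel-prefix adders discussed next.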

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using a Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


Table 1: Comparison of multipliers for delay, power, and area.

Type of        Delay in ns                Power in mW                Number of LUTs
multiplier     32 bit   16 bit   8 bit    32 bit   16 bit   8 bit    32 bit   16 bit   8 bit
Array          761      399      21       21       11       7        1519     375      91
Booth          861      2799     149      25       15       12       1277     317      77
Vedic          707      3902     244      28       18       12       2378     565      126

The basic equations used in parallel prefix adders are given below. The equations of bit generate and propagate are

$$G_{0:0} = G_0 = c_{in}, \qquad P_{0:0} = P_0 = 0,$$
$$G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \qquad P_{i:j} = P_{i:k} \cdot P_{(k-1):j}. \tag{16}$$

The sum generation is given by

$$S_i = P_i \;\mathrm{XOR}\; G_{(i-1):0}. \tag{17}$$
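Equations (16) and (17) can be exercised with a small Python model. The sketch below combines the group generate and propagate signals in a Kogge-Stone arrangement, chosen here only because it is short to write; it is not the Brent-Kung or Ling netlist evaluated in this paper, and the function name `prefix_add` is an assumption. Position 0 holds the carry-in, as in (16), and operand bit $i$ sits at position $i + 1$.

```python
def prefix_add(a, b, n, cin=0):
    """Add two unsigned n-bit numbers using prefix generate/propagate signals."""
    # Position 0 is the carry-in: G_0 = c_in, P_0 = 0.
    g = [cin] + [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    G, P = g[:], p[:]
    span = 1
    while span <= n:                    # log2 stages, doubling the span each time
        new_G, new_P = G[:], P[:]
        for i in range(span, n + 1):
            new_G[i] = G[i] | (P[i] & G[i - span])   # G_{i:j} = G_{i:k} + P_{i:k} G_{k-1:j}
            new_P[i] = P[i] & P[i - span]            # P_{i:j} = P_{i:k} P_{k-1:j}
        G, P = new_G, new_P
        span *= 2

    # G[i] is now G_{i:0}, the carry out of position i.
    total = 0
    for i in range(1, n + 1):
        total |= (p[i] ^ G[i - 1]) << (i - 1)        # S_i = P_i XOR G_{(i-1):0}
    return total, G[n]
```

The number of stages grows with $\log_2 n$ instead of $n$, which is where the speed advantage over the ripple adder comes from; the different prefix topologies differ in how many combine cells and wires they spend to get there.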

Different adders are designed for different bit sizes and their VLSI design metrics are compared as shown in Table 2. The delay reported is the combinational path delay after synthesis. It is measured in ns.

In the GUI an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area, and also based on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is created for multiplierless filter design description as well, based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for the same. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{cub}$ based algorithm is considered for implementing MCM blocks in multiplierless digital filters for a specific user defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, they are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block will have a tree depth in terms of different components, and this depth in our work is assumed to be infinity. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if reiteration of the design cycle happens due to a specification change, the time taken to reiterate is very little.

Figure 12: GUI for design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block; the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, including recursive and nonrecursive filter structures. The target device considered for path solver implementation is the Spartan6 family based XC6SSLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the synthesis and timing results after synthesis are shown in Table 1. Offloading the critical path and shortest path computations to the FPGA in this way reduces the burden on the main processor (Table 3).


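Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns, 32 bit | Delay in ns, 16 bit | Delay in ns, 8 bit | Power in mW, 32 bit | Power in mW, 16 bit | Power in mW, 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |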

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Number of slices | Number of LUTs | Number of slice flip-flops | Min period in ns | Setup time in ns | Hold time in ns | Max frequency (MHz) |
|---|---|---|---|---|---|---|---|
| Critical path solver | 5804 | 10462 | 3664 | 9.068 | 1.572 | 6.141 | 110.277 |
| Shortest path solver | 4147 | 7511 | 1496 | 14.089 | 10.477 | 4.114 | 70.978 |

Here, various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the general purpose processor based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the same algorithms run on the general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, since Table 4 reports FPGA times in ns against GPP times in ms. This gain speeds up the path computations on which retiming depends.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count is altered depending on the filter's iteration bound. Here three models are considered: Model 1 is the filter without retiming, with adder, subtractor, multiplier, and delay elements; Model 2 is the retimed filter based on the clock period minimization algorithm; Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the cost of a compromise in the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks. [Bar chart; x-axis: IIR and FIR filters of orders 2 to 12; y-axis: 0 to 40; series: clock period and register count for Models 1, 2, and 3.]

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, the user can choose any of the available components for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

| Filter order | Critical path solver, IIR, FPGA based (ns) | Critical path solver, IIR, GPP based (ms) | Critical path solver, FIR, FPGA based (ns) | Critical path solver, FIR, GPP based (ms) | Shortest path solver, IIR, FPGA based (ns) | Shortest path solver, IIR, GPP based (ms) | Shortest path solver, FIR, FPGA based (ns) | Shortest path solver, FIR, GPP based (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 460 | 138 | 906 | 1283 | 278 | 305 | 305 | 1280 |
| 4 | 1571 | 1578 | 1631 | 1446 | 368 | 1391 | 1391 | 1319 |
| 6 | 2998 | 1918 | 1923 | 1547 | 398 | 1542 | 1542 | 1731 |
| 8 | 3162 | 2190 | 2971 | 1642 | 452 | 2523 | 2523 | 3294 |
| 10 | 3981 | 2627 | 3653 | 1861 | 536 | 4293 | 4293 | 4534 |
| 12 | 4672 | 3142 | 4328 | 2352 | 671 | 5534 | 5534 | 5161 |

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

| Filter block | Adder/multiplier combination | Before retiming: number of LUTs | Before retiming: max operating freq in MHz | Before retiming: power in mW | After retiming: number of LUTs | After retiming: max operating freq in MHz | After retiming: power in mW |
|---|---|---|---|---|---|---|---|
| IIR-10 | Brent-Kung adder/array multiplier | 2222 | 62526 | 99 | 2411 | 76977 | 89 |
| IIR-10 | Ling adder/Vedic multiplier | 2214 | 69702 | 112 | 2193 | 95381 | 94 |
| IIR-10 | Ripple carry adder/Booth multiplier | 2146 | 50861 | 114 | 1809 | 65248 | 95 |
| FIR-10 | Brent-Kung adder/array multiplier | 1736 | 62526 | 94 | 1811 | 9943 | 85 |
| FIR-10 | Ling adder/Vedic multiplier | 2162 | 72493 | 111 | 2271 | 10072 | 95 |
| FIR-10 | Ripple carry adder/Booth multiplier | 1637 | 52302 | 105 | 1615 | 71345 | 87 |

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is retained before and after retiming. The frequency improvement seen for different filters across the above models is given in Figure 14. It is seen that the operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. [Bar chart; x-axis: filter type (FIR-2 to FIR-12, IIR-2 to IIR-12); y-axis: frequency improvement, 0 to 90; series: frequency improvement from model 1 to model 2 and from model 1 to model 3.]

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noises such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested with an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Table 6: Comparison of area, delay, and power for different models of various digital filters.

| Filter block | Adders, multipliers, flip-flops/delays (Model 1) | Adders, multipliers, flip-flops/delays (Model 2) | Adders, multipliers, flip-flops/delays (Model 3) | Max freq in MHz (Model 1) | Max freq in MHz (Model 2) | Max freq in MHz (Model 3) | Power in watts (Model 1) | Power in watts (Model 2) | Power in watts (Model 3) |
|---|---|---|---|---|---|---|---|---|---|
| FIR-2 | 5, 2, 3 | 5, 0, 3 | 5, 0, 4 | 5154 | 19214 | 34062 | 0.056 | 0.063 | 0.065 |
| FIR-4 | 10, 3, 5 | 11, 0, 5 | 11, 0, 8 | 5941 | 10841 | 22204 | 0.047 | 0.057 | 0.060 |
| FIR-6 | 7, 2, 7 | 17, 0, 7 | 17, 0, 14 | 6291 | 6764 | 25947 | 0.051 | 0.062 | 0.064 |
| FIR-8 | 15, 5, 9 | 22, 0, 9 | 22, 0, 16 | 5482 | 6592 | 11791 | 0.054 | 0.058 | 0.065 |
| FIR-10 | 18, 6, 11 | 25, 0, 11 | 25, 0, 11 | 4822 | 5637 | 10072 | 0.058 | 0.061 | 0.063 |
| FIR-12 | 20, 7, 13 | 29, 0, 13 | 29, 0, 13 | 4634 | 5486 | 19340 | 0.060 | 0.063 | 0.067 |
| IIR-2 | 9, 4, 3 | 11, 0, 3 | 11, 0, 3 | 5503 | 7553 | 8910 | 0.047 | 0.050 | 0.050 |
| IIR-4 | 16, 7, 5 | 20, 0, 5 | 19, 0, 6 | 2278 | 11388 | 15165 | 0.059 | 0.062 | 0.063 |
| IIR-6 | 24, 11, 7 | 35, 0, 7 | 35, 0, 8 | 3871 | 4254 | 53142 | 0.051 | 0.059 | 0.058 |
| IIR-8 | 30, 11, 10 | 33, 0, 10 | 33, 0, 7 | 2946 | 7014 | 11021 | 0.044 | 0.064 | 0.081 |
| IIR-10 | 37, 16, 13 | 54, 0, 13 | 54, 0, 14 | 3643 | 4885 | 95381 | 0.051 | 0.067 | 0.085 |
| IIR-12 | 42, 20, 17 | 63, 0, 17 | 63, 0, 19 | 3973 | 5074 | 10152 | 0.063 | 0.071 | 0.088 |

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded. [Block diagram with ECG and noise DSP source blocks, an adder, a filter block, and a scope.]
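As a rough illustration of how a digital filter can suppress power-line interference, the sketch below implements a second-order IIR notch in plain Python. It is not the retimed multiplierless MCM filter described in this paper: the function names, the pole radius r = 0.95, the 500 Hz sampling rate, and the use of floating-point arithmetic are all assumptions chosen for illustration only.

```python
import math

def notch_coefficients(f0, fs, r=0.95):
    """Second-order notch: zeros on the unit circle at f0, poles at radius r.

    f0 is the frequency to reject and fs the sampling rate, both in Hz.
    The pole radius r (assumed value) sets how narrow the notch is.
    """
    w0 = 2.0 * math.pi * f0 / fs
    b = [1.0, -2.0 * math.cos(w0), 1.0]          # numerator (zeros)
    a = [1.0, -2.0 * r * math.cos(w0), r * r]    # denominator (poles)
    return b, a

def iir_filter(b, a, x):
    """Direct form I biquad: y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2]
    - a1 y[n-1] - a2 y[n-2]."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

# Hypothetical test signals: a 50 Hz hum and a slow 5 Hz component
# standing in for the ECG content.
fs = 500.0
b, a = notch_coefficients(50.0, fs)
hum = [math.sin(2.0 * math.pi * 50.0 * n / fs) for n in range(1000)]
slow = [math.sin(2.0 * math.pi * 5.0 * n / fs) for n in range(1000)]
hum_out = iir_filter(b, a, hum)
slow_out = iir_filter(b, a, slow)
```

Once the start-up transient has decayed, the 50 Hz component is removed while the slow component passes with a gain close to one. A hardware version would replace the coefficient multiplications with fixed-point shift-and-add networks, which is where the MCM approach applies.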

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks which can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves much of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and shortest path computations in retiming, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain compared to designs in which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that best fits an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456-467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373-398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358-375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551-555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1-21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1-6, pp. 5-35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371-375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597-601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831-844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480-485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150-155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674-2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74-83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449-453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738-748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273-278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001-1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65-II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225-231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013-1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579-584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17-22, IEEE, 2013.


Figure 2: 4th-order elliptic filter after clock period minimization retiming: (a) DFG after retiming; (b) clock period and register count before and after retiming. [(a) DFG of adder and multiplier nodes numbered 1 to 17 with edge delay counts; (b) bar chart of clock period in time units and register count for the filter before retiming and after clock period minimization retiming.]

Figure 3: (a) Graph before register sharing. (b) Graph after register sharing.

map retimed circuits on to FPGAs efficiently. However, in this paper the authors suggest a method for an efficient retiming process using FPGA based path solvers. This can be applied to any retiming technique available in the literature. The shortest path is solved in the filter DFG using the Floyd-Warshall algorithm, which uses a dynamic programming approach to solve the shortest-paths problem on a DFG. The Floyd-Warshall algorithm can solve the shortest path problem in $O(n^3)$ time, where $n$ is the number of nodes in the DFG. Let the vertices in the graph be numbered $1, 2, \ldots, n$, and consider the subset $\{1, 2, \ldots, k\}$ of these $n$ vertices. Let $d_{ij}^{(k)}$ denote the weight of the shortest path $p$ from vertex $i$ to vertex $j$ such that all intermediate vertices are contained in the set $\{1, 2, \ldots, k\}$. Then there are two situations possible:

(i) $k$ is an intermediate vertex on the shortest path; that is, the path $p$ is decomposed into $i \rightarrow k \rightarrow j$;

(ii) $k$ is not an intermediate vertex on the shortest path.

If the vertex $k$ is not an intermediate vertex on $p$, then

$$d_{ij}^{(k)} = d_{ij}^{(k-1)}, \quad \text{else} \quad d_{ij}^{(k)} = d_{ik}^{(k-1)} + d_{kj}^{(k-1)}. \tag{5}$$

In either case, the subpaths contain only nodes from $\{1, 2, \ldots, k-1\}$. Therefore,

$$d_{ij}^{(k)} = \min\left(d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\right). \tag{6}$$

When $k = 0$,

$$d_{ij}^{(0)} = W_{ij}, \quad \text{and if } k \geq 1, \text{ then} \quad d_{ij}^{(k)} = \min\left(d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\right). \tag{7}$$

Let $D$ be the incidence matrix, initialized with the graph edge weight information $W$. $D$ is then updated with the calculated shortest paths; see Algorithm 1.

The final $D$ matrix stores the weights of all the shortest paths. This algorithm is extended for retiming of digital filters.
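The recurrence can be sketched in software as follows. This is a minimal Python illustration, not the FPGA implementation discussed in this paper; the function name, the use of `math.inf` for missing edges, the zero diagonal, and the three-node example graph are assumptions made for the example.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest path weights for a graph given as an n x n matrix.

    weights[i][j] is the weight of edge i -> j, or math.inf if there is no
    such edge. The input is not modified.
    """
    n = len(weights)
    dist = [row[:] for row in weights]           # D(0) = W
    for i in range(n):
        dist[i][i] = min(dist[i][i], 0)          # assumed: empty path costs 0
    for k in range(n):                           # allow vertex k as intermediate
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist

# Hypothetical 3-node graph: 0 -> 1 (3), 1 -> 2 (1), 2 -> 0 (2).
INF = math.inf
W = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
D = floyd_warshall(W)
```

The matrix is updated in place across the iterations over `k`, which is the usual space-saving form of the recurrence: row `k` and column `k` do not change during iteration `k` when the graph has no negative cycles.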

The multiple constant multiplication (MCM) problem is addressed in the literature [14] using either graph based methods or the common subexpression elimination method. In the common subexpression elimination algorithm, all possible subexpressions are extracted for a variable. This is possible only if the variable is defined in minimum signed digit or canonical signed digit form. Then the subexpression is found such that it can be shared by multiple constant multiplication values. In this paper the above two concepts are extended for automatic pipelining and retiming of digital filters in high level synthesis. In all the digital filters, the filter coefficients are known beforehand. Hence the full flexibility of the multiplier is not necessary, and we can make use of MCM designs. This method is more efficient than shift and add multiplication, as intermediate results can be shared, which reduces the area of the multiplierless implementation of digital filters. The sharing of intermediate results provides a potential area saving that grows with filter order (Figure 5).

Figure 4: 4th-order elliptic filter after register minimization retiming: (a) DFG after retiming; (b) clock period and register count before and after retiming. [(a) DFG of adder and multiplier nodes numbered 1 to 17 with edge delay counts; (b) bar chart of clock period in time units and register count for the filter before retiming and after register minimization retiming.]

n = number of rows in W; D(0) = W
for k = 1 to n
  for i = 1 to n
    for j = 1 to n
      D(k)[i][j] = min(D(k-1)[i][j], D(k-1)[i][k] + D(k-1)[k][j])
    end for
  end for
end for
return D(n)

Algorithm 1

Consider the filter coefficient set which is to be used for the filter design, given by $T = \{c_1, c_2, c_3, \ldots, c_n\}$. We need to find the smallest set $S$ given by $\{a_1, a_2, a_3, \ldots, s_1, s_2, s_3, \ldots\}$, where the $a$ (adders/subtractors) and $s$ (shifts) are elements of $S$, such that the set is made of adders/subtractors, shifters, and $A$-operations. Shift operations can also be shared across multiple points so that the output set is optimum. Here the Hcub algorithm [8] is used to generate the corresponding DFG for the multiplier block implementing the parallel multiplications $c_1 * x, c_2 * x, \ldots, c_n * x$. The only operations used in the generated DAG and input design matrices are additions, subtractions, shifts, and negations. In this paper, the performance of MCM based filter designs is further improved by combining this approach with retiming. The multiplierless filter circuit is further retimed to reduce the overall clock period, which increases the clock frequency.

Consider $l_1$ and $l_2$ as two integers which specify left shifts, let $r \geq 0$ specify a right shift, and let $s$ be the sign bit, which can be 0 or 1. An $A$-operation is an operation with two integer inputs $u$ and $v$ and one fundamental output, defined as

$$A_p(u, v) = \left| (u \ll l_1) + (-1)^s (v \ll l_2) \right| \gg r = \left| 2^{l_1} u + (-1)^s 2^{l_2} v \right| 2^{-r}, \tag{8}$$

where $\ll$ is a left binary shift, $\gg$ is a right binary shift, and $p = (l_1, l_2, r, s)$ is the parameter set or the $A$-configuration of $A_p$. To preserve all significant bits of the output, $2^r$ must divide $2^{l_1} u + (-1)^s 2^{l_2} v$. The left shifts are limited to the bit width of the target. All $A$-operations are used to build an $A$-graph. For a given set of target filter coefficients $C$, we can find the set $S$ such that a multiplierless digital filter is designed.


Figure 5: Example for addressing MCM problem in digital filters. [(a) MCM block with input Y and outputs k1Y to k5Y; (b) shift-and-add realization using the shifts ≪3, ≪5, and ≪1, with outputs O1 = 7Y = 8Y - 1Y, O2 = 31Y = 32Y - 1Y, and O3 = 45Y = 31Y + 14Y.]
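The shift-and-add sharing in the Figure 5 example can be checked with a short script. The sketch below is an illustrative Python rendering of the $A$-operation in (8), not code from the DiFiDOT tool; the function name `a_operation` and its argument order are assumptions. It builds the constants 7, 31, and 45 from a unit input using only shifts, additions, and subtractions, reusing 7Y to form 14Y.

```python
def a_operation(u, v, l1, l2, r, s):
    """A-operation: |(u << l1) + (-1)**s * (v << l2)| >> r.

    l1 and l2 are left shifts, r >= 0 is a right shift, and s in {0, 1}
    selects addition (0) or subtraction (1).
    """
    total = (u << l1) + (-1) ** s * (v << l2)
    magnitude = abs(total)
    if magnitude % (1 << r) != 0:
        # 2**r must divide the sum so that no significant bits are lost
        raise ValueError("2**r must divide the unshifted result")
    return magnitude >> r

y = 1                                    # unit input, so outputs are the constants
o1 = a_operation(y, y, 3, 0, 0, 1)       # 8Y - Y = 7Y
o2 = a_operation(y, y, 5, 0, 0, 1)       # 32Y - Y = 31Y
o3 = a_operation(o2, o1, 0, 1, 0, 0)     # 31Y + (7Y << 1) = 45Y
```

Three adder/subtractor operations produce all three products, whereas multiplying by 7, 31, and 45 independently with shift and add would not reuse the intermediate 7Y term.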

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the maximum iteration period in the filter design [12]. This can be reduced by retiming, in which the clock period is reduced and the clock speed increases. To reduce the critical path, we need to find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. During retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects. The designer can program an FPGA to work like specific hardware. FPGAs give a large speedup over general purpose processors for many long running algorithms. Hence, for high performance systems, FPGAs become a better choice. In the present work, path solvers are implemented on FPGA to increase the performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum delay path between the output node and the node causing the state change of the output node, with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one of the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. A dedicated FPGA hardware block can speed up this process at low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$, which is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subset N$ and $O \subset N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. In this procedure of computing the critical path on FPGA, the vertices are sorted such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to a previous one, then its path length is the sum of its own computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m_i}, \tag{9}$$

where $N$ is the number of adder and multiplier elements in the topologically sorted vertices connected with zero delay edges. The delay of the circuit is given by $t_d = \max(\gamma_m)$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment, the steps below are used for critical path computation.

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found, and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows hold the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path, and its computational time is the critical path delay.
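A software counterpart of this computation can be sketched with a topological sort over the zero-delay edges followed by a longest-path pass. The Python below is an illustrative sketch under stated assumptions, not the FPGA design: the function name, the dictionary and tuple input formats, and the four-node example DFG are hypothetical, and the zero-delay subgraph is assumed to be acyclic, as it must be in a valid synchronous DFG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def critical_path(node_time, edges):
    """Return (delay, path) of the longest zero-delay path in a DFG.

    node_time maps each node to its computation time; edges is a list of
    (source, destination, delay) tuples.
    """
    # Predecessors of each node over zero-delay edges only.
    preds = {v: [] for v in node_time}
    for u, v, delay in edges:
        if delay == 0:
            preds[v].append(u)

    gamma = {}       # longest zero-delay path ending at each node
    best_pred = {}   # predecessor on that longest path, or None
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        prev = max(preds[v], key=lambda u: gamma[u], default=None)
        gamma[v] = node_time[v] + (gamma[prev] if prev is not None else 0)
        best_pred[v] = prev

    # Walk back from the node with the largest accumulated delay.
    node = max(gamma, key=gamma.get)
    delay = gamma[node]
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return delay, path[::-1]

# Hypothetical DFG: adders a1, a2 take 1 time unit, multipliers m1, m2 take 2.
times = {"a1": 1, "a2": 1, "m1": 2, "m2": 2}
dfg_edges = [("m1", "a1", 0), ("a1", "a2", 0), ("a2", "m2", 1), ("m2", "m1", 1)]
cp_delay, cp_nodes = critical_path(times, dfg_edges)
```

Here the only zero-delay chain is m1, a1, a2, so the critical path delay is 2 + 1 + 1 = 4 time units. `TopologicalSorter` raises `CycleError` if the zero-delay edges form a cycle, which flags an invalid graph.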

A critical path solver algorithm is designed on FPGA in the present work. The state diagram for the implemented critical path solver is given in Figure 6. In S0, the filter graph or matrix is given as input to the critical path solver module.


Figure 6: Critical path solver state diagram. [State machine with states S0 to S8; transition labels: incidence and node matrix obtained; zero-weight path updated; loop indices updated; superset of zero-weight path found; weight temp updated; all zero path delay found; next greater path delay found; all path delay found; greatest path delay found.]

Algorithm for computing the critical path
Input: a DFG G = (V, E, t, d), where t is the computation time of the node and d is the initial delay on edge E
Output: critical path C
Sort all the vertices topologically in the DFG G, with v following u if there is a zero delay edge from u → v
for all vertices v from the sorted list
  if every edge entering v in G has nonzero delay then
    γ_v = t_v
  else
    γ_v = t_v + max(γ_u) over the edges e: u → v in G with d_e = 0
  end if
  γ = {γ_1, γ_2, ..., γ_m}, where m = number of entries in the topologically sorted list
end for
compute γ = max(γ)

Algorithm 2

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say -1, can be used to denote an uninitialized matrix element. In state S1, all the zero-weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called zero weight path. The zero weight path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column has the destination node of the directed zero-weight edge. Simultaneously, a count of the number of zero-weight edges is kept.

State $S_2$ is provided to enable looping action and to update all the signals. In state $S_3$, in each row of the zero weight path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node of any zero-weight path is the same as the source node of another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then the destination node of $b$ becomes the destination node of $a$. State $S_4$ is provided to enable looping action and to update all the signals. At the end of this state, the zero weight path matrix contains only the superset paths, that is, those paths that are a superset of the remaining zero-weight paths. In state $S_5$, the module calculates the sum of all the node weights along each of these paths. State $S_6$ is provided for looping action and for updating all signals.
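The work done in states $S_1$ to $S_5$ can be mimicked in software as a rough sketch. The helper names below are hypothetical, the sentinel 255 follows the convention stated in the text, and the path concatenation is done here by a recursive walk over zero-weight edges, which assumes the zero-weight subgraph has no cycles; the HDL state machine iterates over matrix rows instead.

```python
INF = 255  # stands in for "no edge", as suggested in the text


def zero_weight_edges(weights):
    """State S1: collect [source, destination] pairs of zero-weight edges."""
    n = len(weights)
    return [[u, v] for u in range(n) for v in range(n) if weights[u][v] == 0]


def superset_paths(zero_edges):
    """States S3/S4: chain edges whose destination is another edge's source."""
    succ = {}
    for u, v in zero_edges:
        succ.setdefault(u, []).append(v)
    destinations = {v for _, v in zero_edges}
    starts = [u for u in succ if u not in destinations]
    paths = []

    def extend(path):
        followers = succ.get(path[-1], [])
        if not followers:            # nothing left to concatenate: maximal path
            paths.append(path)
        for v in followers:
            extend(path + [v])

    for s in starts:
        extend([s])
    return paths


def heaviest_path(paths, node_time):
    """States S5 to S7: sum node weights per path and keep the largest."""
    return max(paths, key=lambda p: sum(node_time[v] for v in p))


# Hypothetical 4-node DFG: 0->1 and 1->2 and 0->3 have zero delay, 2->0 has one delay.
weights = [[INF, 0, INF, 0],
           [INF, INF, 0, INF],
           [1, INF, INF, INF],
           [INF, INF, INF, INF]]
node_time = [1, 2, 1, 2]
paths = superset_paths(zero_weight_edges(weights))
print(paths, heaviest_path(paths, node_time))
```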

In state $S_7$, the path with the highest sum of node weights is found; this is the critical path of the DFG. All the nodes in this path are then stored, in order, in a matrix called the critical


Algorithm 3: Computing the all-pair shortest paths.

Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of a node and $d$ is the initial delay on an edge in $E$.
Output: all-pair shortest path matrix $M$.
for i = 1 to N
    for j = 1 to N
        if i = j then M[i, j] = 0
        else M[i, j] = inf
    end for
end for
for all edges $e: u \to v$, M[u, v] = d for edge e
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
            end if
        end for
    end for
end for
Output: shortest path matrix M.
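A direct software transcription of Algorithm 3 is given below as a minimal sketch. It uses `math.inf` where the HDL version would use a large sentinel such as 255, and the function name and the small example graph are assumptions made for illustration.

```python
import math


def floyd_warshall(n, edges):
    """n: number of nodes (0-based); edges: [(u, v, delay), ...]."""
    # Initialise: 0 on the diagonal, infinity elsewhere.
    m = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, delay in edges:
        m[u][v] = min(m[u][v], delay)       # keep the lightest parallel edge
    # Relax every pair (i, j) through every intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][j] > m[i][k] + m[k][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m


# Hypothetical 3-node graph with edge weights equal to register counts.
edges = [(0, 1, 1), (1, 2, 0), (2, 0, 2), (0, 2, 3)]
for row in floyd_warshall(3, edges):
    print(row)
```

The three nested loops make the $O(n^3)$ cost explicit, which is the part the FPGA path solver takes over from the processor.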

Figure 7: Zero path delays and critical path for 4th-order low-pass elliptic filter. [Plot; horizontal axis: vertex number in filter DFG; vertical axis: zero delay path for filter DFG.]

path matrix. The signals in this matrix are output as the critical path. State $S_8$ is provided to enable looping action and to update all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, the shortest path is needed to solve a system of inequalities. It is seen that the time needed to compute the critical path on FPGA is less than that needed on a general purpose processor, which also reduces the retiming computation time. The zero-delay paths computed for the 4th-order elliptic filter are shown in Figure 7. The highlighted path delay is from $1 \to 9 \to 5 \to 14 \to 8$, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$, and let $T(u, v)$ be the total computation time of the zero-delay path from $u$ to $v$. If the condition $T(u, v) - \min\{t(u), t(v)\} > \text{derived clock period}$ holds, those paths are selected for retiming so that the computation time along them can be reduced. The edges are retimed by constructing a system of linear inequalities, which can be solved using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all-pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden on the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$, and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight temp. In state $S_2$, the signal weight temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$, and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight temp. In state $S_4$, the elements of the signal matrix weight temp are copied to the output matrix.


Figure 8: Shortest path solver state diagram. [State diagram; state labels recovered from the figure: S0, S1, S2, S3, S4; transition labels: incidence matrix copied to weight temp; weight temp updated; loop indices updated; pairwise shortest path found; all pair shortest paths found; weight temp complete array to output matrix; weight temp copied to output matrix.]

State $S_5$ enables looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find a ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small, together with A-operations composed of a minimum number of addition, subtraction, and shift operations. After this target set is obtained, multiplierless multiple constant multiplication filters can be designed with this target set. Multiple constant multiplication (MCM) is an efficient way of implementing


Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter. [Block diagrams built from multiplier blocks, adders, and $z^{-1}$ delay elements between input and output.]

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature; the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and A-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{\mathrm{cub}}$ algorithm is used for the computation of $G_i$. The input to the $H_{\mathrm{cub}}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration a successor from the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{s \mid \operatorname{dist}(R, s) = 1\} = A_s(R, R). \tag{11}$$
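To make the successor set concrete, the following sketch enumerates constants that are one A-operation away from a ready set. It is a simplified reading and not the exact procedure of $H_{\mathrm{cub}}$: it assumes an A-operation forms $|2^{l} u \pm v|$ for $u, v \in R$, reduces the result to its odd part, and bounds results by $2^{b+1}$ for a hypothetical bit width $b$. The function names are illustrative.

```python
def odd_part(x):
    """Strip trailing zero bits, since a final shift is free in hardware."""
    while x and x % 2 == 0:
        x //= 2
    return x


def successor_set(ready, bits):
    """Constants reachable from `ready` with one add/subtract of shifted terms."""
    limit = 1 << (bits + 1)
    successors = set()
    for u in ready:
        for v in ready:
            for shift in range(bits + 2):
                for candidate in ((u << shift) + v, abs((u << shift) - v)):
                    c = odd_part(candidate)
                    if 0 < c <= limit and c not in ready:
                        successors.add(c)
    return successors


def times7(x):
    """7x with one shift and one subtraction instead of a multiplier."""
    return (x << 3) - x


print(sorted(successor_set({1}, bits=3)))
print(times7(5))
```

Starting from the ready set $\{1\}$, the constant 7 is one operation away because $7 = 2^3 - 1$, whereas 11 is not in the successor set and needs a second operation.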

For the target set of constants $T$ of the considered filter graph $G$, the $H_{\mathrm{cub}}$ algorithm computes a set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. If the targets are found in $S$, the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$, which happens when all the targets are more than one A-operation away. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized directly. If the entire target set is synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit,
(ii) cumulative benefit.

To build the heuristic, we can define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \tag{12}$$

A successor $s \in S$ that is closest to the target set needs to be picked to minimize the cost. This is possible if we can compute or estimate the A-distance. It is also useful to take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we can use a weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left(\operatorname{dist}(R, t) - \operatorname{dist}(R + s, t)\right), \tag{13}$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{\mathrm{cub}}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \tag{14}$$
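The selection rule can be illustrated with a short sketch. It assumes the weight is $10^{-\operatorname{dist}(R+s,t)}$, so that it shrinks as the remaining distance grows, as the text describes. The distance function is passed in by the caller; `toy_dist` below is a deliberately crude stand-in (the bit count of the difference to the nearest ready constant) and is not the A-distance estimate used by $H_{\mathrm{cub}}$. All names are hypothetical.

```python
def benefit(dist, ready, s, t):
    """B(R, s, t): how much adding s shortens the distance to target t."""
    return dist(ready, t) - dist(ready | {s}, t)


def weighted_benefit(dist, ready, s, t):
    """B_b(R, s, t): benefit discounted for targets that remain far away."""
    remaining = dist(ready | {s}, t)
    return 10.0 ** (-remaining) * (dist(ready, t) - remaining)


def h_cub(dist, ready, successors, targets):
    """Pick the successor with the largest cumulative weighted benefit."""
    return max(sorted(successors),
               key=lambda s: sum(weighted_benefit(dist, ready, s, t)
                                 for t in targets))


def toy_dist(ready, t):
    # Toy stand-in for a distance estimate, NOT the real A-distance.
    return min(bin(abs(t - r)).count("1") for r in ready)


ready, successors, targets = {1}, {3, 5, 7}, {45, 23}
print(h_cub(toy_dist, ready, successors, targets))
```

With these toy numbers the successor 7 wins because it brings the target 23 within one step, and nearby targets are weighted more heavily than remote ones.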

Here the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this particular method, the target set is calculated. With this target set, a filter graph that is multiplierless and MCM based can be designed. It is found that multiplierless designs reduce the combinational path delays owing to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combination delay and critical path


Figure 10: Multiplierless MCM based 4th-order elliptic filter. [Block diagram built from adders, subtractors, $z^{-1}$ delay elements, and left-shift blocks with shift amounts of 1, 2, 3, 5, and 6 bits between input and output.]

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{\mathrm{cub}}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combination delay is reduced. It is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces the computation time when compared to previous methods [8, 16] to design multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also presents a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated using the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in the multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
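The last step of the retiming procedure, computing each retimed edge weight from the original weight and the retiming values of its two end nodes, can be sketched as follows. The sketch assumes the standard retiming relation $w_r(e) = w(e) + r(v) - r(u)$ for an edge $e: u \to v$; the function name and the three-node loop are illustrative and not taken from this paper.

```python
def retime(edges, r):
    """edges: [(u, v, delays)]; r: {node: retiming value}. Returns retimed edges."""
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]               # w_r(e) = w(e) + r(v) - r(u)
        if w_r < 0:                         # feasibility: no negative register counts
            raise ValueError(f"infeasible retiming on edge {u}->{v}")
        retimed.append((u, v, w_r))
    return retimed


# Hypothetical 3-node loop with two registers in total.
edges = [(0, 1, 1), (1, 2, 0), (2, 0, 1)]
r = {0: 0, 1: -1, 2: 0}
print(retime(edges, r))
```

The register on edge $0 \to 1$ moves to edge $1 \to 2$, while the total number of registers around the loop stays the same, which is why retiming preserves functionality.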

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as a part of the research work. Initially, the design of filters is performed using a retimed architecture, where the user can choose either clock period


Figure 11: Comparison of operating frequency and number of registers for different filter designs of 4th-order elliptic filter. [Bar chart over Design1 to Design4; series: frequency in MHz and number of registers.]

minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user inputs the HDL file that was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop-down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of choice. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are as below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper, we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The operands and their product $P$ are given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage, the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \times n$ AND gates, $n$ half adders, and $(m - 2) \times n$ full adders.
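The double sum in (15) can be checked with a small bit-level model. This is an illustrative sketch, not RTL: each AND of two bits stands for one AND gate of the array, and the row additions that hardware performs with half and full adders are done here with ordinary integer addition. The function names are assumptions.

```python
def array_multiply(a, b, n):
    """Multiply two unsigned n-bit integers from their AND-ed partial products."""
    product = 0
    for j in range(n):
        b_j = (b >> j) & 1
        partial = 0
        for i in range(n):
            a_i = (a >> i) & 1
            partial |= (a_i & b_j) << i     # one AND gate per (i, j) pair
        product += partial << j             # row j is weighted by 2^j
    return product


def gate_counts(m, n):
    """Component counts quoted in the text for an m x n array multiplier."""
    return {"and_gates": m * n, "half_adders": n, "full_adders": (m - 2) * n}


print(array_multiply(13, 11, 4))
print(gate_counts(8, 8))
```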

Radix 4 Booth Multiplier. It has the advantage of lesser area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, and each string is converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising 5:1 multiplexers), and finally a ripple-carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added, instead of the 32 partial products generated by a conventional multiplier.
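The halving of the partial products can be seen in a short model of radix-4 Booth recoding. The sketch below uses the standard recoding of overlapping three-bit groups into digits from $\{-2, -1, 0, 1, 2\}$; it models only the arithmetic, not the encoder, sign-extension, or multiplexer hardware described above, and the function names are illustrative.

```python
def booth_radix4_digits(y, n):
    """Recode a signed n-bit multiplier (n even) into n/2 digits in {-2..2}."""
    assert n % 2 == 0 and -(1 << (n - 1)) <= y < (1 << (n - 1))
    pattern = y & ((1 << n) - 1)            # two's complement bit pattern
    digits, prev = [], 0                    # prev is the bit right of the group
    for k in range(0, n, 2):
        b0 = (pattern >> k) & 1
        b1 = (pattern >> (k + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, n):
    """Sum one partial product per recoded digit, each weighted by 4^k."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(7, 4))
print(len(booth_radix4_digits(123456789, 32)))
print(booth_multiply(-5, 7, 4))
```

For a 32-bit multiplier the recoding yields 16 digits, hence the sixteen partial products mentioned in the text.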

Vedic Multiplier. It is used for faster multiplication operations on higher-order bits. It has less combinational path delay [21] compared with the others when the bit size is higher. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam Sutra, meaning vertically and crosswise, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.
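The vertical and crosswise pattern can be sketched at bit level as below. The loop is sequential only for illustration; in hardware all crosswise products of a column are formed concurrently. The function name is hypothetical, and the sketch covers unsigned operands only.

```python
def urdhva_multiply(a, b, n):
    """Multiply two unsigned n-bit integers column by column (crosswise sums)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry = 0, 0
    for k in range(2 * n - 1):
        # Column k collects every product a_i * b_j with i + j == k.
        column = carry + sum(a_bits[i] & b_bits[k - i]
                             for i in range(n) if 0 <= k - i < n)
        result |= (column & 1) << k         # low bit stays in this column
        carry = column >> 1                 # the rest moves to the next column
    return result | (carry << (2 * n - 1))


print(urdhva_multiply(13, 11, 4))
```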

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit adder is constructed by cascading full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage. An $n$-bit parallel adder requires $n$ full adders.
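The cascade described above can be modelled in a few lines. This is a behavioural sketch with illustrative names; it shows why the worst-case delay grows with the bit width, since each stage waits for the carry from the stage below it.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit unsigned integers with n chained full adders."""
    carry, total = cin, 0
    for i in range(n):                      # stage i consumes the carry of stage i-1
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(200, 100, 8))
```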

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using the Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


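Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay in ns, 32 bit | Delay in ns, 16 bit | Delay in ns, 8 bit | Power in mW, 32 bit | Power in mW, 16 bit | Power in mW, 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |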

The basic equations used in parallel prefix adders are given below. The equations of bit generate and propagate are

$$\begin{aligned}
G_{0:0} &= G_0 = c_{\mathrm{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned} \tag{16}$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \tag{17}$$
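The prefix equations can be exercised with a small behavioural model. The sketch below uses a Kogge-Stone style network, which is one possible prefix structure and is chosen here only because it is short to write; it is not claimed to be the Brent-Kung or Ling design evaluated in this paper. Position 0 of the lists holds the carry-in, matching $G_{0:0} = c_{\mathrm{in}}$ and $P_{0:0} = 0$.

```python
def kogge_stone_add(a, b, n, cin=0):
    """Add two n-bit unsigned integers with a parallel-prefix carry network."""
    # Index 0 is the carry-in slot; indices 1..n are the operand bits.
    g = [cin] + [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    span = 1
    while span <= n:                        # log2 stages, span doubles each time
        new_G, new_P = G[:], P[:]
        for i in range(span, n + 1):
            new_G[i] = G[i] | (P[i] & G[i - span])   # G_{i:j} = G_{i:k} + P_{i:k} G_{(k-1):j}
            new_P[i] = P[i] & P[i - span]            # P_{i:j} = P_{i:k} P_{(k-1):j}
        G, P = new_G, new_P
        span *= 2
    total = 0
    for i in range(1, n + 1):
        total |= (p[i] ^ G[i - 1]) << (i - 1)        # S_i = P_i XOR G_{(i-1):0}
    return total, G[n]


print(kogge_stone_add(200, 100, 8))
```

All carries become available after about $\log_2 n$ stages instead of rippling through $n$ full adders, which is the source of the speed advantage and of the extra area and power.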

Different adders are designed for different bit sizes, and their VLSI design metrics are compared as shown in Table 2. The delay reported is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one constraint among area, power, or speed for digital filter HDL generation. Along with this, an option is created for multiplierless filter design description as well, based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter with combinational elements of choice that are specific to a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{\mathrm{cub}}$ based algorithm is considered for implementing MCM blocks in multiplierless digital filters for a specific user-defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, they are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block will have a tree depth in terms of different components, and this depth is assumed to be infinity in our work. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if reiteration of the design cycle happens due to a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path along with the critical path is computed and communicated back to the processor, where retiming is performed. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device considered for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the synthesis and timing results are shown in Table 3. Offloading the path computations in this way reduces the burden on the main processor.


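Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns, 32 bit | Delay in ns, 16 bit | Delay in ns, 8 bit | Power in mW, 32 bit | Power in mW, 16 bit | Power in mW, 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |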

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Number of slices | Number of LUTs | Number of slice flip-flops | Min period in ns | Setup time in ns | Hold time in ns | Max frequency (MHz) |
|---|---|---|---|---|---|---|---|
| Critical path solver | 5804 | 10462 | 3664 | 9.068 | 1.572 | 6.141 | 110.277 |
| Shortest path solver | 4147 | 7511 | 1496 | 14.089 | 10.477 | 4.114 | 70.978 |

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the general purpose processor based path solvers. The time taken by the FPGA path solvers is compared in Table 4 with the time taken by the algorithms run on a general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, which helps speed up the retiming results.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed, and they are compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count is altered depending on the filter's iteration bound. Here three models are considered: Model1 is the filter without retiming and with adder, subtractor, multiplier, and delay elements; Model2 is the retimed filter based on the clock period minimization algorithm; Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in terms of reducing the critical path, thereby increasing the clock frequency. However, this

Figure 13: Clock period and register count before and after retiming for various digital filter blocks. [Chart over IIR and FIR filters of orders 2 to 12; series: clock period and register count for Model 1, Model 2, and Model 3.]

might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In


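Table 4: Computation time comparison.

| Filter order | Critical path solver, IIR filter, FPGA based (ns) | Critical path solver, IIR filter, GPP based (ms) | Critical path solver, FIR filter, FPGA based (ns) | Critical path solver, FIR filter, GPP based (ms) | Shortest path solver, IIR filter, FPGA based (ns) | Shortest path solver, IIR filter, GPP based (ms) | Shortest path solver, FIR filter, FPGA based (ns) | Shortest path solver, FIR filter, GPP based (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 460 | 138 | 906 | 1283 | 278 | 305 | 305 | 1280 |
| 4 | 1571 | 1578 | 1631 | 1446 | 368 | 1391 | 1391 | 1319 |
| 6 | 2998 | 1918 | 1923 | 1547 | 398 | 1542 | 1542 | 1731 |
| 8 | 3162 | 2190 | 2971 | 1642 | 452 | 2523 | 2523 | 3294 |
| 10 | 3981 | 2627 | 3653 | 1861 | 536 | 4293 | 4293 | 4534 |
| 12 | 4672 | 3142 | 4328 | 2352 | 671 | 5534 | 5534 | 5161 |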

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

| Filter block | Adder/multiplier combination | Before retiming: number of LUTs | Before retiming: max operating freq in MHz | Before retiming: power in mW | After retiming: number of LUTs | After retiming: max operating freq in MHz | After retiming: power in mW |
|---|---|---|---|---|---|---|---|
| IIR-10 | Brent-Kung adder/array multiplier | 2222 | 62526 | 99 | 2411 | 76977 | 89 |
| IIR-10 | Ling adder/Vedic multiplier | 2214 | 69702 | 112 | 2193 | 95381 | 94 |
| IIR-10 | Ripple carry adder/Booth multiplier | 2146 | 50861 | 114 | 1809 | 65248 | 95 |
| FIR-10 | Brent-Kung adder/array multiplier | 1736 | 62526 | 94 | 1811 | 9943 | 85 |
| FIR-10 | Ling adder/Vedic multiplier | 2162 | 72493 | 111 | 2271 | 10072 | 95 |
| FIR-10 | Ripple carry adder/Booth multiplier | 1637 | 52302 | 105 | 1615 | 71345 | 87 |

the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, the user can choose any of the available components for FIR/IIR digital filter HDL generation with specific combinational components.

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here 3 models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;
(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;
(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is retained before and after retiming. The frequency improvement seen for different filters under the above models is given in Figure 14. It is seen that the frequency is improved when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. [Bar chart; horizontal axis: filter type (FIR-2 to FIR-12, IIR-2 to IIR-12); vertical axis: frequency improvement; series: frequency improvement from model 1 to model 2 and from model 1 to model 3.]

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is utilized by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings


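Table 6: Comparison of area, delay, and power for different models of various digital filters.

| Filter block | Adder/multipliers/flip-flops, Model 1 | Adder/multipliers/flip-flops, Model 2 | Adder/multipliers/flip-flops, Model 3 | Delay: max freq in MHz, Model 1 | Delay: max freq in MHz, Model 2 | Delay: max freq in MHz, Model 3 | Power in watts, Model 1 | Power in watts, Model 2 | Power in watts, Model 3 |
|---|---|---|---|---|---|---|---|---|---|
| FIR-2 | 523 | 503 | 504 | 5154 | 19214 | 34062 | 0.056 | 0.063 | 0.065 |
| FIR-4 | 1035 | 1105 | 1108 | 5941 | 10841 | 22204 | 0.047 | 0.057 | 0.060 |
| FIR-6 | 727 | 1707 | 17014 | 6291 | 6764 | 25947 | 0.051 | 0.062 | 0.064 |
| FIR-8 | 1559 | 2209 | 22016 | 5482 | 6592 | 11791 | 0.054 | 0.058 | 0.065 |
| FIR-10 | 18611 | 25011 | 25011 | 4822 | 5637 | 10072 | 0.058 | 0.061 | 0.063 |
| FIR-12 | 20713 | 29013 | 29013 | 4634 | 5486 | 19340 | 0.060 | 0.063 | 0.067 |
| IIR-2 | 943 | 1103 | 1103 | 5503 | 7553 | 8910 | 0.047 | 0.050 | 0.050 |
| IIR-4 | 1675 | 2005 | 1906 | 2278 | 11388 | 15165 | 0.059 | 0.062 | 0.063 |
| IIR-6 | 24117 | 3507 | 3508 | 3871 | 4254 | 53142 | 0.051 | 0.059 | 0.058 |
| IIR-8 | 301110 | 33010 | 3307 | 2946 | 7014 | 11021 | 0.044 | 0.064 | 0.081 |
| IIR-10 | 371613 | 54013 | 54014 | 3643 | 4885 | 95381 | 0.051 | 0.067 | 0.085 |
| IIR-12 | 422017 | 63017 | 63019 | 3973 | 5074 | 10152 | 0.063 | 0.071 | 0.088 |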

Figure 15: Structure of ECG block for power noise removal: (a) block diagram (ECG and Noise sources, Add, Filter, and Scope blocks); (b) filter block expanded.

are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for removing the noise. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using the register sharing and register minimization retiming algorithm approach. Since there are still instances with which multiplierless designs cannot cope, we also proposed the combination of adder and multiplier blocks which can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves designers considerable time and effort in the design cycle. The experimental results indicate that the retiming algorithm efficiency can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and shortest path computations in retiming, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant gain in computation time when compared to filter designs for which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that fits best in an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


Figure 4: 4th-order elliptic filter after register minimization retiming: (a) DFG after retiming (adders 1 to 8 and multipliers 9 to 17); (b) clock period in time units and register count before and after retiming.

n = number of rows in $W$; $D^{(0)} = W$
for k = 1 to n
    for i = 1 to n
        for j = 1 to n
            $d^{(k)}_{ij} = \min\left(d^{(k-1)}_{ij},\; d^{(k-1)}_{ik} + d^{(k-1)}_{kj}\right)$
        end for
    end for
end for
return $D^{(n)}$

Algorithm 1

level synthesis. In all digital filters, the filter coefficients are known beforehand. Hence the full flexibility of the multiplier is not necessary, and we can make use of MCM designs. This method is more efficient than shift-and-add multiplication because intermediate results can be shared, which reduces the area of the multiplierless implementation of digital filters. The sharing of intermediate results provides greater potential area savings as the filter order increases (Figure 5).

Consider the filter coefficient set to be used for the filter design, given by $T = \{c_1, c_2, c_3, \ldots, c_n\}$. We need to find the smallest set $S$ given by $\{a_1, a_2, a_3, \ldots, s_1, s_2, s_3, \ldots\}$, where $a$ (adders/subtractors) and $s$ (shifts) are the elements of $S$, such that the set is made of adders/subtracters, shifters, and A-operations. Shift operations can also be shared across multiple points so that the output set is optimum. Here the $H_{\text{cub}}$ algorithm [8] is used to generate the corresponding DFG for the multiplier block implementing the parallel multiplications $c_1 \ast x, c_2 \ast x, \ldots, c_n \ast x$. The only operations used in the generated DAG and input design matrices are additions, subtractions, shifts, and negations. In this paper, the performance of MCM based filter designs is further improved by combining this approach with retiming. The multiplierless filter circuit is further retimed to reduce the overall clock period, which increases the clock frequency.

Consider $l_1$ and $l_2$ as two integers which specify left shifts, let $r \ge 0$ specify a right shift, and let $s \in \{0, 1\}$ be the sign bit. An A-operation is an operation with two integer inputs $u$ and $v$ and one fundamental output, which is defined as

$$A_p(u, v) = \left| (u \ll l_1) + (-1)^{s} (v \ll l_2) \right| \gg r = \left| 2^{l_1} u + (-1)^{s} 2^{l_2} v \right| 2^{-r}, \quad (8)$$

where $\ll$ is a left binary shift, $\gg$ is a right binary shift, and $p = (l_1, l_2, r, s)$ is the parameter set or the A-configuration of $A_p$. To preserve all significant bits of the output, $2^{r}$ must divide $2^{l_1} u + (-1)^{s} 2^{l_2} v$. The left shifts are limited to the bit width of the target. All A-operations are used to build the A-graph. For a given set of target filter coefficients $C$, we can find the set $S$ such that a multiplierless digital filter is designed.


Figure 5: Example for addressing MCM problem in digital filters: (a) MCM block producing $k_1 Y, \ldots, k_5 Y$ from the input $Y$; (b) shift-and-add realization with $O_1 = 7Y = 8Y - 1Y$ (using $Y \ll 3$), $O_2 = 31Y = 32Y - 1Y$ (using $Y \ll 5$), and $O_3 = 45Y = 31Y + 14Y$ (using $7Y \ll 1$).
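As a rough illustration of Equation (8) and the example of Figure 5, the following Python sketch evaluates an A-operation and rebuilds the constants 7, 31, and 45 from shifts, additions, and subtractions only. The names `a_op` and `mcm_block` are hypothetical and chosen for this example; they are not taken from the authors' tool.

```python
def a_op(u, v, l1, l2, r, s):
    """A-operation of Eq. (8): |(u << l1) + (-1)^s * (v << l2)| >> r."""
    total = (u << l1) + (-1) ** s * (v << l2)
    # Eq. (8) requires 2^r to divide the sum so that no significant bits are lost.
    if total % (1 << r) != 0:
        raise ValueError("2^r must divide the shifted sum")
    return abs(total) >> r


def mcm_block(y):
    """Compute 7*y, 31*y, and 45*y with three adders/subtractors plus shifts."""
    o1 = (y << 3) - y        # O1 = 7Y  = 8Y - 1Y
    o2 = (y << 5) - y        # O2 = 31Y = 32Y - 1Y
    o3 = o2 + (o1 << 1)      # O3 = 45Y = 31Y + 14Y, where 14Y = 7Y << 1
    return o1, o2, o3


# The same constants written as A-operations, starting from the fundamental 1.
c7 = a_op(1, 1, l1=3, l2=0, r=0, s=1)       # |8 - 1|
c31 = a_op(1, 1, l1=5, l2=0, r=0, s=1)      # |32 - 1|
c45 = a_op(c31, c7, l1=0, l2=1, r=0, s=0)   # |31 + 14|
print(c7, c31, c45)
print(mcm_block(3))
```

The term 7Y is computed once and reused for 45Y, which is the sharing of intermediate results that the MCM approach relies on.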

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the maximum iteration period in the filter design [12]. This can be reduced by retiming, where the clock period is reduced and the clock speed increases. To reduce the critical path, we need to find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. While retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects. A designer can program an FPGA to work like specific hardware. FPGAs give great speedup over general purpose processors for many long running algorithms. Hence, for high performance systems, FPGAs become a better choice. In the present work, path solvers are implemented on FPGA to increase the performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum delay path between the output node and the node causing the state change of the output node, with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one of the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. Dedicated FPGA hardware can speed up this process with low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$, which is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subseteq N$ and $O \subseteq N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. The procedure for computing the critical path on FPGA sorts the vertices such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to the previous one, then its path length is the sum of its computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m_i}, \quad (9)$$

where $N$ is the number of adder and multiplier elements in the topologically sorted vertices connected with zero-delay edges. The delay of the circuit is given by $t_d = \max\{\gamma_m\}$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment, the steps below are used for critical path computation:

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found, and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows have the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path, and its computational time is the critical path delay.
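The steps above can be sketched in software as a longest-path search over the zero-delay edges of the DFG. The following Python sketch is only an illustration under stated assumptions, not the FPGA implementation described in this paper: the function name `critical_path`, the edge format `(u, v, delay)`, and the node computation times in the example (1 time unit for an adder, 2 for a multiplier) are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(node_time, edges):
    """Return (delay, path) of the longest zero-delay path in a DFG.

    node_time: {node: computation time}
    edges:     iterable of (u, v, delay) for a directed edge u -> v
    """
    # Keep only zero-delay edges; preds[v] holds the zero-delay predecessors of v.
    preds = {v: set() for v in node_time}
    for u, v, delay in edges:
        if delay == 0:
            preds[v].add(u)

    gamma, parent = {}, {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        best = max(preds[v], key=lambda u: gamma[u], default=None)
        gamma[v] = node_time[v] + (gamma[best] if best is not None else 0)
        parent[v] = best

    end = max(gamma, key=gamma.get)
    path = [end]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return gamma[end], path[::-1]


# Hypothetical 4-node DFG: nodes 1 and 3 are adders, nodes 2 and 4 are multipliers.
node_time = {1: 1, 2: 2, 3: 1, 4: 2}
edges = [(1, 2, 0), (2, 3, 0), (3, 4, 1), (4, 1, 0)]
delay, path = critical_path(node_time, edges)
print(delay, path)
```

In this example the only edge carrying a delay element is 3 -> 4, so the zero-delay path 4 -> 1 -> 2 -> 3 determines the clock period.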

A critical path solver algorithm is designed on FPGA in the present work. The state diagram for the implemented critical path solver is given in Figure 6. In $S_0$, the filter graph or matrix is given as input to the critical path solver module.


Figure 6: Critical path solver state diagram (states $S_0$ to $S_8$; transition labels: incidence and node matrix obtained, weight temp updated, zero-weight path updated, superset of zero-weight path found, all zero path delay found, all path delay found, next greater path delay found, greatest path delay found, loop indices updated).

Algorithm for computing the critical path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of the node and $d$ is the initial delay on edge $E$
Output: critical path $C$
Sort all the vertices topologically in the DFG $G$, with $v$ following $u$ if there is a zero delay edge from $u \rightarrow v$
for all vertices $v$ from the sorted list
    if $v$ has no incoming zero delay edge in $G$ then
        $\gamma_v = t_v$
    else
        $\gamma_v = t_v + \max(\gamma_u)$ over edges $e: u \rightarrow v$ in $G$ with $d_e = 0$
    end if
    $\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_m\}$, where $m$ = number of entries in the topologically sorted list
end for
compute $\max\{\gamma\}$

Algorithm 2

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say −1, can be used to denote an uninitialized matrix element. In state $S_1$, all the zero-weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called zero_weight_path. The zero_weight_path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column has the destination node of the directed zero-weight edge. Simultaneously, a count of the number of zero-weight edges is kept.

The state $S_2$ is provided to enable looping action and for updating all the signals. In state $S_3$, in each row of the zero_weight_path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node in any zero-weight path is the same as the source node in another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then we make the destination node in $b$ the destination node of $a$. The state $S_4$ is provided to enable looping action and for updating all the signals. At the end of this state, the zero_weight_path matrix will contain only those paths that are supersets of the remaining zero-weight paths. In state $S_5$, the module calculates the sum of all the node weights along each of these paths. State $S_6$ is provided for looping action and for updating all signals.

In state $S_7$, the path with the highest sum of node weights is found, which is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical


Algorithm for computing the shortest path
Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of the node and $d$ is the initial delay on edge $E$
Output: all-pair shortest path matrix $M$
for i = 1 to N
    for j = 1 to N
        if i = j then
            M[i, j] = (0, 0)
        else
            M[i, j] = inf
        end if
    end for
end for
for all the edges $e: u \rightarrow v$
    M[u, v] = d for edge e
end for
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
            end if
        end for
    end for
end for
Output the shortest path matrix M

Algorithm 3
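For reference, Algorithm 3 can be written as a short software routine. The Python sketch below is a plain processor-side version of the Floyd-Warshall recurrence, not the HDL path solver of this work; the function name `floyd_warshall`, the 0-based node numbering, and the example edge list are assumptions made for illustration.

```python
import math


def floyd_warshall(n, edges):
    """All-pair shortest path matrix for nodes 0..n-1.

    edges: iterable of (u, v, d), a directed edge u -> v with weight d.
    """
    # Distance from a node to itself is 0; unknown distances start at infinity.
    m = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, d in edges:
        m[u][v] = min(m[u][v], d)

    # Allow paths through intermediate node k, one k at a time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][j] > m[i][k] + m[k][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m


# Hypothetical 3-node graph whose edge weights play the role of delays.
spm = floyd_warshall(3, [(0, 1, 1), (1, 2, 0), (2, 0, 2), (0, 2, 3)])
for row in spm:
    print(row)
```

Software can use `math.inf` directly, whereas the HDL version described earlier needs a sentinel such as 255 for infinity.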

Figure 7: Zero path delays and critical path for 4th-order low pass elliptic filter (x-axis: vertex number in filter DFG; y-axis: zero delay path for filter DFG).

path matrix. The signals in this matrix are output as the critical path. The state $S_8$ is provided to enable looping action and for updating all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve a system of inequalities. It is seen that the time needed to compute the critical path on FPGA is lower than that on a general purpose processor. This also reduces the retiming computation time. The zero delay paths are computed for the 4th-order elliptic filter and shown in Figure 7. The highlighted path delay is from 1 → 9 → 5 → 14 → 8, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$, and let $T(u, v)$ be the total computation time of the zero delay path from $u$ to $v$. We check the condition $T(u, v) - \min\{t(u), t(v)\} >$ derived clock period and select those paths to retime, so that the computation time in these paths can be reduced. We have to retime the edges by constructing a system of linear inequalities. This can be done using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all-pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden on the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$, and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight_temp. In state $S_2$, the signal weight_temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$, and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight_temp. In the state $S_4$, the elements of the signal matrix weight_temp are copied to the output matrix.


Figure 8: Shortest path solver state diagram (transition labels: incidence matrix copied to weight temp, weight temp updated, pairwise shortest path found, all pair shortest paths found, weight temp copied to output matrix, weight temp complete array to output matrix, loop indices updated).

The state $S_5$ enables looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find the ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small and whose A-operations are composed of a minimum number of addition, subtraction, and shift operations. After this target set is obtained, multiplierless multiple constant multiplication filters can be designed with this target set. Multiple constant multiplication (MCM) is an efficient way of implementing


Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature, and the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and A-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{\text{cub}}$ algorithm is used for $G_i$ computation. The input to the $H_{\text{cub}}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{s \mid \mathrm{dist}(R, s) = 1\} = A_s(R, R). \quad (11)$$

For the target set of constants $T$ of the considered filter graph $G$, the $H_{\text{cub}}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$. This can happen when all the targets are more than one A-operation away. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized. If the entire target set is synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit,

(ii) cumulative benefit.

To build the heuristic, we can define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \mathrm{dist}(R, t) - \mathrm{dist}(R + s, t). \quad (12)$$

A successor $s \in S$ needs to be picked which is closest to the target set to minimize the cost. This is possible if we can compute or estimate the A-distance. It is useful to also take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we can use a weighted benefit function given as

$$B_b(R, s, t) = 10^{-\mathrm{dist}(R + s, t)} \left(\mathrm{dist}(R, t) - \mathrm{dist}(R + s, t)\right), \quad (13)$$

where $10^{-\mathrm{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{\text{cub}}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \quad (14)$$
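To make Equations (11) to (14) concrete, the following Python sketch picks the next fundamental by cumulative weighted benefit. It is a simplified illustration and not the $H_{\text{cub}}$ implementation used in this work: `dist_estimate` is a deliberately crude stand-in for the A-distance (exact for distances 0 and 1, assumed to be 2 otherwise), left shifts are bounded by an assumed `max_shift`, and all function names are hypothetical.

```python
from itertools import product


def odd_part(x):
    """Strip factors of two, since shifts are free in a multiplierless design."""
    x = abs(x)
    while x and x % 2 == 0:
        x //= 2
    return x


def successors(ready, max_shift=6):
    """Eq. (11): odd constants exactly one A-operation away from `ready`."""
    found = set()
    for u, v in product(ready, repeat=2):
        for shift in range(max_shift + 1):
            for sign in (1, -1):
                c = odd_part((u << shift) + sign * v)
                if c and c not in ready:
                    found.add(c)
    return found


def dist_estimate(ready, t):
    """Assumed stand-in for dist(R, t): 0, 1, or a flat guess of 2."""
    if t in ready:
        return 0
    return 1 if t in successors(ready) else 2


def weighted_benefit(ready, s, t):
    """Eq. (13): 10^(-dist(R+s, t)) * (dist(R, t) - dist(R+s, t))."""
    d_after = dist_estimate(ready | {s}, t)
    return 10.0 ** (-d_after) * (dist_estimate(ready, t) - d_after)


def h_cub(ready, succ, targets):
    """Eq. (14): successor with the largest cumulative weighted benefit."""
    return max(sorted(succ),
               key=lambda s: sum(weighted_benefit(ready, s, t) for t in targets))


ready = {1}
succ = successors(ready)
choice = h_cub(ready, succ, targets={7, 45})
print(sorted(succ), choice)
```

With the ready set {1} and targets {7, 45}, the target 7 is already in the successor set, so it receives the full weight of Equation (13) and is selected ahead of successors that only bring 45 one step closer.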

Here the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this method, the target set is calculated. With this target set, a filter graph which is multiplierless MCM based can be designed. It is found that multiplierless designs reduce the combinational path delays due to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combination delay and critical path


+

++++

+

minus

minus

+

++

+

++

++

+

minus

+ ++

+

++

+

++

+

+

minus

+minus

Zminus1 Zminus1

Zminus1

Zminus1

Zminus1

Zminus1

Zminus1

Zminus1

Qy = Qu ≪ 1

Ey = Eu

Ey = Eu

Qy = Qu ≪ 1

Ey = Eu

Qy = Qu ≪ 2

Ey = Eu

Qy = Qu ≪ 1

Ey = Eu

Qy = Qu ≪ 5

Ey = Eu

Qy = Qu ≪ 5

Ey = Eu

Qy = Qu ≪ 5

Ey = Eu

Qy = Qu ≪ 5

Ey = Eu

Qy = Qu ≪ 5

Ey = Eu

Qy = Qu ≪ 6

Ey = Eu

Qy = Qu ≪ 3

Input Output1

1

minusu

minusu

Vy = Vu lowast 21

Vy = Vu lowast 21

Vy = Vu lowast 22

Vy = Vu lowast 21

Vy = Vu lowast 25

Vy = Vu lowast 25

Vy = Vu lowast 25

Vy = Vu lowast 25

Vy = Vu lowast 25

Vy = Vu lowast 26

Vy = Vu lowast 23

Figure 10 Multiplierless MCM based 4th-order elliptic filter

without changing the functionality which further increasesthe clock speedThe 4th-order lattice filter withmultiplierlessMCM concept using119867cub algorithm is shown in Figure 10 Itis seen from the synthesis that combination delay is reducedIt is further retimed either for clock period minimization orregister minimization This requires solving a set of linearinequalities with a computation complexity of119874(1198993) where 119899is the number of nodes using the Floyd-Warshall algorithmwhere 119899 is the number of nodes [8] The clock periodminimization and register minimization retiming algorithmsare designed and implemented with FPGA based path solverswhich reduces computation timewhen compared to previousmethods [8 16] to design multiplierless digital filters

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also provides a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in the multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
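The retiming step described above (difference constraints solved by an all-pairs shortest path, followed by recomputing edge weights) can be sketched as follows. This is a minimal illustration and not the FPGA path solver of this work: the function names and the tiny two-node example are assumptions, the constraints are taken in the standard form $r(U) - r(V) \le c$, and the retimed weight of an edge $U \to V$ is taken as $w_r(e) = w(e) + r(V) - r(U)$.

```python
INF = float("inf")


def floyd_warshall(n, edges):
    """All-pairs shortest paths on n nodes; edges are (u, v, weight) triples."""
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def solve_difference_constraints(n, constraints):
    """Solve r[u] - r[v] <= c for every (u, v, c); return None if infeasible."""
    src = n  # extra source node connected to every node with weight 0
    edges = [(v, u, c) for u, v, c in constraints]
    edges += [(src, i, 0) for i in range(n)]
    d = floyd_warshall(n + 1, edges)
    if any(d[i][i] < 0 for i in range(n + 1)):
        return None  # negative cycle: no retiming satisfies the constraints
    return [d[src][i] for i in range(n)]


def retime_edges(edges, r):
    """Retimed weight of each edge u -> v: w + r[v] - r[u]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


if __name__ == "__main__":
    dfg = [(0, 1, 0), (1, 0, 2)]                 # two-node loop with 2 delays
    constraints = [(0, 1, 0), (1, 0, 2)]         # feasibility constraints
    constraints.append((0, 1, -1))               # one critical path constraint
    r = solve_difference_constraints(2, constraints)
    print(r, retime_edges(dfg, r))
```

In the example the loop keeps its two delays in total, but one of them moves onto the previously delay-free edge, which is the effect retiming relies on.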

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.

Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of the 4th-order elliptic filter.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which was designed as part of this research work. Initially, the design of filters is performed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user inputs the HDL file that was automatically generated after retiming, for further component optimization. Using a drop-down menu, the user can choose the adders and multipliers for the retimed digital filter according to the design requirements. The original HDL is automatically modified with respect to the chosen components; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the chosen adders and multipliers. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are described below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing, and innumerable schemes have been proposed for its realization. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = A \times B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage the partial products $P_i$ are generated and added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \times n$ AND gates, $n$ half adders, and $(m-2) \times n$ full adders.
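The double sum in (15) can be checked with a short bit-level sketch. The function name and the unsigned interpretation of the operands are assumptions made for this illustration; each term $A_i B_j$ corresponds to one AND gate of the array.

```python
def array_multiply(a, b, n):
    """Unsigned n-bit multiply by summing the partial-product bits A_i * B_j * 2^(i+j)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    product = 0
    for j in range(n):            # one row of the array per multiplier bit
        b_j = (b >> j) & 1
        for i in range(n):        # one AND gate per (i, j) pair
            a_i = (a >> i) & 1
            product += (a_i & b_j) << (i + j)
    return product


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))
```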

Radix-4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
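The halving of the partial products can be illustrated with a small behavioural sketch of radix-4 Booth recoding. It is not the encoder hardware of this work: the function names are ours, and the multiplier is assumed to be an $n$-bit two's complement number with $n$ even. Each overlapping triplet $(b_{2i+1}, b_{2i}, b_{2i-1})$ is recoded to a digit in $\{-2, -1, 0, 1, 2\}$.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's complement multiplier into n/2 radix-4 Booth digits."""
    if n % 2:
        raise ValueError("n must be even")
    b &= (1 << n) - 1

    def bit(k):
        # Bit k of the multiplier; the bit to the right of bit 0 is taken as 0.
        return 0 if k < 0 else (b >> k) & 1

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2)]


def booth_multiply(a, b, n):
    """Multiply a by the n-bit signed value b using n/2 Booth partial products."""
    digits = booth_radix4_digits(b, n)
    return sum((d * a) << (2 * i) for i, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_radix4_digits(11, 8), booth_multiply(13, 11, 8))
```

For a 32-bit multiplier the recoding returns sixteen digits, which matches the sixteen partial products mentioned above.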

Vedic Multiplier. It is used for faster multiplication of operands with a large number of bits. It has less combinational path delay [21] than the other multipliers when the bit size is large. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertical & crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Owing to its regular structure, its layout can be generated easily.

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading $n$ full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage.
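The cascade of full adders can be sketched behaviourally as follows; the function names are ours and the model ignores gate delays, so it only illustrates how the carry ripples from stage to stage.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(a, b, n, cin=0):
    """n-bit ripple-carry addition: the carry of stage i feeds stage i + 1."""
    carry = cin
    total = 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(ripple_add(13, 11, 4))
```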

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using the Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


Table 1: Comparison of multipliers for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
multiplier   32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Array        761     399     21       21      11      7        1519    375     91
Booth        861     2799    149      25      15      12       1277    317     77
Vedic        707     3902    244      28      18      12       2378    565     126

The basic equations used in parallel prefix adders are given below. The equations of the bit generate and propagate are

$$\begin{aligned} G_{0:0} &= G_0 = c_{\text{in}}, \\ P_{0:0} &= P_0 = 0, \\ G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\ P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}. \end{aligned} \tag{16}$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \tag{17}$$

The different adders are designed for different bit sizes, and their VLSI design metrics are compared as shown in Table 2. The delay reported is the combinational path delay after synthesis, measured in ns.
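The group generate and propagate recurrences of (16) and the sum rule of (17) can be illustrated with a behavioural prefix adder. The sketch below uses a Kogge-Stone style doubling schedule purely as an example of a prefix network; it is not the Brent-Kung or Ling implementation compared in Table 2, and the function name and the index convention (slot 0 holds the carry-in) are assumptions.

```python
def prefix_add(a, b, n, cin=0):
    """n-bit addition with a parallel-prefix carry network; returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # bit propagate
    # Slot 0 models the carry-in (generate = cin, propagate = 0);
    # slot k >= 1 holds the signals of bit k - 1.
    G = [cin] + g
    P = [0] + p
    d = 1
    while d < n + 1:
        G_next, P_next = G[:], P[:]
        for i in range(d, n + 1):
            # Combine the group ending at slot i with the group ending at slot i - d.
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
    # G[i] is now the carry into bit i, so the sum bit is p_i XOR carry-in of bit i.
    total = 0
    for i in range(n):
        total |= (p[i] ^ G[i]) << i
    return total, G[n]


if __name__ == "__main__":
    print(prefix_add(13, 11, 4))
```

The number of combining levels grows with $\log_2 n$ rather than with $n$, which is where the speed advantage over the ripple adder comes from.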

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint chosen by the user is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of area, power, or speed as the constraint for digital filter HDL generation. Along with this, an option is provided for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{cub}$ based algorithm is considered for implementing MCM blocks in multiplierless digital filters for the specific user defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, they are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of different components, and this depth is assumed to be infinity in our work. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle must be reiterated because of a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor on which the retiming algorithm is implemented. The computations are performed in the FPGA based block; the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device considered for the path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.


Table 2: Comparison of adders for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
adder        32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Ling         8854    1524    2021     6       9       18       23      53      107
Brent-Kung   104     1839    2583     4       6       9        15      30      63
Ripple       1212    2063    376      2       7       14       9       18      36

Table 3: Device utilization and timing summary of path solvers.

                       Device utilization summary                 Timing summary
Path solver name       Number of  Number of  Number of slice     Min period  Setup time  Hold time   Max frequency
                       slices     LUTs       flip-flops          in ns       in ns       in ns       (Hz)
Critical path solver   5804       10462      3664                9068        1572        6141        110277
Shortest path solver   4147       7511       1496                14089       10477       4114        70978

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in much less time than the general purpose processor based path solvers. The time taken by the FPGA path solvers is compared in Table 4 with the time taken by the algorithms run on a general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, which helps to speed up the retiming results.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed, and they are compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the filter's iteration bound. Here three models are considered: Model1 is the filter without retiming, built from adder, subtractor, multiplier, and delay elements; Model2 is the retimed filter based on the clock period minimization algorithm; Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the cost of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks. [Bar chart over IIR and FIR filters of orders 2 to 12; series: clock period and register count for Models 1, 2, and 3.]

Table 4: Computation time comparison.

         Critical path solver algorithm            Shortest path solver algorithm
         IIR filter          FIR filter            IIR filter          FIR filter
Filter   FPGA      GPP       FPGA      GPP         FPGA      GPP       FPGA      GPP
order    (ns)      (ms)      (ns)      (ms)        (ns)      (ms)      (ns)      (ms)
2        460       138       906       1283        278       305       305       1280
4        1571      1578      1631      1446        368       1391      1391      1319
6        2998      1918      1923      1547        398       1542      1542      1731
8        3162      2190      2971      1642        452       2523      2523      3294
10       3981      2627      3653      1861        536       4293      4293      4534
12       4672      3142      4328      2352        671       5534      5534      5161

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, any of the available components can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

                                                    Before retiming                        After retiming
Filter   Adder/multiplier                           Number   Max operating   Power        Number   Max operating   Power
block    combination                                of LUTs  freq in MHz     in mW        of LUTs  freq in MHz     in mW
IIR-10   Brent-Kung adder, array multiplier         2222     62526           99           2411     76977           89
IIR-10   Ling adder, Vedic multiplier               2214     69702           112          2193     95381           94
IIR-10   Ripple carry adder, Booth multiplier       2146     50861           114          1809     65248           95
FIR-10   Brent-Kung adder, array multiplier         1736     62526           94           1811     9943            85
FIR-10   Ling adder, Vedic multiplier               2162     72493           111          2271     10072           95
FIR-10   Ripple carry adder, Booth multiplier       1637     52302           105          1615     71345           87

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here 3 models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is the same before and after retiming. The frequency improvement seen for different filters with the above models is given in Figure 14. It is seen that the frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. [Bar chart of frequency improvement for FIR and IIR filters of orders 2 to 12; series: frequency improvement from Model 1 to Model 2 and from Model 1 to Model 3.]

Table 6: Comparison of area, delay, and power for different models of various digital filters.

         Adders, multipliers, flip-flops   Delay (max freq in MHz)         Power in Watts
Filter
block    Model 1   Model 2   Model 3       Model 1   Model 2   Model 3     Model 1   Model 2   Model 3
FIR-2    523       503       504           5154      19214     34062       0056      0063      0065
FIR-4    1035      1105      1108          5941      10841     22204       0047      0057      0060
FIR-6    727       1707      17014         6291      6764      25947       0051      0062      0064
FIR-8    1559      2209      22016         5482      6592      11791       0054      0058      0065
FIR-10   18611     25011     25011         4822      5637      10072       0058      0061      0063
FIR-12   20713     29013     29013         4634      5486      19340       0060      0063      0067
IIR-2    943       1103      1103          5503      7553      8910        0047      0050      0050
IIR-4    1675      2005      1906          2278      11388     15165       0059      0062      0063
IIR-6    24117     3507      3508          3871      4254      53142       0051      0059      0058
IIR-8    301110    33010     3307          2946      7014      11021       0044      0064      0081
IIR-10   371613    54013     54014         3643      4885      95381       0051      0067      0085
IIR-12   422017    63017     63019         3973      5074      10152       0063      0071      0088

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for the interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for removing the noise. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested with an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Figure 15: Structure of the ECG block for power noise removal: (a) block diagram; (b) filter block expanded.
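As a rough illustration of power-line noise removal, the sketch below applies a moving-average FIR filter whose length equals one period of the interference, so that the 50 Hz component is nulled. This is only a stand-in chosen for brevity: it is not the retimed multiplierless MCM filter of this work, and the sampling rate of 500 Hz, the tap count, and the function name are assumptions.

```python
import math


def moving_average(x, taps):
    """FIR filter with `taps` equal coefficients 1/taps (zero initial state)."""
    y = []
    for n in range(len(x)):
        window = x[max(0, n - taps + 1):n + 1]
        y.append(sum(window) / taps)
    return y


if __name__ == "__main__":
    fs = 500            # assumed sampling rate in Hz
    f_hum = 50          # power-line frequency in Hz
    taps = fs // f_hum  # 10 taps span exactly one period of the hum
    hum = [math.sin(2 * math.pi * f_hum * n / fs) for n in range(200)]
    filtered = moving_average(hum, taps)
    # After the first taps - 1 samples, every window covers one full period of the hum.
    print(max(abs(v) for v in filtered[taps - 1:]))
```

Such a filter also attenuates the harmonics of 50 Hz and low-pass filters the ECG itself, so a practical design would use a filter with a narrower stop band.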

6. Conclusions

In this paper we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design and that are applicable to a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. The tool directly gives the synthesizable filter RTL, which saves much of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and shortest path in the retiming computation, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant gain in computation time when compared to filter designs in which the path solver algorithms are implemented in the processor as part of the retiming algorithm. It is observed that a designer can find the synthesizable digital filter RTL that fits best in an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


Figure 5: Example for addressing the MCM problem in digital filters. (a) MCM block producing $k_1 Y, k_2 Y, \ldots, k_5 Y$ from the input $Y$. (b) Shift-and-add realization: $O_1 = 7Y = 8Y - 1Y$ (using $Y \ll 3$), $O_2 = 31Y = 32Y - 1Y$ (using $Y \ll 5$), and $O_3 = 45Y = 31Y + 14Y$ (with $14Y = 7Y \ll 1$).
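The shift-and-add structure of Figure 5(b) can be written out directly. The sketch below is a behavioural illustration with a function name of our choosing; it uses integer inputs and shows how the intermediate result $7Y$ is shared between two outputs.

```python
def mcm_7_31_45(y):
    """Compute 7*y, 31*y, and 45*y using only shifts, additions, and subtractions."""
    o1 = (y << 3) - y      # 7y  = 8y  - y
    o2 = (y << 5) - y      # 31y = 32y - y
    y14 = o1 << 1          # 14y = 2 * 7y, reusing the intermediate result o1
    o3 = o2 + y14          # 45y = 31y + 14y
    return o1, o2, o3


if __name__ == "__main__":
    print(mcm_7_31_45(3))
```

Three constant multiplications are obtained with three adders/subtractors and three shifts, and no multiplier is needed.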

3. Design and Analysis

Each DSP filter block is associated with a critical path, which limits the maximum iteration period in the filter design [12]. This can be improved by retiming, which reduces the clock period and so increases the clock speed. To reduce the critical path, we need to find the original critical path of the circuit using a critical path solving algorithm and then apply the retiming transformation to the digital filter. During retiming, a shortest path algorithm is required for solving the system of inequalities. FPGAs are sets of configurable logic blocks with configurable interconnects, which the designer can program to work like specific hardware. They give a large speedup over general purpose processors for many long-running algorithms. Hence FPGAs are a better choice for high performance systems. In the present work, the path solvers are implemented on FPGA to increase the performance.

3.1. Critical Path Solver Algorithm Design and Analysis. The critical path is defined as the maximum delay path between the output node and the node causing the state change of the output node with zero delay. The significance of the critical path is that it determines the operating frequency of the design. In retiming, which is one among the steps in high level synthesis, it is imperative that we find the critical path [17] in real time. Dedicated FPGA hardware can speed up this process at low power. Consider $\alpha$ = number of adder elements and $\beta$ = number of multiplier elements in the considered digital filter. Let $N = \{n_1, n_2, \ldots, n_i\}$, where $i = \alpha + \beta$ is the maximum number of combinational adder and multiplier elements. Consider $O = \{o_1, o_2, \ldots, o_j\}$, where $O$ is the set of output nodes in the filter circuit, and $I = \{i_1, i_2, \ldots, i_k\}$, where $I$ is the set of input nodes in the filter circuit, such that $I \subset N$ and $O \subset N$. The critical path of the circuit is defined in terms of $\gamma_{n_1}$, which is the delay of an individual combinational block. In this procedure of computing the critical path on FPGA, the vertices are sorted such that vertices occurring early in the list are connected to vertices later in the list by edges having zero delays. While sorting, if a vertex is connected to a previous one, then its path length is the sum of its computation time and the computation times of all the vertices found in the path; otherwise the path length of the node is equal to its own computation time. We need this for constructing the retimed graph as well as for verifying the retimed graph result. The equation of the critical path is

$$\gamma_m = \sum_{i=1}^{N} t_{m1}, \tag{9}$$

where $N$ is the sum of adder and multiplier elements in the topologically sorted vertices connected with zero delay edges. The delay of the circuit is given by $t_d = \max\{\gamma_m\}$, where $t_d$ is the delay of the critical path. Algorithm 2 shows the critical path formulation. In the considered optimization environment, the steps below are used for critical path computation.

(i) The filter network graph is considered as input to the critical path solver algorithm.

(ii) All the zero-weight edges in the network graph are found and a matrix of their source and destination nodes is formed.

(iii) For each row in the above matrix, if the destination node of any zero-weight edge path is the same as the source node of another zero-weight edge path, the two paths are joined. This step is repeated to obtain a matrix whose rows hold the nodes of all the possible zero-weight edge paths in the graph.

(iv) The computational time of each zero-weight edge path from this matrix is calculated.

(v) The zero-weight edge path with the greatest computational time is found. This is the critical path, and its computational time is the critical path delay.
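Steps (i) to (v) can be sketched in software as follows. This is a minimal Python illustration of the path-joining idea, not the HDL implementation described in the paper; the function name, the edge-list input format, and the example graph are assumptions made here for illustration.

```python
def critical_path(edges, t):
    """Return (path, delay) of the longest zero-weight edge path.

    edges: list of (source, destination, delay) tuples describing the DFG.
    t:     dict mapping each node to its computation time.
    """
    # Step (ii): collect the zero-weight edges as (source, destination) rows.
    zero = [(u, v) for u, v, d in edges if d == 0]
    # A node with no zero-weight edge forms a path on its own.
    best = max(([n] for n in t), key=lambda p: t[p[0]])
    # Step (iii): grow each path by joining edges whose source equals the path's end.
    stack = [[u, v] for u, v in zero]
    while stack:
        path = stack.pop()
        # Steps (iv) and (v): keep the path with the greatest computation time.
        if sum(t[n] for n in path) > sum(t[n] for n in best):
            best = path
        for u, v in zero:
            if u == path[-1]:
                if v in path:
                    raise ValueError("zero-delay cycle in the graph")
                stack.append(path + [v])
    return best, sum(t[n] for n in best)


if __name__ == "__main__":
    # Hypothetical graph: nodes 1 and 2 are adders (1 time unit),
    # nodes 3 and 4 are multipliers (2 time units).
    t = {1: 1, 2: 1, 3: 2, 4: 2}
    edges = [(1, 2, 1), (2, 3, 1), (2, 4, 2), (3, 1, 0), (4, 1, 0)]
    print(critical_path(edges, t))
```

The sketch enumerates every zero-weight path explicitly, which mirrors the matrix-of-paths description above but can grow quickly on dense graphs.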

A critical path solver algorithm is designed on FPGA in the present work. The state diagram for the implemented critical path solver is given in Figure 6. In $S_0$, the filter graph or matrix is given as input to the critical path solver module.

Figure 6: Critical path solver state diagram (states S0 to S8).

Algorithm 2: Computing the critical path.

Input: a DFG G = (V, E, t, d), where t is the computation time of a node and d is the initial delay on an edge in E
Output: critical path C
Sort all the vertices of G topologically, with v following u if there is a zero delay edge u → v
for all vertices v from the sorted list
    if every edge entering v in G has nonzero delay then
        γ_v = t_v
    else
        γ_v = t_v + max(γ_u) over the edges e: u → v in G with d_e = 0
    end if
    γ = {γ_1, γ_2, ..., γ_m}, where m = number of entries in the topologically sorted list
end for
compute the critical path delay as max{γ}
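Algorithm 2 maps naturally onto a topological traversal of the zero-delay subgraph. The sketch below is one possible Python rendering, assuming Python 3.9 or later for the standard-library `graphlib` module; the function name and the input format are hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path_delay(t, edges):
    """Return (delay, gamma) where gamma[v] is the longest zero-delay path ending at v.

    t:     dict mapping each node to its computation time.
    edges: list of (u, v, d) tuples, d being the delay on edge u -> v.
    """
    # Keep only zero-delay edges: preds[v] holds every u with a zero-delay edge u -> v.
    preds = {v: set() for v in t}
    for u, v, d in edges:
        if d == 0:
            preds[v].add(u)
    gamma = {}
    # static_order() yields each node after all of its zero-delay predecessors
    # and raises graphlib.CycleError if the zero-delay subgraph has a cycle.
    for v in TopologicalSorter(preds).static_order():
        gamma[v] = t[v] + max((gamma[u] for u in preds[v]), default=0)
    return max(gamma.values()), gamma


if __name__ == "__main__":
    # Hypothetical graph: adders 1 and 2 take 1 time unit, multipliers 3 and 4 take 2.
    t = {1: 1, 2: 1, 3: 2, 4: 2}
    edges = [(1, 2, 1), (2, 3, 1), (2, 4, 2), (3, 1, 0), (4, 1, 0)]
    print(critical_path_delay(t, edges))
```

Each node is visited once and each zero-delay edge is examined once, so the running time is linear in the size of the graph.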

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say −1, can be used to denote an uninitialized matrix element. In state $S_1$, all the zero-weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called zero weight path. The zero weight path matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column has the destination node of the directed zero-weight edge. Simultaneously, a count of the number of zero-weight edges is kept.

The state $S_2$ is provided to enable looping action and for updating all the signals. In state $S_3$, in each row of the zero weight path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node in any zero-weight path is the same as the source node in another zero-weight path, the two paths are concatenated; that is, if the destination node in path $a$ is the source node in path $b$, then we make the destination node in $b$ the destination node of $a$. The state $S_4$ is provided to enable looping action and for updating all the signals. At the end of this state, the zero weight path matrix contains only those paths that are supersets of the remaining zero-weight paths. In state $S_5$, the module calculates the sum of all the node weights along each of these paths. State $S_6$ is provided for looping action and for updating all signals.

In state $S_7$, the path with the highest sum of node weights is found, which is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical


Algorithm 3: Computing the shortest path.

Input: a DFG G = (V, E, t, d), where t is the computation time of a node and d is the initial delay on an edge in E
Output: all pair shortest path matrix M
for i = 1 to N
    for j = 1 to N
        if i = j then
            M[i, j] = 0
        else
            M[i, j] = inf
    end for
end for
for all the edges e: u → v, M[u, v] = d for edge e
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
        end for
    end for
end for
Output the shortest path matrix M
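For reference, Algorithm 3 can be written as a short software routine. The Python sketch below assumes nodes numbered from 0 and uses `math.inf` for unreachable pairs (the HDL version uses a large constant such as 255 instead); the function name is hypothetical.

```python
import math


def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on a graph with nodes 0..n-1 and edges given as (u, v, d)."""
    # Initialise: zero on the diagonal, infinity elsewhere.
    m = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    # Load the edge delays, keeping the smallest delay for parallel edges.
    for u, v, d in edges:
        m[u][v] = min(m[u][v], d)
    # Relax every pair (i, j) through every intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][j] > m[i][k] + m[k][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m


if __name__ == "__main__":
    for row in all_pairs_shortest_paths(3, [(0, 1, 1), (1, 2, 2), (0, 2, 5)]):
        print(row)
```

The three nested loops give the $O(n^3)$ cost that motivates moving this computation to dedicated hardware.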

Figure 7: Zero path delays and critical path for 4th-order low pass elliptic filter (horizontal axis: vertex number in filter DFG; vertical axis: zero delay path for filter DFG).

path matrix. The signals in this matrix are output as the critical path. The state $S_8$ is provided to enable looping action and for updating all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve the system of inequalities. It is seen that the time needed to compute the critical path on FPGA is lower than the computation time on a general purpose processor. This also reduces the retiming computation time. The zero delay paths computed for the 4th-order elliptic filter are shown in Figure 7. The highlighted path delay is from $1 \to 9 \to 5 \to 14 \to 8$, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$ and let $T(u, v)$ be the total computation time of the zero delay path from $u$ to $v$. If the condition $T(u, v) - \min\{t(u), t(v)\} > \textit{derived clock period}$ holds, those paths are selected for retiming so that the computation time in the path can be reduced. We have to retime the edges by constructing a system of linear inequalities. This can be done using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden on the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$, and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal weight temp. In state $S_2$, the signal weight temp is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$, and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal weight temp. In state $S_4$, the elements of the signal matrix weight temp are copied to the output matrix.

Figure 8: Shortest path solver state diagram.

The state $S_5$ enables looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find a ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small, together with the A-operations composed of a minimum number of addition, subtraction, and shift operations. After this target set is obtained, multiplierless multiple constant multiplication filters can be designed with it. Multiple constant multiplication (MCM) is an efficient way of implementing

Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature, and the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and A-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{\mathrm{cub}}$ algorithm is used for $G_i$ computation. The input to the $H_{\mathrm{cub}}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{s \mid \operatorname{dist}(R, s) = 1\} = A_s(R, R). \tag{11}$$

For the target set of constants $T$ of the considered filter graph $G$, the $H_{\mathrm{cub}}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subset R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$, which happens when all the targets are more than one A-operation away. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized. If all the targets are synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit,

(ii) cumulative benefit.

To build the heuristic, we can define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \tag{12}$$

A successor $s \in S$ needs to be picked which is closest to the target set to minimize the cost. This is possible if we can compute or estimate the A-distance. It is useful to also take into account the current estimate of the distance between $R$ and $T$. Thus, to build the heuristic, we must first define the benefit function $B(R, s, t)$ to quantify to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we can use the weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left(\operatorname{dist}(R, t) - \operatorname{dist}(R + s, t)\right), \tag{13}$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{\mathrm{cub}}(R, S, T) = \arg\max_{s \in S} \left[\sum_{t \in T} B_b(R, s, t)\right]. \tag{14}$$
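The cumulative benefit selection can be illustrated with a small numerical sketch. The Python code below assumes the distance estimates are already available as lookup tables and that the weight factor decays with distance, as the text describes; the function names and all distance values are placeholders invented for illustration, not A-distances computed for real constants.

```python
def weighted_benefit(dist_before: int, dist_after: int) -> float:
    """Weighted benefit: the raw benefit scaled by a factor that decays with distance."""
    return 10.0 ** (-dist_after) * (dist_before - dist_after)


def pick_successor(dist_r, dist_rs):
    """Return the successor with the largest cumulative weighted benefit.

    dist_r[t]:     estimated dist(R, t) for each target t.
    dist_rs[s][t]: estimated dist(R + s, t) for each candidate successor s.
    """
    def cumulative(s):
        return sum(weighted_benefit(dist_r[t], dist_rs[s][t]) for t in dist_r)

    return max(dist_rs, key=cumulative)


if __name__ == "__main__":
    # Placeholder estimates for two targets and two candidate successors.
    dist_r = {"t1": 2, "t2": 3}
    dist_rs = {
        "s1": {"t1": 1, "t2": 3},  # helps t1 only
        "s2": {"t1": 1, "t2": 2},  # helps both targets
    }
    print(pick_successor(dist_r, dist_rs))
```

Because the benefits are summed over all targets, a successor that brings several targets closer is preferred over one that helps a single target by the same amount.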

Here the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this method the target set is calculated, and with this target set a multiplierless MCM based filter graph can be designed. It is found that multiplierless designs reduce the combinational path delays due to sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combinational delay and critical path

Figure 10: Multiplierless MCM based 4th-order elliptic filter.

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{\mathrm{cub}}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. It is further retimed either for clock period minimization or register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces the computation time when compared to previous methods [8, 16] for designing multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also presents a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight for each edge in the retimed DFG can be calculated using the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter has increased by 196 in the multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
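The last step of the retiming procedure, computing the retimed edge weights from the retiming values, can be sketched as follows. The code assumes the usual retiming relation $w_r(e) = w(e) + r(V) - r(U)$ for an edge $e: U \to V$, which the text describes only in words; the function name, the example graph, and the retiming values are hypothetical.

```python
def retime(edges, r):
    """Apply retiming values r to a DFG given as a list of (u, v, w) edges.

    Each retimed weight is w + r[v] - r[u]; a negative result means the
    retiming values violate a feasibility constraint.
    """
    retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
    if any(w < 0 for _, _, w in retimed):
        raise ValueError("infeasible retiming: negative edge weight")
    return retimed


if __name__ == "__main__":
    # Hypothetical DFG with delays on edges and an illustrative retiming solution.
    edges = [(1, 2, 1), (2, 3, 1), (2, 4, 2), (3, 1, 0), (4, 1, 0)]
    r = {1: 1, 2: 0, 3: 0, 4: 0}
    print(retime(edges, r))
```

In this example the delay count around each loop is unchanged (2 for 1 → 2 → 3 → 1 and 3 for 1 → 2 → 4 → 1), while the zero-delay edges move from 3 → 1 and 4 → 1 to 1 → 2. With adders taking 1 time unit and multipliers 2, the longest zero-delay path drops from 3 to 2 time units.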

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as part of the research work. Initially the design of filters is performed using a retimed architecture, where the user can choose either clock period

Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of 4th-order elliptic filter.

minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for the same. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file which was automatically generated after retiming for further component optimization. Using a drop down menu, the user can choose adders and multipliers for the retimed digital filters according to the design requirements. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy to use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of choice. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are as below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, for an $m \times n$ array multiplier we need $m \times n$ AND gates, $n$ half adders, and $(m - 2) \times n$ full adders.
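The double sum over bit products can be checked with a few lines of software. The following Python sketch models the AND-gate partial products of an unsigned array multiplier arithmetically; it does not model the adder array itself, and the function name is hypothetical.

```python
def array_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers by summing the bit products A_i * B_j * 2^(i+j)."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            # One AND gate per (i, j) pair, weighted by its bit position.
            product += (a_i & b_j) << (i + j)
    return product


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))
```

The inner statement runs $n^2$ times for $n$-bit operands, which matches the AND gate count quoted above for the square case.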

Radix 4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four modified Booth encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
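The halving of the partial product count can be seen in a behavioural model of radix-4 Booth recoding. The Python sketch below is an arithmetic illustration only, assuming two's-complement operands and an even word length; the function names are hypothetical and the hardware blocks (encoders, multiplexers, sign extension) are not modelled.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier into n/2 digits in {-2, -1, 0, 1, 2}."""
    if n % 2:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    # Python's >> on negative integers is arithmetic, so this yields two's-complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for k in range(0, n, 2):
        # Each overlapping three-bit group (bits k+1, k, k-1) maps to one signed digit.
        digits.append(-2 * bits[k + 1] + bits[k] + prev)
        prev = bits[k + 1]
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Multiply x by the n-bit value y using one partial product per Booth digit."""
    return sum((d * x) << (2 * k) for k, d in enumerate(booth_radix4_digits(y, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(7, 4), booth_multiply(5, 7, 4))
    print(len(booth_radix4_digits(123456, 32)))
```

Each digit selects one of the multiples 0, ±x, or ±2x, which is why a five-input selection per partial product is sufficient, and a 32-bit multiplier yields sixteen digits.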

Vedic Multiplier. It is used for faster multiplication operations on operands with a large number of bits. It has less combinational path delay [21] than the others when the bit size is large. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertical & crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of the FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit adder is constructed by cascading full adder blocks in series, with the carry-out of one stage fed directly to the carry-in of the next stage. An $n$-bit parallel adder requires $n$ full adders.

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using a Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.

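Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |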

The basic equations used in parallel prefix adders are given below. The equations of bit generate and propagate are

$$
\begin{aligned}
G_{0:0} &= G_0 = c_{\mathrm{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned}
\tag{16}
$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \tag{17}$$
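The generate/propagate recurrences and the sum equation can be exercised with a small behavioural model. The Python sketch below uses a Kogge-Stone style prefix network purely for brevity; a Brent-Kung adder applies the same combining operator over a sparser tree. The function name is hypothetical, and position 0 is used for the carry-in as in the equations above.

```python
def prefix_add(a: int, b: int, n: int, c_in: int = 0) -> int:
    """Add two unsigned n-bit integers using group generate/propagate signals."""
    # Position 0 models the carry-in (G_0 = c_in, P_0 = 0); positions 1..n hold the operand bits.
    g = [c_in] + [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    d = 1
    while d <= n:
        # Combine each group with the group that ends d positions lower.
        G_new, P_new = G[:], P[:]
        for i in range(d, n + 1):
            G_new[i] = G[i] | (P[i] & G[i - d])
            P_new[i] = P[i] & P[i - d]
        G, P = G_new, P_new
        d *= 2
    # G[i] now spans positions i down to 0, so it is the carry out of position i.
    total = 0
    for i in range(1, n + 1):
        s_i = p[i] ^ G[i - 1]  # sum bit = bit propagate XOR incoming carry
        total |= s_i << (i - 1)
    return total | (G[n] << n)  # append the final carry-out


if __name__ == "__main__":
    print(prefix_add(0b1011, 0b0110, 4))
```

The number of combining levels grows with the logarithm of the word length, which is the source of the speed advantage over a ripple-carry chain.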

Different adders are designed for different bit sizes, and their VLSI design metrics are compared as shown in Table 2. The delay reported is the combinational path delay after synthesis. It is measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is created for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are specific to a particular design constraint and generate the RTL for the same. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{\mathrm{cub}}$ based algorithm is used for implementing MCM blocks in multiplierless digital filters for the specific user defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, they are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of different components, and this depth is assumed to be infinity in our work. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be reiterated due to a specification change, the time taken to reiterate is very small.

Figure 12: GUI for design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path along with the critical path is computed and communicated back to the processor, where retiming is performed. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device considered for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the synthesis and timing results after synthesis are shown in Table 3. The FPGA based path solver computes the critical path and shortest path and communicates the results to the processor where retiming is performed. This reduces the burden on the main processor.

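Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |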

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Number of slices | Number of LUTs | Number of slice flip-flops | Min period in ns | Setup time in ns | Hold time in ns | Max frequency (Hz) |
|---|---|---|---|---|---|---|---|
| Critical path solver | 5804 | 10462 | 3664 | 9068 | 1572 | 6141 | 110277 |
| Shortest path solver | 4147 | 7511 | 1496 | 14089 | 10477 | 4114 | 70978 |

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the general purpose processor based path solvers. The time taken by the FPGA path solvers is compared in Table 4 with the time taken by the algorithms run on a general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this, the time gain achieved is substantial when compared to designs without FPGA based path solvers. These time gains help speed up the results, which is crucial for retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed, and they are compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the filter's iteration bound. Here three models are considered. Model1 is the filter without retiming, built from adder, subtractor, multiplier, and delay elements. Model2 is the retimed filter based on the clock period minimization algorithm. Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (series: clock period and register count for Model 1, Model 2, and Model 3; filters: IIR-2 to IIR-12 and FIR-2 to FIR-12).

might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results for FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In


Table 4: Computation time comparison.

                Critical path solver algorithm                Shortest path solver algorithm
                IIR filter            FIR filter              IIR filter            FIR filter
Filter order    FPGA (ns)   GPP (ms)  FPGA (ns)   GPP (ms)    FPGA (ns)   GPP (ms)  FPGA (ns)   GPP (ms)
2               460         138       906         1283        278         305       305         1280
4               1571        1578      1631        1446        368         1391      1391        1319
6               2998        1918      1923        1547        398         1542      1542        1731
8               3162        2190      2971        1642        452         2523      2523        3294
10              3981        2627      3653        1861        536         4293      4293        4534
12              4672        3142      4328        2352        671         5534      5534        5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

                                                      Before retiming                          After retiming
Filter block  Adder/multiplier combination            LUTs   Max freq (MHz)  Power (mW)        LUTs   Max freq (MHz)  Power (mW)
IIR-10        Brent-Kung adder / array multiplier     2222   62526           99                2411   76977           89
IIR-10        Ling adder / Vedic multiplier           2214   69702           112               2193   95381           94
IIR-10        Ripple carry adder / Booth multiplier   2146   50861           114               1809   65248           95
FIR-10        Brent-Kung adder / array multiplier     1736   62526           94                1811   9943            85
FIR-10        Ling adder / Vedic multiplier           2162   72493           111               2271   10072           95
FIR-10        Ripple carry adder / Booth multiplier   1637   52302           105               1615   71345           87

the GUI, a particular adder and multiplier combination is selected depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, any of the available adders and multipliers can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here, three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements.

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach.

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits before and after retiming is retained. The frequency improvement seen for different filters across the above models is given in Figure 14. It is seen that the frequency parameter improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor (x-axis: filter type, FIR-2 to FIR-12 and IIR-2 to IIR-12; y-axis: frequency improvement; series: frequency improvement from model 1 to model 2 and from model 1 to model 3).

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings


Table 6: Comparison of area, delay, and power for different models of various digital filters.

              Adders/multipliers/flip-flops       Delay: max freq in MHz          Power in watts
Filter block  Model 1    Model 2    Model 3       Model 1  Model 2  Model 3       Model 1  Model 2  Model 3
FIR-2         5/2/3      5/0/3      5/0/4         5154     19214    34062         0056     0063     0065
FIR-4         10/3/5     11/0/5     11/0/8        5941     10841    22204         0047     0057     0060
FIR-6         7/2/7      17/0/7     17/0/14       6291     6764     25947         0051     0062     0064
FIR-8         15/5/9     22/0/9     22/0/16       5482     6592     11791         0054     0058     0065
FIR-10        18/6/11    25/0/11    25/0/11       4822     5637     10072         0058     0061     0063
FIR-12        20/7/13    29/0/13    29/0/13       4634     5486     19340         0060     0063     0067
IIR-2         9/4/3      11/0/3     11/0/3        5503     7553     8910          0047     0050     0050
IIR-4         16/7/5     20/0/5     19/0/6        2278     11388    15165         0059     0062     0063
IIR-6         24/11/7    35/0/7     35/0/8        3871     4254     53142         0051     0059     0058
IIR-8         30/11/10   33/0/10    33/0/7        2946     7014     11021         0044     0064     0081
IIR-10        37/16/13   54/0/13    54/0/14       3643     4885     95381         0051     0067     0085
IIR-12        42/20/17   63/0/17    63/0/19       3973     5074     10152         0063     0071     0088

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded.

are often corrupted by high-frequency noises such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses the retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.
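As an illustration of the power-line noise removal described above, the following Python sketch applies a second-order IIR notch filter to a synthetic signal. It is a floating-point stand-in and not the retimed multiplierless MCM filter implemented on the FPGA; the 500 Hz sampling rate, the quality factor, the 1 Hz sine used in place of an ECG trace, and all function names are assumptions made for this example.

```python
import cmath
import math


def notch_coefficients(f0, fs, q):
    """Normalized biquad notch coefficients (b, a) centered at f0 Hz."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def biquad(b, a, x):
    """Direct form I: y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def tone_amplitude(x, freq, fs):
    """Amplitude of the `freq` Hz component, via a single-bin DFT projection."""
    acc = sum(xn * cmath.exp(-2j * math.pi * freq * k / fs) for k, xn in enumerate(x))
    return 2.0 * abs(acc) / len(x)


if __name__ == "__main__":
    fs = 500.0  # assumed sampling rate in Hz
    # Synthetic input: 1 Hz "signal" plus 50 Hz interference of amplitude 0.5.
    noisy = [math.sin(2 * math.pi * 1.0 * k / fs) + 0.5 * math.sin(2 * math.pi * 50.0 * k / fs)
             for k in range(2000)]
    b, a = notch_coefficients(50.0, fs, q=30.0)
    clean = biquad(b, a, noisy)
    # Compare the 50 Hz content over the last second, after the transient has decayed.
    print("50 Hz before:", tone_amplitude(noisy[-500:], 50.0, fs))
    print("50 Hz after: ", tone_amplitude(clean[-500:], 50.0, fs))
```

A hardware version would quantize these coefficients to fixed point, after which the constant multiplications can be replaced by shift-and-add networks.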

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication, and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for specific VLSI design constraints such as power, area, and timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves a lot of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing the path solver architectures for the critical path and shortest path on FPGA, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain compared with filter designs for which the path solver algorithms are implemented as a part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that fits best in an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456-467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373-398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358-375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551-555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1-21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1-6, pp. 5-35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371-375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597-601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831-844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480-485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150-155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674-2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74-83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449-453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738-748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273-278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001-1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65-II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225-231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013-1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579-584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17-22, IEEE, 2013.


Figure 6: Critical path solver state diagram (states S0 to S8; transition labels: incidence and node matrix obtained, loop indices updated, weight temp updated, zero-weight path updated, superset of zero-weight path found, all zero path delay found, all path delay found, next greater path delay found, greatest path delay found).

Algorithm 2: Algorithm for computing the critical path.

Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of the node and $d$ is the initial delay on edge $E$
Output: critical path $C$

Sort all the vertices in the DFG $G$ topologically, with $v$ following $u$ if there is a zero-delay edge $u \rightarrow v$
for all vertices $v$ from the sorted list
    if every edge into $v$ in $G$ has a nonzero delay then
        $\gamma_i = t_v$
    else
        $\gamma_i = t_v + \max(\gamma_u)$ over the edges $e: u \rightarrow v$ in $G$ with $d_e = 0$
    end if
    $\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_m\}$, where $m$ is the number of entries in the topologically sorted list
end for
compute $C = \max(\gamma)$
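The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal software illustration, not the HDL implementation; the function name, the edge-list representation, and the small four-node DFG in the demo (two adders with unit delay, two multipliers with delay 2) are assumptions chosen for the example.

```python
from collections import deque


def critical_path(t, edges):
    """Longest zero-delay path in a DFG.

    t:     dict mapping node -> computation time
    edges: list of (u, v, delay) tuples
    Returns (path_delay, [nodes on the path in order]).
    """
    zero_succ = {v: [] for v in t}
    indegree = {v: 0 for v in t}
    for u, v, delay in edges:
        if delay == 0:
            zero_succ[u].append(v)
            indegree[v] += 1

    # Topological sort of the zero-delay subgraph (Kahn's algorithm).
    order = []
    queue = deque(v for v in t if indegree[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in zero_succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(t):
        raise ValueError("zero-delay cycle found: the DFG is not computable")

    # gamma[v] = t[v] + max(gamma[u]) over zero-delay edges u -> v.
    gamma = dict(t)
    pred = {v: None for v in t}
    for u in order:
        for v in zero_succ[u]:
            if gamma[u] + t[v] > gamma[v]:
                gamma[v] = gamma[u] + t[v]
                pred[v] = u

    # Walk back from the node with the largest accumulated delay.
    node = max(order, key=lambda v: gamma[v])
    length, path = gamma[node], []
    while node is not None:
        path.append(node)
        node = pred[node]
    return length, path[::-1]


if __name__ == "__main__":
    t = {1: 1, 2: 1, 3: 2, 4: 2}  # nodes 1, 2: adders; nodes 3, 4: multipliers
    edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
    print(critical_path(t, edges))  # (3, [3, 2])
```

Ties between equally long paths are resolved in favour of the first one encountered, so the demo reports the path through node 3 rather than the equally long path through node 4.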

Since HDL does not provide a method to represent infinity, some number, say 255, can be chosen which is always greater than any other weight in the incidence matrix. Also, since edge weight 0 is a valid input, any negative number, say -1, can be used to denote an uninitialized matrix element. In state S1, all the zero-weight edges in the DFG are found along with their source and destination nodes and are stored in a matrix called the zero weight path matrix. This matrix contains two columns. The first column contains the source node of a directed zero-weight edge, while the second column contains the destination node of the directed zero-weight edge. Simultaneously, a count is kept of the number of zero-weight edges.
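The scan performed in state S1 can be mimicked in software as below. This is only a sketch: the matrix layout (entry [src][dst] holds the edge weight), the constant names, and the example matrix are assumptions, while the sentinel values 255 and -1 follow the text.

```python
INF = 255    # stands in for "no edge" (infinity), as suggested in the text
UNSET = -1   # marks an uninitialized matrix element


def find_zero_weight_edges(weights):
    """Collect [source, destination] rows for every zero-weight edge.

    weights[src][dst] is the edge weight, or INF when there is no edge.
    Returns (zero_weight_path, count).
    """
    n = len(weights)
    # Preallocate the table the way a fixed-size hardware array would be.
    zero_weight_path = [[UNSET, UNSET] for _ in range(n * n)]
    count = 0
    for src in range(n):
        for dst in range(n):
            if weights[src][dst] == 0:
                zero_weight_path[count] = [src, dst]
                count += 1
    return zero_weight_path[:count], count


if __name__ == "__main__":
    # Hypothetical 4-node DFG; nodes are numbered from 0 here.
    weights = [
        [INF, INF, 1,   2],
        [1,   INF, INF, INF],
        [INF, 0,   INF, INF],
        [INF, 0,   INF, INF],
    ]
    print(find_zero_weight_edges(weights))  # ([[2, 1], [3, 1]], 2)
```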

State S2 is provided to enable looping action and to update all the signals. In state S3, in each row of the zero weight path matrix, the module finds the next node with a zero-weight edge connecting it to the node in the previous column (if it exists). Thus, if the destination node of any zero-weight path is the same as the source node of another zero-weight path, the two paths are concatenated; that is, if the destination node of path a is the source node of path b, then the destination node of b becomes the destination node of a. State S4 is provided to enable looping action and to update all the signals. At the end of this state, the zero weight path matrix contains only those paths that are supersets of the remaining zero-weight paths. In state S5, the module calculates the sum of all the node weights along each of these paths. State S6 is provided for looping action and for updating all signals.

In state S7, the path with the highest sum of node weights is found; this is the critical path of the DFG. All the nodes in this path are then stored in order in a matrix called the critical


Algorithm 3: Algorithm for computing the shortest path.

Input: a DFG $G = (V, E, t, d)$, where $t$ is the computation time of the node and $d$ is the initial delay on edge $E$
Output: all-pair shortest path matrix $M$

for i = 1 to N
    for j = 1 to N
        if i = j then
            M[i, j] = 0
        else
            M[i, j] = inf
    end for
end for
for all the edges $e: u \rightarrow v$, M[u, v] = d for edge e
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
        end for
    end for
end for
Output the shortest path matrix M
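Algorithm 3 maps directly onto a few lines of Python. The sketch below is a plain software rendering of the pseudocode, using `math.inf` where the hardware version uses a large sentinel; the function name and the four-node example graph (nodes numbered from 0) are assumptions for illustration.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path matrix for nodes 0..n-1.

    edges: list of (u, v, weight) tuples; the graph is assumed to contain
    no negative-weight cycles.
    """
    # Initialization: 0 on the diagonal, infinity elsewhere.
    m = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        m[u][v] = min(m[u][v], w)  # keep the lighter of parallel edges

    # Relax every pair (i, j) through each intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][j] > m[i][k] + m[k][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m


if __name__ == "__main__":
    edges = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]
    for row in floyd_warshall(4, edges):
        print(row)
```

The triple loop is what gives the $O(n^3)$ cost; since the inner two loops touch independent matrix entries for a fixed k, the computation lends itself to a state-machine implementation in hardware.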

Figure 7: Zero path delays and critical path for 4th-order low pass elliptic filter (x-axis: vertex number in filter DFG; y-axis: zero delay path for filter DFG).

path matrix. The signals in this matrix are output as the critical path. State S8 is provided to enable looping action and to update all the signals. The state machine then goes back to state S0 and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve a system of inequalities. It is seen that the time needed to compute the critical path on the FPGA is considerably less than that on a general purpose processor. This also reduces the retiming computation time. The zero delay paths are computed for the 4th-order elliptic filter shown in Figure 7. The highlighted path is 1 → 9 → 5 → 14 → 8, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$, and let $T(u, v)$ be the total computation time of the zero-delay path from $u$ to $v$. If the condition $T(u, v) - \min\{t(u), t(v)\} >$ derived clock period holds, those paths are selected for retiming so that the computation time along them can be reduced. We have to retime the edges by constructing a system of linear inequalities. This can be done using the Floyd-Warshall shortest path algorithm (Algorithm 3). This can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all-pairs shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden on the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. The looping variables are updated in S1, and then the transition from S1 to S0 occurs. The transition from S0 to S2 occurs after the incidence matrix is completely copied to the signal weight temp. In state S2, the signal weight temp is operated upon to obtain the pairwise shortest path matrix, with state S3 enabling looping action. The transition from S2 to S3 takes place after each pairwise path distance is found. The looping variables are updated in S3, and then the transition from S3 to S2 occurs. The transition from S2 to S4 occurs after all the pairwise shortest paths are stored in the signal weight temp. In state S4, the elements of the signal matrix weight temp are copied to the output matrix.


Figure 8: Shortest path solver state diagram (transition labels: incidence matrix copied to weight temp, loop indices updated, weight temp updated, pairwise shortest path found, all pair shortest paths found, weight temp complete array to output matrix, weight temp copied to output matrix).

State S5 enables looping action for S4. The transition from S4 to S0 occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure. This is shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find a ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small, with A-operations composed of the minimum number of addition, subtraction, and shift operations. Once this set is obtained, multiplierless multiple constant multiplication filters can be designed with it. Multiple constant multiplication (MCM) is an efficient way of implementing


Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtractors. By removing the redundancy between the coefficients, the number of adders and subtractors is reduced, which results in a low complexity implementation. Retiming for multiplierless MCM filters is still unexplored in the literature, and the authors have combined retiming with multiplierless MCM filters, which shows a decrease in the combinational path delay. For a filter graph $G$, the multiplierless MCM filter is designed using the target set and A-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{cub}$ algorithm is used for $G_i$ computation. The input to the $H_{cub}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

$$S = \{ s \mid \operatorname{dist}(R, s) = 1 \} = A_s(R, R). \tag{11}$$
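To make the successor set concrete, the following Python sketch enumerates odd constants that are one add/subtract-and-shift operation away from a ready set. It is a simplified illustration rather than the full A-operation of [18]: the shift range `max_shift`, the magnitude bound `limit`, and the function names are assumptions made for this example.

```python
def odd_part(x):
    """Divide out all factors of two (x must be positive)."""
    while x % 2 == 0:
        x //= 2
    return x


def successor_set(ready, max_shift=4, limit=255):
    """Odd constants reachable from `ready` with a single operation.

    Each candidate has the form (u << shift) + v or |(u << shift) - v| with
    u, v taken from the ready set; results are normalized to odd values,
    because a trailing shift costs no adder.
    """
    succ = set()
    for u in ready:
        for v in ready:
            for shift in range(max_shift + 1):
                for cand in ((u << shift) + v, abs((u << shift) - v)):
                    if cand == 0:
                        continue
                    c = odd_part(cand)
                    if c <= limit and c not in ready:
                        succ.add(c)
    return succ


if __name__ == "__main__":
    print(sorted(successor_set({1})))  # [3, 5, 7, 9, 15, 17]
```

Starting from $R = \{1\}$, every constant of the form $2^k \pm 1$ within the shift range appears in $S$, which matches the intuition that such coefficients cost a single adder or subtractor.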

For the target set of constants $T$ of the considered filter graph $G$, the $H_{cub}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. If the targets are found in $S$, then the synthesis is optimal. The heuristic function $H(R, S, T)$ is used when no more targets are found in $S$, which happens when all the targets are more than one A-operation away. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized. If every target is synthesized in this way, the solution is optimal. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit,

(ii) cumulative benefit.

To build the heuristic, we define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \tag{12}$$

A successor $s \in S$ that is closest to the target set needs to be picked to minimize the cost. This is possible if we can compute or estimate the A-distance. It is useful to also take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we use the weighted benefit function

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left( \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t) \right), \tag{13}$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{cub}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \tag{14}$$
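The selection step defined by the weighted and cumulative benefit can be sketched as follows. This is an illustration under strong simplifying assumptions: `est_dist` is a crude, optimistic stand-in for the A-distance (it counts closure levels of a bounded successor operation), and the helper names, shift range, and magnitude bound are all hypothetical. The actual $H_{cub}$ algorithm of [18] uses more elaborate distance estimates.

```python
def odd_part(x):
    """Divide out all factors of two (x must be positive)."""
    while x % 2 == 0:
        x //= 2
    return x


def successors(ready, max_shift=4, limit=255):
    """Odd constants one add/subtract-and-shift step away from `ready`."""
    out = set()
    for u in ready:
        for v in ready:
            for shift in range(max_shift + 1):
                for cand in ((u << shift) + v, abs((u << shift) - v)):
                    if cand and odd_part(cand) <= limit:
                        out.add(odd_part(cand))
    return out - set(ready)


def est_dist(ready, target, max_depth=3):
    """Optimistic distance estimate: closure levels until `target` appears."""
    level = set(ready)
    for depth in range(max_depth + 1):
        if target in level:
            return depth
        level |= successors(level)
    return max_depth + 1


def weighted_benefit(ready, s, target, dist=est_dist):
    """Weighted benefit: 10^(-dist(R+s, t)) * (dist(R, t) - dist(R+s, t))."""
    before = dist(ready, target)
    after = dist(set(ready) | {s}, target)
    return 10.0 ** (-after) * (before - after)


def pick_successor(ready, targets, dist=est_dist):
    """Successor with the largest weighted benefit summed over all targets."""
    candidates = sorted(successors(ready))
    return max(candidates,
               key=lambda s: sum(weighted_benefit(ready, s, t, dist) for t in targets))


if __name__ == "__main__":
    # Target 11 is two steps from {1}; several successors bring it one step closer.
    print(pick_successor({1}, {11}))
```

In this sketch, ties are broken toward the smallest candidate. With $R = \{1\}$ and target 11, the successor 17 has zero benefit because 11 is still two steps away after adding it, whereas successors such as 3 (since $11 = 8 + 3$) reduce the estimated distance to one.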

Here, the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this method, the target set is synthesized, and with it the multiplierless MCM based filter graph can be designed. It is found that multiplierless designs reduce the combinational path delays owing to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two different optimization techniques reduce the combinational delay and critical path


Figure 10: Multiplierless MCM based 4th-order elliptic filter.

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{cub}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. It is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces the computation time compared with previous methods [8, 16] for designing multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also gives a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here, a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming; Design2 is the multiplierless MCM based filter without retiming; Design3 is the filter with multipliers and with retiming; and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter increases by 196 in the multiplierless MCM approach, as multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
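The last step of the retiming algorithm described above, computing the retimed edge weights from the retiming values, follows the standard relation $w_r(e) = w(e) + r(v) - r(u)$ for an edge $e: u \rightarrow v$. The Python sketch below applies it to a small hypothetical DFG; the function name, the data layout, and the retiming values are assumptions for illustration, and the retiming values are taken as given rather than solved from the constraint graph.

```python
def retime(edges, r):
    """Apply retiming values to a DFG.

    edges: list of (u, v, w) tuples, w = number of delays on edge u -> v
    r:     dict mapping node -> integer retiming value
    Returns the retimed edge list with w_r = w + r[v] - r[u].
    Raises ValueError if any retimed weight would be negative (infeasible).
    """
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"infeasible retiming: edge {u}->{v} gets weight {w_r}")
        retimed.append((u, v, w_r))
    return retimed


if __name__ == "__main__":
    # Hypothetical DFG: nodes 1, 2 are adders; nodes 3, 4 are multipliers.
    edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
    r = {1: 0, 2: 1, 3: 0, 4: 0}
    print(retime(edges, r))
    # [(1, 3, 1), (1, 4, 2), (3, 2, 1), (4, 2, 1), (2, 1, 0)]
```

In this example the delay on edge 2 → 1 moves onto the two edges entering node 2, so the multiplier-to-adder paths are no longer zero-delay. The total number of delays rises from 4 to 5, which mirrors the trade-off between clock period and register count noted above.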

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) together with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.
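As a small worked example of a multiplier being replaced by shifters and adders/subtractors, the sketch below computes two constant products with two add/subtract operations in total by sharing an intermediate result. The constants 7 and 21 and the function name are hypothetical and are not coefficients of the filters in this paper.

```python
def mcm_7_21(x: int):
    """Compute (7*x, 21*x) using shifts and two add/subtract operations."""
    x7 = (x << 3) - x       # 7x = 8x - x
    x21 = (x7 << 1) + x7    # 21x = 14x + 7x, reusing the shared 7x term
    return x7, x21


if __name__ == "__main__":
    print(mcm_7_21(5))  # (35, 105)
```

Computing 21x on its own from x would take two operations (for example, 16x + 4x + x), so sharing the 7x term saves one adder across the pair; this is the kind of redundancy removal that the MCM approach exploits.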

Figure 11: Comparison of operating frequency and number of registers for different filter designs of 4th-order elliptic filter. [Bar chart: frequency in MHz and number of registers for Design1 to Design4; vertical axis 0 to 80.]

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which was designed as part of this research work. Initially, the filter is designed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user inputs the HDL file that was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filter using a drop-down menu. The original HDL is automatically modified with respect to the chosen components; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of his or her choice. With this, the designer can conveniently explore the solution space of possible architectures and analyze the trade-offs in the energy-area-performance space [20]. The adders and multipliers considered in the tool are described below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing, and innumerable schemes have been proposed for its realization. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

\[
A = \sum_{i=0}^{n-1} A_i 2^{i}, \qquad
B = \sum_{j=0}^{n-1} B_j 2^{j}, \qquad
P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}
\]

In each stage the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \cdot n$ AND gates, $n$ half adders, and $(m-2) \cdot n$ full adders.
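As a behavioural sketch of (15), assuming unsigned operands: each AND gate forms one bit product $A_i B_j$, and the adder array accumulates it with weight $2^{i+j}$. The function name and the 4-bit operands are illustrative only; the sketch models the arithmetic, not the gate-level structure.

```python
def array_multiply(a, b, n):
    """Multiply two unsigned n-bit integers by summing AND-gated bit products."""
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            # One AND gate per (i, j) pair, weighted by 2^(i+j).
            product += (a_i & b_j) << (i + j)
    return product


print(array_multiply(13, 11, 4))
```

The double loop visits $n \cdot n$ bit pairs, which matches the AND gate count quoted above for the case $m = n$.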

Radix 4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans overlapping strings of three bits of the multiplier and converts each according to the modified Booth encoder table. The Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
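The halving of the partial products can be seen in a small behavioural sketch of radix-4 (modified) Booth recoding. It assumes a two's-complement multiplier with an even number of bits and uses the standard digit rule $d_k = y_{2k-1} + y_{2k} - 2y_{2k+1}$ with $y_{-1} = 0$; the function names are hypothetical, and the sketch models the arithmetic only, not the encoder, multiplexer, and adder hardware described above.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier (n even) into n/2 digits in -2..2."""
    y &= (1 << n) - 1           # keep the n-bit two's-complement pattern
    digits = []
    prev = 0                    # y[-1], the implicit bit right of the LSB
    for k in range(0, n, 2):
        b0 = (y >> k) & 1
        b1 = (y >> (k + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, n):
    """Sum one partial product (digit * x * 4^k) per recoded digit."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(6, 4), booth_multiply(5, -3, 4))
```

For a 32-bit multiplier the recoding yields 16 digits, so only 16 partial products (each 0, ±x, or ±2x, suitably shifted) have to be added.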

Vedic Multiplier. It is used for faster multiplication of operands with higher bit counts. It has a smaller combinational path delay [21] than the other multipliers when the bit size is large. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam Sutra, whose name means "vertically and crosswise" and which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which all partial products are generated concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be generated easily.
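The "vertically and crosswise" idea can be sketched behaviourally as a column-wise multiplication: column $k$ collects every bit product $a_i b_j$ with $i + j = k$, and the carry of each column is passed to the next. This is a simplified software model under the assumption of unsigned operands; the function name is hypothetical, and in hardware the column sums would be formed concurrently rather than in a loop.

```python
def urdhva_multiply(a, b, n):
    """Multiply two unsigned n-bit integers column by column (vertical and crosswise)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry = 0, 0
    for k in range(2 * n - 1):
        lo, hi = max(0, k - n + 1), min(k, n - 1)
        # Crosswise products for this column: all (i, j) with i + j = k.
        column = carry + sum(a_bits[i] & b_bits[k - i] for i in range(lo, hi + 1))
        result |= (column & 1) << k
        carry = column >> 1
    # The final carry becomes the most significant product bit.
    return result | (carry << (2 * n - 1))


print(urdhva_multiply(13, 11, 4))
```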

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit adder is constructed by cascading $n$ full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage.
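A minimal bit-level sketch of this cascade, assuming unsigned operands, is given below. The function names are illustrative; the loop makes the serial carry dependence visible, which is why the delay of this adder grows with the bit width.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, n, cin=0):
    """Add two unsigned n-bit integers with n cascaded full adders."""
    carry = cin
    total = 0
    for i in range(n):
        # Stage i cannot finish before the carry of stage i-1 is known.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(9, 7, 4))
```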

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among all the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using a Kogge-Stone parallel prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


Table 1: Comparison of multipliers for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
multiplier   32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Array        761     399     21       21      11      7        1519    375     91
Booth        861     2799    149      25      15      12       1277    317     77
Vedic        707     3902    244      28      18      12       2378    565     126

The basic equations used in parallel prefix adders are given below. The equations for bit generate and propagate are

\[
\begin{aligned}
G_{0:0} &= G_0 = c_{\text{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned} \tag{16}
\]

The sum generation is given by

\[
S_i = P_i \oplus G_{(i-1):0}. \tag{17}
\]
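Equations (16) and (17) can be exercised with a short behavioural sketch of a Kogge-Stone style prefix adder. As in (16), position 0 carries the carry-in ($G_0 = c_{\text{in}}$, $P_0 = 0$) and positions 1 to $n$ hold the operand bits. The function name and the list-based representation are assumptions made for illustration; this is not the Ling or Brent-Kung structure evaluated in the paper.

```python
def kogge_stone_add(a, b, n, cin=0):
    """Add two unsigned n-bit integers using group generate/propagate prefixes."""
    # Bit-level signals; index 0 is the carry-in position from (16).
    g = [cin] + [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    d = 1
    while d <= n:
        new_G, new_P = G[:], P[:]
        for i in range(d, n + 1):
            # Prefix combine: G_{i:j} = G_{i:k} + P_{i:k} * G_{(k-1):j}
            new_G[i] = G[i] | (P[i] & G[i - d])
            new_P[i] = P[i] & P[i - d]
        G, P = new_G, new_P
        d *= 2
    total = 0
    for i in range(1, n + 1):
        # Sum bit from (17): S_i = P_i XOR G_{(i-1):0}
        total |= (p[i] ^ G[i - 1]) << (i - 1)
    return total, G[n]


print(kogge_stone_add(9, 7, 4))
```

The outer loop runs about $\log_2 n$ times, in contrast to the $n$ serial stages of the ripple adder, which is the source of the speed advantage of parallel prefix adders.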

Different adders are designed for different bit sizes and their VLSI design metrics are compared, as shown in Table 2. The reported delay is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is provided for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI is shown in Figure 12. An $H_{\text{cub}}$ based algorithm is used for implementing MCM blocks in multiplierless digital filters when the user selects this option in DiFiDOT. Since all the multipliers can be realized as a block in transposed IIR and FIR filters, these filters are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of different components, and in our work this depth is assumed to be infinity. DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be repeated due to a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on an FPGA is to speed up the computations needed for retiming. The inputs are passed to the FPGA based path solver block by a processor on which the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path along with the critical path is computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.


Table 2: Comparison of adders for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
adder        32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Ling         8854    1524    2021     6       9       18       23      53      107
Brent-Kung   104     1839    2583     4       6       9        15      30      63
Ripple       1212    2063    376      2       7       14       9       18      36

Table 3: Device utilization and timing summary of path solvers.

Path solver name       Device utilization summary            Timing summary                                         Max frequency (Hz)
                       Logic utilization            Used     Min period in ns  Setup time in ns  Hold time in ns
Critical path solver   Number of slices             5804     9068 ns           1572 ns           6141 ns           110277
                       Number of LUTs               10462
                       Number of slice flip-flops   3664
Shortest path solver   Number of slices             4147     14089 ns          10477 ns          4114 ns           70978
                       Number of LUTs               7511
                       Number of slice flip-flops   1496

Here various IIR and FIR filters are considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in much less time than the GPP based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the same algorithms run on the general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, and it speeds up the path computations that are crucial for retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that the clock period is reduced after retiming. The register count changes depending on the filter's iteration bound. Here three models are considered: Model1 is the filter without retiming, built from adder, subtractor, multiplier, and delay elements; Model2 is the retimed filter based on the clock period minimization algorithm; Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit to check its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks. [Bar chart over IIR-2, FIR-2, IIR-4, FIR-4, ..., IIR-12, FIR-12; series: clock period and register count for Model 1, Model 2, and Model 3; vertical axis 0 to 40.]

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results for the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, any of the available components can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

Filter   Critical path solver algorithm                    Shortest path solver algorithm
order    IIR filter               FIR filter               IIR filter               FIR filter
         FPGA based  GPP based    FPGA based  GPP based    FPGA based  GPP based    FPGA based  GPP based
         (ns)        (ms)         (ns)        (ms)         (ns)        (ms)         (ns)        (ms)
2        460         138          906         1283         278         305          305         1280
4        1571        1578         1631        1446         368         1391         1391        1319
6        2998        1918         1923        1547         398         1542         1542        1731
8        3162        2190         2971        1642         452         2523         2523        3294
10       3981        2627         3653        1861         536         4293         4293        4534
12       4672        3142         4328        2352         671         5534         5534        5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

Filter   Adder/multiplier combination            Before retiming                           After retiming
block                                            Number    Max operating  Power            Number    Max operating  Power
                                                 of LUTs   freq in MHz    in mW            of LUTs   freq in MHz    in mW
IIR-10   Brent-Kung adder/array multiplier       2222      62526          99               2411      76977          89
         Ling adder/Vedic multiplier             2214      69702          112              2193      95381          94
         Ripple carry adder/Booth multiplier     2146      50861          114              1809      65248          95
FIR-10   Brent-Kung adder/array multiplier       1736      62526          94               1811      9943           85
         Ling adder/Vedic multiplier             2162      72493          111              2271      10072          95
         Ripple carry adder/Booth multiplier     1637      52302          105              1615      71345          87

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results for the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is the same before and after retiming. The frequency improvement seen for different filters under the above models is given in Figure 14. It is seen that the operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. [Bar chart: frequency improvement (%) by filter type, FIR-2 to FIR-12 and IIR-2 to IIR-12; series: frequency improvement from model 1 to model 2 and from model 1 to model 3; vertical axis 0 to 90.]

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for the interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on an FPGA and tested with an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Table 6: Comparison of area, delay, and power for different models of various digital filters.

Filter   Adder, multipliers, flip-flops (delay)   Max freq in MHz               Power in watts
block    Model 1   Model 2   Model 3              Model 1  Model 2  Model 3     Model 1  Model 2  Model 3
FIR-2    523       503       504                  5154     19214    34062       0.056    0.063    0.065
FIR-4    1035      1105      1108                 5941     10841    22204       0.047    0.057    0.060
FIR-6    727       1707      17014                6291     6764     25947       0.051    0.062    0.064
FIR-8    1559      2209      22016                5482     6592     11791       0.054    0.058    0.065
FIR-10   18611     25011     25011                4822     5637     10072       0.058    0.061    0.063
FIR-12   20713     29013     29013                4634     5486     19340       0.060    0.063    0.067
IIR-2    943       1103      1103                 5503     7553     8910        0.047    0.050    0.050
IIR-4    1675      2005      1906                 2278     11388    15165       0.059    0.062    0.063
IIR-6    24117     3507      3508                 3871     4254     53142       0.051    0.059    0.058
IIR-8    301110    33010     3307                 2946     7014     11021       0.044    0.064    0.081
IIR-10   371613    54013     54014                3643     4885     95381       0.051    0.067    0.085
IIR-12   422017    63017     63019                3973     5074     10152       0.063    0.071    0.088

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded. [Block diagram: ECG and noise sources are summed by an adder, passed through the filter, and displayed on a scope.]

6. Conclusions

In this paper we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly produces synthesizable filter RTL, which saves much of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and shortest path computations on an FPGA, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain when compared to designs in which the path solver algorithms run on the processor as part of the retiming algorithm. It is observed that a designer can find the synthesizable digital filter RTL that best fits an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


Algorithm for computing the shortest path
Input: a DFG G = (V, E, t, d), where t is the computation time of each node and d is the initial delay on each edge in E
Output: all-pair shortest path matrix M
for i = 1 to N
    for j = 1 to N
        if i = j then
            M[i, j] = 0
        else
            M[i, j] = inf
        end if
    end for
end for
for all edges e: u → v
    M[u, v] = d for edge e
end for
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            if M[i, j] > M[i, k] + M[k, j] then
                M[i, j] = M[i, k] + M[k, j]
            end if
        end for
    end for
end for
Output the shortest path matrix M

Algorithm 3
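A software rendering of Algorithm 3 is sketched below for reference. It assumes nodes numbered from 0 and edge weights given as a dictionary, which are conventions chosen here for illustration; the negative-cycle check is an addition that is not part of the listing, included because a constraint graph containing a negative cycle has no retiming solution. The example graph is hypothetical.

```python
import math


def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on nodes 0..n-1; edges is {(u, v): weight}."""
    M = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), d in edges.items():
        M[u][v] = min(M[u][v], d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if M[i][j] > M[i][k] + M[k][j]:
                    M[i][j] = M[i][k] + M[k][j]
    return M


def has_negative_cycle(M):
    """A negative diagonal entry means some cycle has negative total weight."""
    return any(M[i][i] < 0 for i in range(len(M)))


# Hypothetical 3-node graph with one negative edge weight.
example = {(0, 1): 4, (1, 2): -2, (0, 2): 5, (2, 0): 1}
M = all_pairs_shortest_paths(3, example)
print(M, has_negative_cycle(M))
```

The three nested loops give the $O(n^3)$ complexity; the innermost comparison is the operation that the FPGA state machine repeats for each node triple.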

Figure 7: Zero path delays and critical path for 4th-order low pass elliptic filter. [Plot: zero delay path for the filter DFG (vertical axis, 0 to 18) against vertex number in the filter DFG (horizontal axis, 1 to 5).]

path matrix. The signals in this matrix are output as the critical path. The state $T_8$ is provided to enable the looping action and to update all the signals. The state machine then goes back to state $S_0$ and awaits new inputs. Next, the algorithm to find the shortest path between two nodes in a graph is described. For the retiming technique in high level synthesis, we need the shortest path to solve the system of inequalities. It is seen that the time needed to compute the critical path on the FPGA is less than the time needed on a general purpose processor. This also reduces the retiming computation time. The zero delay paths are computed for the 4th-order elliptic filter, as shown in Figure 7. The highlighted path is 1 → 9 → 5 → 14 → 8, where nodes 1, 5, and 8 are adders and nodes 9 and 14 are multipliers. The maximum path delay, which is highlighted, is considered to be the critical path.
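The critical path computation described above, taking the longest computation time over all zero delay paths, can be sketched as follows. The sketch assumes that the zero-delay subgraph is acyclic; the function name, the four-node graph, and the computation times (1 for adders, 2 for multipliers) are hypothetical and do not correspond to the elliptic filter of Figure 7 or to the FPGA state machine.

```python
from functools import lru_cache


def critical_path(times, edges):
    """Return (delay, path) of the longest path that uses only zero-delay edges.

    times: {node: computation time}
    edges: {(u, v): number of delays on edge u -> v}
    """
    succ = {u: [] for u in times}
    for (u, v), w in edges.items():
        if w == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        best = (times[u], (u,))
        for v in succ[u]:
            t, path = longest_from(v)
            if times[u] + t > best[0]:
                best = (times[u] + t, (u,) + path)
        return best

    return max((longest_from(u) for u in times), key=lambda item: item[0])


# Hypothetical DFG: nodes 1 and 3 are adders, nodes 2 and 4 are multipliers.
node_times = {1: 1, 2: 2, 3: 1, 4: 2}
dfg_edges = {(1, 2): 0, (2, 3): 0, (3, 4): 1, (4, 1): 0}
print(critical_path(node_times, dfg_edges))
```

Edges that carry one or more delays are ignored because a register breaks the combinational path, so only zero-delay edges contribute to the clock period.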

3.2. Shortest Path Solver Algorithm and State Diagram. Let $D(u, v)$ be the maximum delay between nodes $u$ and $v$, and let $T(u, v)$ be the total computation time of the zero delay path from $u$ to $v$. If the condition $T(u, v) - \min\{t(u), t(v)\} >$ derived clock period holds, then those paths are selected for retiming so that the computation time along them can be reduced. The edges are retimed by constructing a system of linear inequalities, which can be solved using the Floyd-Warshall shortest path algorithm (Algorithm 3). The result can be used for retiming the graph further (Figure 6).

The Floyd-Warshall all-pair shortest path algorithm is designed and implemented as a part of the path solvers on FPGA [17], which reduces the computational burden on the general purpose processor where the actual retiming is carried out. The speed of computation is also increased to a large extent. The HDL program for the shortest path solver on FPGA was designed based on the state diagram shown in Figure 8. Updating of the looping variables is done in $S_1$, and then the transition from $S_1$ to $S_0$ occurs. The transition from $S_0$ to $S_2$ occurs after the incidence matrix is completely copied to the signal "weight temp". In state $S_2$, the signal "weight temp" is operated upon to obtain the pairwise shortest path matrix, with state $S_3$ enabling the looping action. The transition from $S_2$ to $S_3$ takes place after each pairwise path distance is found. Updating of the looping variables is done in $S_3$, and then the transition from $S_3$ to $S_2$ occurs. The transition from $S_2$ to $S_4$ occurs after all the pairwise shortest paths are stored in the signal "weight temp". In state $S_4$, the elements of the signal matrix "weight temp" are copied to the output matrix.

Figure 8: Shortest path solver state diagram. [State diagram; transition labels: incidence matrix copied to weight temp; loop indices updated; weight temp updated; pairwise shortest path found; all pair shortest paths found; weight temp copied to output matrix.]

The state $S_5$ enables the looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)

3.3. Multiplierless Digital Filters. Digital FIR filters and transposed IIR filters have a block of multipliers in the filter structure, as shown in Figure 9.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find a ready set $R = \{r_0, r_1, \ldots, r_m\}$ that is small, together with the A-operations composed of a minimum number of addition, subtraction, and shift operations. Once this set is obtained, multiplierless multiple constant multiplication filters can be designed with it. Multiple constant multiplication (MCM) is an efficient way of implementing several constant multiplications with the same input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming of multiplierless MCM filters is still unexplored in the literature; in this work, retiming is combined with multiplierless MCM filters, which decreases the combinational path delay.

Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and the A-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This is again retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{cub}$ algorithm is used for the computation of $G_i$. The input to the $H_{cub}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. Computing $R$ requires multiple iterations; in each iteration, an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, the set of constants at distance 1 from $R$, is given as

$$S = \{s \mid \mathrm{dist}(R, s) = 1\} = A_s(R, R) \tag{11}$$
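To make the successor set concrete, the following Python sketch enumerates the constants that are one add/subtract of shifted ready-set members away from $R$. It is a bounded illustration, not the $H_{cub}$ implementation: the helper names, the shift limit `max_shift`, the value limit `max_value`, and the restriction to odd fundamentals are assumptions of this sketch.

```python
def odd_part(x):
    """Strip trailing zero bits so every constant is kept as an odd fundamental."""
    while x and x % 2 == 0:
        x //= 2
    return x

def successors(ready, max_shift=4, max_value=1 << 10):
    """Constants at distance 1 from `ready`: one add/subtract of shifted members."""
    found = set()
    for u in ready:
        for v in ready:
            for shift in range(max_shift + 1):
                for candidate in ((u << shift) + v, abs((u << shift) - v)):
                    c = odd_part(candidate)
                    if 0 < c <= max_value and c not in ready:
                        found.add(c)
    return found

# Starting from the input itself (constant 1), one operation reaches 2^k +/- 1.
S = successors({1})
# Targets that already lie in S can be synthesized with a single operation.
targets_hit = {7, 45} & S
```

With `ready = {1}` and shifts up to 4, the sketch yields the constants 3, 5, 7, 9, 15, and 17, so the target 7 is one operation away while 45 is not.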

For the target set of constants $T$ of the considered filter graph $G$, the $H_{cub}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. The algorithm has an optimal part and a heuristic part. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized directly. If all the targets are synthesized in this way, the solution is optimal. The heuristic function $H(R, S, T)$ is used when no more targets are found in $S$, which happens when all the remaining targets are more than one A-operation away. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit,
(ii) cumulative benefit.

To build the heuristic, we define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \mathrm{dist}(R, t) - \mathrm{dist}(R + s, t) \tag{12}$$

A successor $s \in S$ that is closest to the target set needs to be picked to minimize the cost. This is possible if the A-distance can be computed or estimated. It is also useful to take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence a weighted benefit function is used, given as

$$B_b(R, s, t) = 10^{-\mathrm{dist}(R + s, t)} \left(\mathrm{dist}(R, t) - \mathrm{dist}(R + s, t)\right) \tag{13}$$

where $10^{-\mathrm{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

$$H_{cub}(R, S, T) = \arg\max_{s \in S} \left[\sum_{t \in T} B_b(R, s, t)\right] \tag{14}$$
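The selection rule of equations (13) and (14) can be illustrated with a short Python sketch. Computing the A-distance itself is outside the scope of this sketch, so the distances are supplied as tables; the function names and all distance values below are hypothetical and chosen only to show how the weighting favours a successor that helps several targets.

```python
def weighted_benefit(dist_before, dist_after):
    """Equation (13): benefit scaled by 10^(-dist(R + s, t))."""
    return 10.0 ** (-dist_after) * (dist_before - dist_after)

def pick_successor(dist_r, dist_with):
    """Equation (14): successor with the largest summed weighted benefit.

    dist_r[t] is dist(R, t); dist_with[s][t] is dist(R + s, t).
    """
    def cumulative(s):
        return sum(weighted_benefit(dist_r[t], dist_with[s][t]) for t in dist_r)
    return max(dist_with, key=cumulative)

# Hypothetical distance estimates for two targets and three candidate successors.
dist_r = {"t1": 2, "t2": 3}
dist_with = {
    "s1": {"t1": 1, "t2": 3},   # brings only t1 closer
    "s2": {"t1": 1, "t2": 2},   # brings both targets closer
    "s3": {"t1": 2, "t2": 2},   # brings only t2 closer
}
best = pick_successor(dist_r, dist_with)
```

In this example `s2` wins because it improves both targets, and the improvement on the nearer target `t1` carries ten times the weight of the improvement on `t2`.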

Here the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this method the set is calculated, and with it the multiplierless MCM based filter graph can be designed. It is found that multiplierless designs reduce the combinational path delays due to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two optimization techniques reduce the combinational delay and the critical path without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{cub}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduces the computation time when compared to previous methods [8, 16] for designing multiplierless digital filters.

Figure 10: Multiplierless MCM based 4th-order elliptic filter.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also yields a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming values for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here the 4th-order lattice filter is considered: Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter increased by 196 in the multiplierless MCM approach, as the multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
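The constraint-graph step can be sketched in Python as follows. Each inequality $r(U) - r(V) \le c$ becomes an edge from $V$ to $U$ of length $c$, an extra source node is connected to every node with zero-length edges, and the shortest distances from that source are taken as retiming values; the retimed weight of an edge $U \to V$ is then $w_r = w + r(V) - r(U)$. The 4-node DFG, the constraint values, and the function names are hypothetical and only illustrate the mechanics.

```python
import math

def solve_retiming(n, constraints):
    """Solve r[u] - r[v] <= c for nodes 1..n via all-pairs shortest paths.

    Each constraint (u, v, c) becomes an edge v -> u of length c; node 0 is an
    extra source with zero-length edges to every node. Returns None if the
    constraint graph has a negative cycle (no feasible retiming).
    """
    size = n + 1
    dist = [[math.inf] * size for _ in range(size)]
    for i in range(size):
        dist[i][i] = 0
        if i:
            dist[0][i] = 0
    for u, v, c in constraints:
        dist[v][u] = min(dist[v][u], c)
    for k in range(size):
        for i in range(size):
            for j in range(size):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    if any(dist[i][i] < 0 for i in range(size)):
        return None
    return {node: dist[0][node] for node in range(1, size)}

def retime(edges, r):
    """Retimed weight of each edge U -> V: w_r = w + r[V] - r[U]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

# Hypothetical DFG edges (U, V, delays) and combined constraint set.
edges = [(1, 3, 1), (1, 4, 2), (2, 1, 1), (3, 2, 0), (4, 2, 0)]
constraints = [(1, 3, 0), (1, 4, 1), (2, 1, 1), (3, 2, -1), (4, 2, -1)]
r = solve_retiming(4, constraints)
retimed_edges = retime(edges, r)
```

For this toy graph the solver returns $r(1) = r(3) = r(4) = -1$ and $r(2) = 0$, which moves a delay onto each of the two edges entering node 2 and leaves every retimed edge weight nonnegative.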

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which was designed as part of this research work. Initially, the filters are designed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user inputs the HDL file that was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop-down menu. The original HDL is automatically modified with respect to the chosen components; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of choice. With this, the designer can conveniently explore the solution space of possible architectures and analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are described below.

Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of the 4th-order elliptic filter.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing, and innumerable schemes have been proposed for its realization. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The operands and their product $P$ are given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \quad B = \sum_{j=0}^{n-1} B_j 2^j, \quad P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j} \tag{15}$$

In each stage the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \times n$ AND gates, $n$ half adders, and $(m - 2) \times n$ full adders.
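Equation (15) can be checked with a few lines of Python that form every one-bit partial product with an AND and accumulate it at weight $2^{i+j}$. This is a behavioural sketch only; it does not model the half-adder and full-adder array, and the function name is an assumption.

```python
def array_multiply(a, b, n):
    """Sum the n*n one-bit partial products A_i * B_j * 2^(i+j)."""
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            product += (a_i & b_j) << (i + j)   # one AND gate per (i, j) pair
    return product

# 4-bit example: 13 * 11 = 143.
p = array_multiply(13, 11, 4)
```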

Radix 4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are recoded according to the modified Booth encoder table. The Booth multiplier in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
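The halving of the partial products can be seen in a behavioural Python sketch of radix-4 Booth recoding. Overlapping groups of three bits are mapped to digits in {-2, -1, 0, 1, 2}, so a multiplier of even width `bits` gives `bits/2` digits. The function names are assumptions, and the sketch does not model the sign extension correctors or multiplexers of the hardware design.

```python
# Modified Booth encoder table for the bit triple (b[2i+1], b[2i], b[2i-1]).
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_digits(multiplier, bits):
    """Recode a two's-complement multiplier of even width into bits/2 digits."""
    padded = (multiplier & ((1 << bits) - 1)) << 1   # implicit 0 below bit 0
    return [BOOTH_TABLE[(padded >> k) & 0b111] for k in range(0, bits, 2)]

def booth_multiply(multiplicand, multiplier, bits):
    """Add one partial product (0, +/-A or +/-2A, weighted by 4^i) per digit."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum(d * multiplicand * 4 ** i for i, d in enumerate(digits))

# A 32-bit multiplier is recoded into 16 digits, hence 16 partial products.
partial_products_32 = len(booth_radix4_digits(0x12345678, 32))
```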

Vedic Multiplier. It is used for faster multiplication of operands with a larger number of bits. It has less combinational path delay [21] than the others when the bit size is large. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertically and crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.

The multipliers are designed for different bit sizes and the results are compared in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of the FIR filter. Here the ripple-carry adder, the Brent-Kung adder, and the Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and the area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading $n$ full adder blocks in series, with the carry-out of one stage fed directly to the carry-in of the next stage.
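A bit-level Python model makes the carry chain explicit. This is a behavioural sketch with assumed function names; the loop order reflects why the carry path through all $n$ stages sets the critical path delay.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out

def ripple_carry_add(x, y, n, carry_in=0):
    """n cascaded full adders; each carry-out feeds the next carry-in."""
    carry = carry_in
    result = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

# 8-bit example: 200 + 100 = 300 = 256 + 44, so the sum is 44 with carry-out 1.
sum_bits, carry_out = ripple_carry_add(200, 100, 8)
```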

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among the parallel prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. The Ling adder using a Kogge-Stone parallel prefix network also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


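Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |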

The basic equations used in parallel prefix adders are given below. The equations for bit generate and propagate are

$$
\begin{aligned}
G_{0:0} &= G_0 = c_{\mathrm{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned} \tag{16}
$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0} \tag{17}$$
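Equations (16) and (17) can be exercised with a small Python model. The sketch below uses a Kogge-Stone-style prefix network (doubling the span at each stage) as one possible arrangement; it is an assumption of this example rather than the Brent-Kung or Ling structure used in the tool, and the function name is hypothetical. Position 0 holds the carry-in, as in equation (16), and the operand bits occupy positions 1 to $n$.

```python
def prefix_add(x, y, n, carry_in=0):
    """Parallel prefix addition following equations (16) and (17)."""
    # Position 0 holds the carry-in (G_0 = c_in, P_0 = 0); bits occupy 1..n.
    g = [carry_in] + [((x >> i) & (y >> i)) & 1 for i in range(n)]
    p = [0] + [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    span = 1
    while span <= n:
        new_G, new_P = G[:], P[:]
        for i in range(span, n + 1):
            # Combine the upper group with the group that ends `span` lower.
            new_G[i] = G[i] | (P[i] & G[i - span])
            new_P[i] = P[i] & P[i - span]
        G, P = new_G, new_P
        span *= 2
    # After the last stage G[i] is the group generate G_{i:0}.
    total = 0
    for i in range(1, n + 1):
        total |= (p[i] ^ G[i - 1]) << (i - 1)   # S_i = P_i xor G_{(i-1):0}
    return total, G[n]

# 4-bit example: 15 + 1 = 16, so the sum bits are 0 with carry-out 1.
sum_bits, carry_out = prefix_add(15, 1, 4)
```

The number of stages grows with $\log_2 n$ rather than $n$, which is the source of the speed advantage over the ripple adder.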

Different adders are designed for different bit sizes and their VLSI design metrics are compared in Table 2. The reported delay is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of area, power, or speed as the constraint for digital filter HDL generation. Along with this, an option is created for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI is shown in Figure 12. An $H_{cub}$ based algorithm is used for implementing MCM blocks in multiplierless digital filters for the corresponding user-defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, these filters are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of the different components, and this depth is assumed to be infinite (unconstrained) in our work. The DiFiDOT tool automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be repeated because of a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing the path solver algorithms on FPGA is to speed up the results for retiming. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, including recursive and nonrecursive filter structures. The target device for the path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.


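Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |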

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Logic utilization | Used | Min period in ns | Setup time in ns | Hold time in ns | Max frequency (Hz) |
|---|---|---|---|---|---|---|
| Critical path solver | Number of slices | 5804 | 9068 | 1572 | 6141 | 110277 |
| | Number of LUTs | 10462 | | | | |
| | Number of slice flip-flops | 3664 | | | | |
| Shortest path solver | Number of slices | 4147 | 14089 | 10477 | 4114 | 70978 |
| | Number of LUTs | 7511 | | | | |
| | Number of slice flip-flops | 1496 | | | | |

Here various IIR and FIR filters are considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in much less time than the general purpose processor based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the algorithms run on the general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, which helps speed up the results that are crucial for retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that the clock period is reduced after retiming. The register count changes depending on the iteration bound of the filter. Here three models are considered: Model1 is the filter without retiming, built with adder, subtractor, multiplier, and delay elements; Model2 is the retimed filter based on the clock period minimization algorithm; Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of orders 2 to 12; series: clock period and register count for Models 1 to 3).

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, any of the available adders and multipliers can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

| Filter order | Critical path solver, IIR filter, FPGA based (ns) | Critical path solver, IIR filter, GPP based (ms) | Critical path solver, FIR filter, FPGA based (ns) | Critical path solver, FIR filter, GPP based (ms) | Shortest path solver, IIR filter, FPGA based (ns) | Shortest path solver, IIR filter, GPP based (ms) | Shortest path solver, FIR filter, FPGA based (ns) | Shortest path solver, FIR filter, GPP based (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 460 | 138 | 906 | 1283 | 278 | 305 | 305 | 1280 |
| 4 | 1571 | 1578 | 1631 | 1446 | 368 | 1391 | 1391 | 1319 |
| 6 | 2998 | 1918 | 1923 | 1547 | 398 | 1542 | 1542 | 1731 |
| 8 | 3162 | 2190 | 2971 | 1642 | 452 | 2523 | 2523 | 3294 |
| 10 | 3981 | 2627 | 3653 | 1861 | 536 | 4293 | 4293 | 4534 |
| 12 | 4672 | 3142 | 4328 | 2352 | 671 | 5534 | 5534 | 5161 |

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

| Filter block | Adder/multiplier combination | Before retiming: number of LUTs | Before retiming: max operating freq in MHz | Before retiming: power in mW | After retiming: number of LUTs | After retiming: max operating freq in MHz | After retiming: power in mW |
|---|---|---|---|---|---|---|---|
| IIR-10 | Brent-Kung adder/array multiplier | 2222 | 62526 | 99 | 2411 | 76977 | 89 |
| IIR-10 | Ling adder/Vedic multiplier | 2214 | 69702 | 112 | 2193 | 95381 | 94 |
| IIR-10 | Ripple carry adder/Booth multiplier | 2146 | 50861 | 114 | 1809 | 65248 | 95 |
| FIR-10 | Brent-Kung adder/array multiplier | 1736 | 62526 | 94 | 1811 | 9943 | 85 |
| FIR-10 | Ling adder/Vedic multiplier | 2162 | 72493 | 111 | 2271 | 10072 | 95 |
| FIR-10 | Ripple carry adder/Booth multiplier | 1637 | 52302 | 105 | 1615 | 71345 | 87 |

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements,
(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach,
(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits before and after retiming is retained. The frequency improvement seen for the different filters with the above models is given in Figure 14. It is seen that the operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor (x-axis: filter type, FIR-2 to FIR-12 and IIR-2 to IIR-12; y-axis: frequency improvement; series: frequency improvement from model 1 to model 2 and from model 1 to model 3).

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for removing the noise. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Table 6: Comparison of area, delay, and power for different models of various digital filters.

| Filter block | Adders, multipliers, flip-flops (Model 1) | Adders, multipliers, flip-flops (Model 2) | Adders, multipliers, flip-flops (Model 3) | Delay/max freq in MHz (Model 1) | Delay/max freq in MHz (Model 2) | Delay/max freq in MHz (Model 3) | Power in watts (Model 1) | Power in watts (Model 2) | Power in watts (Model 3) |
|---|---|---|---|---|---|---|---|---|---|
| FIR-2 | 523 | 503 | 504 | 5154 | 19214 | 34062 | 0056 | 0063 | 0065 |
| FIR-4 | 1035 | 1105 | 1108 | 5941 | 10841 | 22204 | 0047 | 0057 | 0060 |
| FIR-6 | 727 | 1707 | 17014 | 6291 | 6764 | 25947 | 0051 | 0062 | 0064 |
| FIR-8 | 1559 | 2209 | 22016 | 5482 | 6592 | 11791 | 0054 | 0058 | 0065 |
| FIR-10 | 18611 | 25011 | 25011 | 4822 | 5637 | 10072 | 0058 | 0061 | 0063 |
| FIR-12 | 20713 | 29013 | 29013 | 4634 | 5486 | 19340 | 0060 | 0063 | 0067 |
| IIR-2 | 943 | 1103 | 1103 | 5503 | 7553 | 8910 | 0047 | 0050 | 0050 |
| IIR-4 | 1675 | 2005 | 1906 | 2278 | 11388 | 15165 | 0059 | 0062 | 0063 |
| IIR-6 | 24117 | 3507 | 3508 | 3871 | 4254 | 53142 | 0051 | 0059 | 0058 |
| IIR-8 | 301110 | 33010 | 3307 | 2946 | 7014 | 11021 | 0044 | 0064 | 0081 |
| IIR-10 | 371613 | 54013 | 54014 | 3643 | 4885 | 95381 | 0051 | 0067 | 0085 |
| IIR-12 | 422017 | 63017 | 63019 | 3973 | 5074 | 10152 | 0063 | 0071 | 0088 |

Figure 15: Structure of ECG block for power noise removal: (a) block diagram and (b) filter block expanded.
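As a rough illustration of power-line noise removal, the following Python sketch applies a second-order IIR notch filter to a 50 Hz tone and to a slow 5 Hz tone. The text does not specify the filter type, order, or sampling rate, so the notch structure, the pole radius `r = 0.95`, the 500 Hz sampling rate, and the floating-point coefficients are all assumptions of this sketch; it is not the retimed multiplierless MCM filter of the paper.

```python
import math

def notch_coefficients(f0, fs, r=0.95):
    """Zeros on the unit circle at f0, poles at radius r on the same angle."""
    w0 = 2.0 * math.pi * f0 / fs
    b = [1.0, -2.0 * math.cos(w0), 1.0]
    a = [1.0, -2.0 * r * math.cos(w0), r * r]
    return b, a

def biquad(b, a, samples):
    """Direct form I second-order IIR section."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

fs = 500.0                                   # assumed sampling rate in Hz
b, a = notch_coefficients(50.0, fs)
hum = [math.sin(2 * math.pi * 50.0 * n / fs) for n in range(1000)]
slow = [math.sin(2 * math.pi * 5.0 * n / fs) for n in range(1000)]
# Peak amplitude over the last 100 samples, after the start-up transient.
hum_tail = max(abs(v) for v in biquad(b, a, hum)[-100:])
slow_tail = max(abs(v) for v in biquad(b, a, slow)[-100:])
```

Once the transient has decayed, the 50 Hz tone is suppressed almost completely while the 5 Hz tone passes with a gain close to one, which is the behaviour wanted from a power-line interference filter.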

6. Conclusions

In this paper we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication, and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves a lot of the designers' time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and the shortest path in the retiming computation, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain compared with filter designs in which the path solver algorithms run on the processor as part of the retiming algorithm. It is observed that a designer can find the synthesizable digital filter RTL that fits best in an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456-467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373-398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358-375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551-555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1-21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1-6, pp. 5-35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371-375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597-601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831-844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480-485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150-155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674-2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74-83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449-453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738-748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273-278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001-1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65-II68, Busan, Republic of Korea, November 2008.

[22] G Dimitrakopoulos and D Nikolos ldquoHigh-speed parallel-prefix VLSI ling addersrdquo IEEE Transactions on Computers vol54 no 2 pp 225ndash231 2005

[23] L Aksoy E da Costa P Flores and J Monteiro ldquoExact andapproximate algorithms for the optimization of area and delayin multiple constant multiplicationsrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol27 no 6 pp 1013ndash1026 2008

[24] M N Mneimneh K A Sakallah and J Moondanos ldquoPre-serving synchronizing sequences of sequential circuits afterretimingrdquo in Proceedings of the Asia and South Pacifi c DesignAutomation Conference pp 579ndash584 IEEE Press 2004

[25] D Yagain and K A Vijaya ldquoFir filter design based on retimingand automation using vlsi design metricsrdquo in Proceedings of theInternational Conference on Technology Informatics Manage-ment Engineering and Environment (TIME-E rsquo13) pp 17ndash22IEEE 2013



Figure 8: Shortest path solver state diagram. States: S0, S1, S2, S3, S4. Transition labels: incidence matrix copied to weight temp; weight temp updated; loop indices updated; pairwise shortest path found; all-pair shortest paths found; weight temp copied to output matrix.

The state $S_5$ enables the looping action for $S_4$. The transition from $S_4$ to $S_0$ occurs after the output matrix is available with all the pairwise shortest paths. The state machine is then initialized and awaits new inputs.

SPM =

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

inf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3

inf inf inf inf 3 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 3 2 1 inf inf inf inf inf 1 inf inf infinf inf inf inf 2 1 3 2 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 3 inf inf inf inf inf 3 inf inf infinf inf inf inf 0 2 1 0 inf inf inf inf inf 0 inf inf infinf inf inf inf 1 0 2 1 inf inf inf inf inf 1 inf inf inf1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

inf inf inf inf 2 1 0 1 inf inf inf inf inf 2 inf inf infinf inf inf inf 3 2 1 0 inf inf inf inf inf 3 inf inf inf3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 3

4 3 2 1 4 4 4 4 4 4 4 4 4 4 4 4 4

3 2 1 0 3 3 2 1 3 3 3 3 3 3 0 3 3

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

]

(10)
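The state sequence of the shortest path solver can be mirrored in software. The following is a minimal sketch, assuming the solver follows the textbook Floyd-Warshall recurrence; the function name `all_pairs_shortest_paths`, the variable `weight_temp`, and the 3-node example graph are illustrative and are not taken from the FPGA design.

```python
import math

INF = math.inf  # "inf" entries of the matrix: no path known


def all_pairs_shortest_paths(incidence):
    """Floyd-Warshall on an n x n weight matrix (INF where there is no edge)."""
    n = len(incidence)
    # Incidence matrix copied to weight temp.
    weight_temp = [row[:] for row in incidence]
    # Loop indices updated; weight temp updated whenever a shorter route via k exists.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = weight_temp[i][k] + weight_temp[k][j]
                if via_k < weight_temp[i][j]:
                    weight_temp[i][j] = via_k
    # Weight temp copied to the output matrix.
    return [row[:] for row in weight_temp]


# Hypothetical 3-node graph: 0 -> 1 (weight 1), 1 -> 2 (weight 2), 2 -> 0 (weight 0).
example = [
    [INF, 1, INF],
    [INF, INF, 2],
    [0, INF, INF],
]
spm = all_pairs_shortest_paths(example)
```

The triple loop is the source of the $O(n^3)$ cost, which is the part the FPGA block is meant to accelerate.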

3.3. Multiplierless Digital Filters. The digital FIR filters and the transposed IIR filters have a block of multipliers in the filter structure, as shown in Figure 9.

Figure 9: General structure of the MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

For a target set $T = \{t_1, t_2, \ldots, t_n\}$ in a digital filter, we have to find the smallest ready set $R = \{r_0, r_1, \ldots, r_m\}$ and the A-operations, composed of a minimum number of addition, subtraction, and shift operations. Once this set is obtained, multiplierless multiple constant multiplication filters can be designed with it. Multiple constant multiplication (MCM) is an efficient way of implementing several constant multiplications with the same input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low-complexity implementation.

Retiming for multiplierless MCM filters is still unexplored in the literature; the authors have combined retiming with multiplierless MCM filters, which decreases the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and the A-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This graph is again retimed to increase the speed of $G_i$ by modifying the critical path of the filter. The graph after retiming of the multiplierless MCM filter is denoted $G_r$. In the present work, the Hcub algorithm is used for the computation of $G_i$. The input to the Hcub algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. The computation of $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, which is the set of constants at distance 1 from $R$, is given as

\[
S = \{\, s \mid \mathrm{dist}(R, s) = 1 \,\} = A_s(R, R) \tag{11}
\]
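To make the successor set concrete, the sketch below enumerates constants that are one operation away from a ready set. It assumes a simplified A-operation of the form $|2^{l} u \pm v|$ followed by removal of trailing zero bits, with results limited to `bits + 1` bits; the exact A-operation definition is the one in [18], and the names `make_odd` and `successor_set` are hypothetical.

```python
def make_odd(x):
    """Strip trailing zero bits, since shifts cost no adders."""
    while x > 0 and x % 2 == 0:
        x //= 2
    return x


def successor_set(ready, bits):
    """Odd constants reachable from `ready` with one add/subtract of shifted terms."""
    limit = 1 << (bits + 1)          # assumed bound on intermediate constants
    found = set()
    for u in ready:
        for v in ready:
            for shift in range(bits + 2):
                # u and v range over all ordered pairs, so shifting only u suffices.
                for candidate in ((u << shift) + v, abs((u << shift) - v)):
                    c = make_odd(candidate)
                    if 0 < c <= limit:
                        found.add(c)
    return found - set(ready)


# Starting from R = {1} with 4-bit constants.
print(sorted(successor_set({1}, 4)))
```

Starting from $R = \{1\}$, every successor has the form $2^{l} \pm 1$, so a constant such as 11 is not reachable in one step and needs an intermediate fundamental.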

For the target set of constants $T$ of the considered filter graph $G$, the Hcub algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. The algorithm has an optimal part and a heuristic part. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized directly. If all the targets are synthesized in this way, the solution is optimal. The heuristic function $H(R, S, T)$ of the algorithm is used when no more targets are found in $S$, which happens when all the remaining targets are more than one A-operation away. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit
(ii) cumulative benefit

To build the heuristic, we define the benefit function $B(R, s, t)$ as

\[
B(R, s, t) = \mathrm{dist}(R, t) - \mathrm{dist}(R + s, t) \tag{12}
\]

A successor $s \in S$ that is closest to the target set needs to be picked to minimize the cost. This is possible if we can compute or estimate the A-distance. It is also useful to take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we use a weighted benefit function given as

\[
B_b(R, s, t) = 10^{-\mathrm{dist}(R + s, t)} \left( \mathrm{dist}(R, t) - \mathrm{dist}(R + s, t) \right) \tag{13}
\]

where $10^{-\mathrm{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence the heuristic function for cumulative benefit is given by

\[
H_{\mathrm{cub}}(R, S, T) = \arg\max_{s \in S} \sum_{t \in T} B_b(R, s, t) \tag{14}
\]
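The cumulative benefit selection can be sketched as follows. This is only an illustration of equations (13) and (14): the distance function is supplied by the caller, and the small table `TOY_DIST` holds hand-worked illustrative distances for the targets 45 and 75, not values produced by the Hcub distance estimator of [18]. All function names are hypothetical.

```python
def weighted_benefit(dist, ready, s, t):
    """Equation (13): benefit of adding s, damped for targets that remain far away."""
    before = dist(ready, t)
    after = dist(ready | {s}, t)
    return 10.0 ** (-after) * (before - after)


def pick_successor(dist, ready, successors, targets):
    """Equation (14): successor with the largest summed weighted benefit."""
    remaining = [t for t in targets if t not in ready]
    return max(
        sorted(successors),  # sorted so that ties resolve deterministically
        key=lambda s: sum(weighted_benefit(dist, ready, s, t) for t in remaining),
    )


# Illustrative distances: (ready set, target) -> number of operations still needed.
TOY_DIST = {
    (frozenset({1}), 45): 2, (frozenset({1}), 75): 2,
    (frozenset({1, 3}), 45): 1, (frozenset({1, 3}), 75): 2,
    (frozenset({1, 5}), 45): 1, (frozenset({1, 5}), 75): 1,
    (frozenset({1, 7}), 45): 2, (frozenset({1, 7}), 75): 2,
}


def toy_dist(ready, t):
    return TOY_DIST[(frozenset(ready), t)]


best = pick_successor(toy_dist, frozenset({1}), {3, 5, 7}, {45, 75})
```

In this toy setting the successor 5 helps both targets ($45 = 9 \cdot 5$ and $75 = 15 \cdot 5$), so its summed benefit is larger than that of 3, which only helps 45.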

Here the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this method the ready set is calculated, and with it the multiplierless MCM based filter graph can be designed. It is found that multiplierless designs reduce the combinational path delay owing to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two optimization techniques reduce the combinational delay and the critical path without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the Hcub algorithm is shown in Figure 10. It is seen from synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, with a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduce the computation time when compared to previous methods [8, 16] for designing multiplierless digital filters.

Figure 10: Multiplierless MCM based 4th-order elliptic filter (adders, subtractors, unit delays, and left-shift blocks of 1, 2, 3, 5, and 6 bits).

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also provides a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming value for each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can then be calculated from the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered. Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter increases by 19.6% with the multiplierless MCM approach, as the multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming the operating frequency increases by 35.4%, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
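The constraint-graph step described above can be sketched in a few lines. The sketch assumes the standard formulation from [8]: every constraint has the form $r(U) - r(V) \le c$, it becomes an edge $V \to U$ of length $c$ in the constraint graph, an extra source node is connected to all nodes with length 0, and the retimed weight of an edge $U \to V$ is $w_r(e) = w(e) + r(V) - r(U)$. The function names and the 3-node example, including its single critical path constraint, are hypothetical.

```python
import math


def solve_retiming(num_nodes, constraints):
    """constraints: (u, v, c) triples meaning r[u] - r[v] <= c. Returns r, or None if infeasible."""
    n = num_nodes + 1                 # one extra source node, index num_nodes
    src = num_nodes
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, c in constraints:
        dist[v][u] = min(dist[v][u], c)   # constraint graph edge v -> u of length c
    for i in range(num_nodes):
        dist[src][i] = 0
    for k in range(n):                    # Floyd-Warshall
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    if any(dist[i][i] < 0 for i in range(n)):
        return None                       # negative cycle: no retiming satisfies the constraints
    return [dist[src][i] for i in range(num_nodes)]


def retime(edges, r):
    """edges: (u, v, w) with w delays on edge u -> v; returns the retimed edge weights."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


# Hypothetical 3-node loop with two delays.
edges = [(0, 1, 1), (1, 2, 0), (2, 0, 1)]
feasibility = [(u, v, w) for u, v, w in edges]   # r(U) - r(V) <= w(e)
critical = [(1, 2, -1)]                          # assumed: force a delay onto edge 1 -> 2
r = solve_retiming(3, feasibility + critical)
retimed_edges = retime(edges, r)
```

The total number of delays around the loop is unchanged by the retiming; only their position moves, which is what alters the critical path.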

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) along with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.
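As a small worked example of a design that uses only adders and shifters, the sketch below multiplies one input by two constants while sharing an intermediate result. The constants 5 and 45 and the function name `mcm_5_45` are chosen only for illustration and are not coefficients of the filters in this paper.

```python
def mcm_5_45(x):
    """Multiply x by 5 and by 45 using two adders and two shifts, with no multiplier."""
    x5 = (x << 2) + x        # 5x  = 4x + x
    x45 = (x5 << 3) + x5     # 45x = 8 * (5x) + 5x, reusing the shared term 5x
    return x5, x45


for sample in (-3, 0, 7):
    print(sample, mcm_5_45(sample))
```

Computing $45x$ on its own from $x$ would also need two adders, so sharing $5x$ gives the second constant at no extra adder cost.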

Figure 11: Comparison of operating frequency (in MHz) and number of registers for different filter designs (Design1 to Design4) of the 4th-order elliptic filter.

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which is designed as part of the research work. Initially the design of filters is performed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for the same. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file that was automatically generated after retiming for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop-down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of choice. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are described below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

\[
A = \sum_{i=0}^{n-1} A_i 2^{i}, \qquad
B = \sum_{j=0}^{n-1} B_j 2^{j}, \qquad
P = A \times B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j} \tag{15}
\]

In each stage the partial products $P_i$ are generated; these are added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \times n$ AND gates, $n$ half adders, and $(m - 2) \times n$ full adders.
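Equation (15) and the cell counts can be checked with a short behavioural model. This is a sketch of the arithmetic only, not of the gate-level array; the function names are hypothetical and unsigned operands are assumed.

```python
def array_multiply(a, b, n):
    """Sum the partial products A_i * B_j * 2^(i+j) of two unsigned n-bit numbers."""
    product = 0
    for j in range(n):                      # one row of partial products per bit of b
        b_j = (b >> j) & 1
        row = 0
        for i in range(n):
            a_i = (a >> i) & 1
            row |= (a_i & b_j) << i         # each bit product is one AND gate
        product += row << j                 # rows are accumulated by the adder array
    return product


def array_multiplier_cells(m, n):
    """Cell counts for an m x n array multiplier as stated in the text."""
    return {"and_gates": m * n, "half_adders": n, "full_adders": (m - 2) * n}


print(array_multiply(13, 11, 4))
print(array_multiplier_cells(4, 4))
```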

Radix-4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four modified Booth encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple-carry adder. The Booth multiplier technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
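The halving of the partial products can be illustrated with the standard radix-4 recoding, in which each overlapping group of three multiplier bits maps to a digit in $\{-2, -1, 0, 1, 2\}$. The sketch below is a behavioural model under the assumption of an $n$-bit two's-complement multiplier with even $n$; it does not model the encoders, sign extension correctors, or multiplexers of the actual design, and the function names are hypothetical.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's-complement multiplier b (n even) into n/2 signed digits."""
    def bit(k):
        return 0 if k < 0 else (b >> k) & 1   # the bit to the right of the LSB is 0
    # Digit for the bit triple (b[i+1], b[i], b[i-1]) is b[i-1] + b[i] - 2*b[i+1].
    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, n, 2)]


def booth_multiply(a, b, n):
    """Add one partial product (digit * a, weighted by 4^k) per recoded digit."""
    return sum((d * a) << (2 * k) for k, d in enumerate(booth_radix4_digits(b, n)))


print(booth_radix4_digits(7, 4))                # digits for 7 = -1 + 2*4
print(len(booth_radix4_digits(12345, 32)))      # 32-bit multiplier: 16 partial products
print(booth_multiply(13, -3, 4))
```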

Vedic Multiplier. It is used for faster multiplication at higher bit widths. It has less combinational path delay [21] than the others when the bit size is higher. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertically and crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.
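The vertical and crosswise pattern can be sketched column by column: output bit position `col` collects every bit product $A_i B_j$ with $i + j = \text{col}$, and the carry moves to the next column. The model below is sequential, so it shows only the arithmetic and not the concurrency that gives the hardware its speed; the function name is hypothetical and unsigned operands are assumed.

```python
def urdhva_multiply(a, b, n):
    """Column-wise (vertically and crosswise) product of two unsigned n-bit numbers."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry = 0, 0
    for col in range(2 * n - 1):
        # Crosswise products whose bit positions add up to `col`.
        column_sum = carry + sum(
            a_bits[i] & b_bits[col - i]
            for i in range(max(0, col - n + 1), min(col, n - 1) + 1)
        )
        result |= (column_sum & 1) << col
        carry = column_sum >> 1
    return result | (carry << (2 * n - 1))   # the final carry is the top product bit


print(urdhva_multiply(13, 11, 4))
```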

The different multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of the FIR filter. Here the ripple-carry adder, the Brent-Kung adder, and the Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage. An $n$-bit adder requires $n$ full adders.
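The cascade can be written down directly as a bit-level model. This is a behavioural sketch with hypothetical function names; the loop makes visible why the delay grows linearly with $n$, since each stage waits for the carry of the previous one.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, n, cin=0):
    """n cascaded full adders; the carry ripples from bit 0 up to bit n-1."""
    total, carry = 0, cin
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry          # n-bit sum and the final carry-out


print(ripple_carry_add(200, 100, 8))   # 300 does not fit in 8 bits, so carry-out is set
```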

Parallel-Prefix Adders. Parallel-prefix adders [22] offer a highly efficient solution to the binary addition problem. Among the parallel-prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. The Ling adder using a Kogge-Stone parallel-prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.

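Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay in ns, 32 bit | Delay in ns, 16 bit | Delay in ns, 8 bit | Power in mW, 32 bit | Power in mW, 16 bit | Power in mW, 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |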

The basic equations used in parallel-prefix adders are given below. The equations of bit generate and propagate are

\[
\begin{aligned}
G_{0:0} &= G_0 = c_{\mathrm{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned}
\tag{16}
\]

The sum generation is given by

\[
S_i = P_i \oplus G_{(i-1):0} \tag{17}
\]
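The prefix recurrences can be exercised with a small software model. The sketch below uses a Kogge-Stone style network, chosen here only because it is the shortest to write; it is not the Brent-Kung or Ling structure evaluated in Table 2. Bits are indexed from 0 and the carry-in is folded into the generate signal of bit 0, which differs slightly from the indexing of equation (16). The function name is hypothetical.

```python
def kogge_stone_add(a, b, n, cin=0):
    """n-bit addition using parallel-prefix generate/propagate combination."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]   # bit generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # bit propagate
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)            # fold the carry-in into position 0
    d = 1
    while d < n:                          # log2(n) prefix levels
        new_G, new_P = G[:], P[:]
        for i in range(d, n):
            new_G[i] = G[i] | (P[i] & G[i - d])   # G_{i:j} = G_{i:k} + P_{i:k} G_{k-1:j}
            new_P[i] = P[i] & P[i - d]            # P_{i:j} = P_{i:k} P_{k-1:j}
        G, P = new_G, new_P
        d *= 2
    carries = [cin] + G[:-1]              # carry into bit i is the group generate below it
    total = sum((p[i] ^ carries[i]) << i for i in range(n))   # S_i = P_i xor carry
    return total, G[-1]


print(kogge_stone_add(200, 100, 8))
```

Unlike the ripple adder, all carries are available after $\log_2 n$ combining levels, which is the source of the speed advantage at the cost of more logic.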

Different adders are designed for different bit sizes and their VLSI design metrics are compared, as shown in Table 2. The delay reported is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is created for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter with combinational elements of choice that are specific to a particular design constraint and generate the RTL for the same. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI is shown in Figure 12.

An Hcub based algorithm is used for implementing MCM blocks in multiplierless digital filters for the specific user-defined option in DiFiDOT. Since all the multipliers can be realised as a block in transposed IIR and FIR filters, they are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of different components, and in our work this depth is assumed to be infinity. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be reiterated due to a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor on which the retiming algorithm is implemented. The computations are performed in the FPGA based block; the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.
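For reference, the quantity computed by the critical path solver can be stated in software as the longest total computation time over paths that contain no delay element. The following is a minimal sketch under that definition, assuming the zero-delay part of the DFG has no cycles; it is not the FPGA architecture, and the function name and example graph are hypothetical.

```python
def critical_path(times, edges):
    """times[u]: computation time of node u; edges: (u, v, delays) for each edge u -> v."""
    successors = {u: [] for u in range(len(times))}
    for u, v, delays in edges:
        if delays == 0:                   # only delay-free edges extend a combinational path
            successors[u].append(v)
    memo = {}

    def longest_from(u):
        if u not in memo:
            memo[u] = times[u] + max(
                (longest_from(v) for v in successors[u]), default=0
            )
        return memo[u]

    return max(longest_from(u) for u in range(len(times)))


# Hypothetical DFG: two adders (1 time unit each) and one multiplier (2 time units).
times = [1, 1, 2]
before = critical_path(times, [(0, 1, 0), (1, 2, 1), (2, 0, 0)])
after = critical_path(times, [(0, 1, 0), (1, 2, 1), (2, 0, 1)])
```

In the example, placing a delay on the edge from the multiplier to the first adder shortens the longest delay-free path from 4 to 2 time units.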

Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns, 32 bit | Delay in ns, 16 bit | Delay in ns, 8 bit | Power in mW, 32 bit | Power in mW, 16 bit | Power in mW, 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Logic utilization | Used | Min period (ns) | Setup time (ns) | Hold time (ns) | Max frequency (MHz) |
|---|---|---|---|---|---|---|
| Critical path solver | Number of slices | 5804 | 9.068 | 1.572 | 6.141 | 110.277 |
| | Number of LUTs | 10462 | | | | |
| | Number of slice flip-flops | 3664 | | | | |
| Shortest path solver | Number of slices | 4147 | 14.089 | 10.477 | 4.114 | 70.978 |
| | Number of LUTs | 7511 | | | | |
| | Number of slice flip-flops | 1496 | | | | |

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in much less time than the GPP based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the algorithms run on the general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, which helps to speed up retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Techniques. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the filter's iteration bound. Here three models are considered: Model1 is the filter without retiming, built with adder, subtractor, multiplier, and delay elements; Model2 is the retimed filter based on the clock period minimization algorithm; and Model3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of order 2 to 12; series: clock period and register count for Models 1, 2, and 3).

Table 4: Computation time comparison.

| Filter order | Critical path solver, IIR filter, FPGA based (ns) | Critical path solver, IIR filter, GPP based (ms) | Critical path solver, FIR filter, FPGA based (ns) | Critical path solver, FIR filter, GPP based (ms) | Shortest path solver, IIR filter, FPGA based (ns) | Shortest path solver, IIR filter, GPP based (ms) | Shortest path solver, FIR filter, FPGA based (ns) | Shortest path solver, FIR filter, GPP based (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 460 | 138 | 906 | 1283 | 278 | 305 | 305 | 1280 |
| 4 | 1571 | 1578 | 1631 | 1446 | 368 | 1391 | 1391 | 1319 |
| 6 | 2998 | 1918 | 1923 | 1547 | 398 | 1542 | 1542 | 1731 |
| 8 | 3162 | 2190 | 2971 | 1642 | 452 | 2523 | 2523 | 3294 |
| 10 | 3981 | 2627 | 3653 | 1861 | 536 | 4293 | 4293 | 4534 |
| 12 | 4672 | 3142 | 4328 | 2352 | 671 | 5534 | 5534 | 5161 |

4.3. Area, Power, and Timing Results for Digital Filters before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, the user can choose any of the available components for FIR/IIR digital filter HDL generation with specific combinational components.

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

| Filter block | Adder/multiplier combination | Before retiming: number of LUTs | Before retiming: max operating freq in MHz | Before retiming: power in mW | After retiming: number of LUTs | After retiming: max operating freq in MHz | After retiming: power in mW |
|---|---|---|---|---|---|---|---|
| IIR-10 | Brent-Kung adder, array multiplier | 2222 | 62526 | 99 | 2411 | 76977 | 89 |
| IIR-10 | Ling adder, Vedic multiplier | 2214 | 69702 | 112 | 2193 | 95381 | 94 |
| IIR-10 | Ripple-carry adder, Booth multiplier | 2146 | 50861 | 114 | 1809 | 65248 | 95 |
| FIR-10 | Brent-Kung adder, array multiplier | 1736 | 62526 | 94 | 1811 | 9943 | 85 |
| FIR-10 | Ling adder, Vedic multiplier | 2162 | 72493 | 111 | 2271 | 10072 | 95 |
| FIR-10 | Ripple-carry adder, Booth multiplier | 1637 | 52302 | 105 | 1615 | 71345 | 87 |

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits before and after retiming is retained. The frequency improvement seen for different filters with the above models is given in Figure 14. It is seen that the frequency is improved when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor (x-axis: filter type, FIR-2 to FIR-12 and IIR-2 to IIR-12; y-axis: frequency improvement; series: frequency improvement from Model 1 to Model 2 and from Model 1 to Model 3).

Table 6: Comparison of area, delay, and power for different models of various digital filters.

Filter block   Adder/multipliers/flip-flops     Delay (max. freq. in MHz)      Power in watts
               Model 1   Model 2   Model 3      Model 1   Model 2   Model 3    Model 1   Model 2   Model 3
FIR-2          523       503       504          5154      19214     34062      0056      0063      0065
FIR-4          1035      1105      1108         5941      10841     22204      0047      0057      0060
FIR-6          727       1707      17014        6291      6764      25947      0051      0062      0064
FIR-8          1559      2209      22016        5482      6592      11791      0054      0058      0065
FIR-10         18611     25011     25011        4822      5637      10072      0058      0061      0063
FIR-12         20713     29013     29013        4634      5486      19340      0060      0063      0067
IIR-2          943       1103      1103         5503      7553      8910       0047      0050      0050
IIR-4          1675      2005      1906         2278      11388     15165      0059      0062      0063
IIR-6          24117     3507      3508         3871      4254      53142      0051      0059      0058
IIR-8          301110    33010     3307         2946      7014      11021      0044      0064      0081
IIR-10         371613    54013     54014        3643      4885      95381      0051      0067      0085
IIR-12         422017    63017     63019        3973      5074      10152      0063      0071      0088

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians for the interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on an FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

[Figure 15 diagram: blocks labelled ECG 1, Noise 1, Add 1, Filter 2, and Scope 3; expanded filter block with ports In 1 and Out 8.]

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded.
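As an illustration of how a digital filter can suppress power-line interference, the following is a minimal Python sketch of a second-order IIR notch filter. It is not the retimed multiplierless MCM filter used in this work; the notch structure, the 500 Hz sampling rate, the pole radius `r`, and all function names are assumptions chosen only for the example.

```python
import math

def notch_coefficients(f0, fs, r=0.95):
    """Second-order notch: zeros on the unit circle at f0, poles at radius r (assumed value)."""
    w0 = 2.0 * math.pi * f0 / fs
    b = [1.0, -2.0 * math.cos(w0), 1.0]
    a = [1.0, -2.0 * r * math.cos(w0), r * r]
    return b, a

def iir_filter(b, a, x):
    """Difference equation y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

def tone(freq, fs, n_samples):
    """Unit-amplitude sine wave used as a stand-in test signal."""
    return [math.sin(2.0 * math.pi * freq * n / fs) for n in range(n_samples)]

fs = 500.0                                        # assumed sampling rate in Hz
b, a = notch_coefficients(50.0, fs)               # notch at the 50 Hz power-line frequency
hum_out = iir_filter(b, a, tone(50.0, fs, 2000))
low_out = iir_filter(b, a, tone(5.0, fs, 2000))
hum_peak = max(abs(v) for v in hum_out[-200:])    # residual 50 Hz after the start-up transient
low_peak = max(abs(v) for v in low_out[-200:])    # a 5 Hz component passes almost unchanged
```

In this sketch the 50 Hz tone is removed once the start-up transient has decayed, while a low-frequency component is passed with a gain close to one. A narrower notch can be obtained by moving `r` closer to 1, at the cost of a longer transient.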

6. Conclusions

In this paper, we introduced a retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using the register sharing and register minimization retiming algorithms. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. The tool directly produces synthesizable filter RTL, which saves much of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing the path solver architectures for the critical path and shortest path computations on an FPGA, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant gain in computation time compared with filter designs in which the path solver algorithms run as part of the retiming algorithm on the processor. It is observed that a designer can find the synthesizable digital filter RTL that best fits an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, “Optimizing sequential cycles through Shannon decomposition and retiming,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, “Retiming-based factorization for sequential logic optimization,” ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, “A systematic approach for design of digit-serial signal processing architectures,” IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, “Design optimization platform for synthesizable high speed digital filters using retiming technique,” in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, “Retiming: theory and practice,” Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, “Retiming synchronous circuitry,” Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, “Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, “Hierarchical folding and synthesis of iterative data flow graphs,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, “Efficient retiming of multirate DSP algorithms,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, “Retiming for synchronous data flow graphs,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, “Optimizing DSP flow graphs via schedule-based multidimensional retiming,” IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, “Retiming and resynthesis: a complexity perspective,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, “Efficient retiming of large circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, “High speed digital filter design using register minimization retiming & parallel prefix adders,” in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, “An efficient algorithm for performance-optimal FPGA technology mapping with retiming,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, “FPGA based path solvers for DFGs in high level synthesis,” in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, “Multiple constant multiplication for digit-serial implementation of low power FIR filters,” WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, “Design of high-speed adders for efficient digital design blocks,” ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, “Multiplier design based on ancient Indian Vedic mathematics,” in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI Ling adders,” IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, “Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, “Preserving synchronizing sequences of sequential circuits after retiming,” in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, “FIR filter design based on retiming and automation using VLSI design metrics,” in Proceedings of the International Conference on Technology, Informatics, Management, Engineering, and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


[Figure 9 diagrams: input, multiplier block(s), adders, $z^{-1}$ delay elements, and output for each of the three structures.]

Figure 9: General structure of MCM block for (a) FIR filter, (b) transposed direct form-I IIR filter, and (c) transposed direct form-II IIR filter.

several constant multiplications with the input data [18, 19]. The coefficients are implemented using shifts, adders, and subtracters. By removing the redundancy between the coefficients, the number of adders and subtracters is reduced, which results in a low complexity implementation. Retiming of multiplierless MCM filters is still unexplored in the literature; the authors have combined retiming with multiplierless MCM filters, which decreases the combinational path delay. For a filter graph $G$, a multiplierless MCM filter can be designed using the target set and $A$-operations, and the multiplierless MCM filter graph $G_i$ is obtained. This graph is then retimed to increase the speed performance of $G_i$ by modifying the critical path of the filter. The graph obtained after retiming the multiplierless MCM filter is denoted $G_r$. In the present work, the $H_{\text{cub}}$ algorithm is used for computing $G_i$. The input to the $H_{\text{cub}}$ algorithm is the target set $T$, and the algorithm computes a ready set $R$, which is the output solution. Computing $R$ requires multiple iterations, and in each iteration an element of the successor set $S$ of $R$ is chosen as the next fundamental based on the heuristic. Here $S$, the set of constants at distance 1 from $R$, is given as

$$S = \{ s \mid \operatorname{dist}(R, s) = 1 \} = A_{s}(R, R). \tag{11}$$

For the target set of constants $T$ of the considered filter graph $G$, the $H_{\text{cub}}$ algorithm computes the set $R = \{r_1, r_2, \ldots, r_m\}$ with $T \subseteq R$. The algorithm has an optimal part and a heuristic part. In the optimal part, when $T \cap S \neq \emptyset$, there is a target in the successor set and it can be synthesized directly. If the entire target set is synthesized in this way, the solution is optimal. The heuristic function $H(R, S, T)$ is used when no more targets are found in $S$, which happens when all remaining targets are more than one $A$-operation away. In the heuristic part, the computation can be done in two ways:

(i) maximum benefit;

(ii) cumulative benefit.

To build the heuristic, we define the benefit function $B(R, s, t)$ as

$$B(R, s, t) = \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t). \tag{12}$$

A successor $s \in S$ that is closest to the target set needs to be picked to minimize the cost. This is possible if we can compute or estimate the $A$-distance. It is also useful to take into account the current estimate of the distance between $R$ and $T$. The benefit function $B(R, s, t)$ quantifies to what extent adding a successor $s$ to the ready set $R$ improves the distance to a fixed but arbitrary target $t$. However, for remote targets the estimate becomes less accurate; hence we use a weighted benefit function given as

$$B_b(R, s, t) = 10^{-\operatorname{dist}(R + s, t)} \left( \operatorname{dist}(R, t) - \operatorname{dist}(R + s, t) \right), \tag{13}$$

where $10^{-\operatorname{dist}(R + s, t)}$ is a weight factor that decreases exponentially as the distance to $t$ grows. The benefit functions for different targets $t$ can be added, and joint optimization can be achieved by using the cumulative benefit, which is used in the present work. Hence, the heuristic function for cumulative benefit is given by

$$H_{\text{cub}}(R, S, T) = \arg\max_{s \in S} \left[ \sum_{t \in T} B_b(R, s, t) \right]. \tag{14}$$
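The selection rule in (14) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the $A$-distance estimator is passed in as a callable `dist`, and the `toy_dist` function used in the example is a hypothetical stand-in, not the real $A$-distance. The weight uses a negative exponent so that it decays for remote targets, as the text describes.

```python
def weighted_benefit(dist, R, s, t):
    """B_b(R, s, t) = 10^(-dist(R+s, t)) * (dist(R, t) - dist(R+s, t))."""
    d_after = dist(R | {s}, t)
    return 10.0 ** (-d_after) * (dist(R, t) - d_after)

def hcub_pick(dist, R, S, T):
    """Return the successor s in S with the largest cumulative weighted benefit."""
    return max(S, key=lambda s: sum(weighted_benefit(dist, R, s, t) for t in T))

def toy_dist(R, t):
    """Hypothetical distance: set bits in |t - r| for the nearest ready constant r.

    This is only a placeholder so that the example runs; it is NOT the A-distance.
    """
    return min(bin(abs(t - r)).count("1") for r in R)

ready = frozenset({1})        # ready set R
successors = [3, 5, 7]        # successor set S
targets = [45, 23]            # target set T
best = hcub_pick(toy_dist, ready, successors, targets)
```

With the toy distance, the successor 7 is selected because it brings the target 23 much closer than 3 or 5 do. In a real MCM synthesis flow the chosen successor would be added to $R$ and the successor set recomputed before the next iteration.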

Here, the cumulative benefit heuristic adds up the weighted benefit over all the targets. With this method, the target set is computed, and with this target set a multiplierless MCM based filter graph can be designed. It is found that multiplierless designs reduce the combinational path delays owing to the sharing of intermediate results in the MCM approach. The performance can be further improved by retiming $G_i$ to give $G_r$. These two optimization techniques reduce the combinational delay and the critical path without changing the functionality, which further increases the clock speed. The 4th-order lattice filter with the multiplierless MCM concept using the $H_{\text{cub}}$ algorithm is shown in Figure 10. It is seen from the synthesis that the combinational delay is reduced. The filter is further retimed either for clock period minimization or for register minimization. This requires solving a set of linear inequalities with a computational complexity of $O(n^3)$ using the Floyd-Warshall algorithm, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduce the computation time for designing multiplierless digital filters when compared with previous methods [8, 16].

[Figure 10 diagram: network of adders, subtractors, $z^{-1}$ delay elements, and shift blocks (left shifts by 1, 2, 3, 5, and 6 bits) between input and output.]

Figure 10: Multiplierless MCM based 4th-order elliptic filter.

The algorithm starts by building a new graph from the original DFG. The new graph gives a set of inequalities called the critical path constraints. The original DFG also presents a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming value of each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge.

The improvement in the clock frequency is shown in Figure 11. Here, a 4th-order lattice filter is considered: Design1 is the filter with multipliers and without retiming, Design2 is the multiplierless MCM based filter without retiming, Design3 is the filter with multipliers and with retiming, and Design4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter increases by 196 with the multiplierless MCM approach, as the multipliers are eliminated and replaced by adders, which have much less computation delay. Further, it is observed that by combining this approach with retiming, the operating frequency increases by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.
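The constraint-graph step of the retiming algorithm can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the FPGA path solver of this work: each constraint is written as $r(i) - r(j) \le k$, nodes are numbered from 0, and the function names and the three-node example are hypothetical. The retimed edge weight follows the usual relation $w_r(U \to V) = w(U \to V) + r(V) - r(U)$.

```python
INF = float("inf")

def solve_constraints(n, inequalities):
    """Solve r[i] - r[j] <= k for every (i, j, k); return r, or None if infeasible."""
    src = n                                  # extra node with zero-weight edges to all nodes
    d = [[INF] * (n + 1) for _ in range(n + 1)]
    for v in range(n + 1):
        d[v][v] = 0
    for v in range(n):
        d[src][v] = 0
    for i, j, k in inequalities:             # r[i] - r[j] <= k  ->  edge j -> i of weight k
        d[j][i] = min(d[j][i], k)
    for m in range(n + 1):                   # Floyd-Warshall all-pairs shortest paths
        for a in range(n + 1):
            for b in range(n + 1):
                if d[a][m] + d[m][b] < d[a][b]:
                    d[a][b] = d[a][m] + d[m][b]
    if any(d[v][v] < 0 for v in range(n + 1)):
        return None                          # negative cycle: constraints cannot be met
    return [d[src][v] for v in range(n)]

def retime(edges, r):
    """Apply w_r(U -> V) = w(U -> V) + r(V) - r(U) to every edge (U, V, w)."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

# Hypothetical 3-node loop with two registers on the edge 2 -> 0.
edges = [(0, 1, 0), (1, 2, 0), (2, 0, 2)]
feasibility = [(u, v, w) for u, v, w in edges]   # r(U) - r(V) <= w(U -> V)
critical = [(0, 2, -1)]                          # assumed critical path constraint on 0 -> 2
r = solve_constraints(3, feasibility + critical)
retimed_edges = retime(edges, r)
```

In this example one of the two registers moves onto the path from node 0 to node 2, while the total number of registers in the loop stays at two. When the constraint set is contradictory, the negative-cycle test returns `None`, which corresponds to a target clock period that retiming cannot achieve.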

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) and combined with the retiming technique, the operating clock speed is found to increase, which gives a greater speed advantage for the design under consideration.
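To make the shift-and-add idea concrete, the sketch below multiplies an input by three constants using only shifts and additions and reuses one intermediate result. The constants 5, 25, and 45 and the function name are assumptions for illustration; they are not coefficients from the filters in this work.

```python
def mcm_block(x):
    """Return (5x, 25x, 45x) using three adders and three shifts, with no multiplier."""
    x5 = (x << 2) + x        # 5x  = 4x + x
    x25 = (x5 << 2) + x5     # 25x = 4*(5x) + 5x, reuses the shared term 5x
    x45 = (x5 << 3) + x5     # 45x = 8*(5x) + 5x, reuses the shared term 5x
    return x5, x25, x45

outputs = [mcm_block(x) for x in range(-8, 9)]
```

Three separate constant multiplications would each need their own shift-and-add network, whereas sharing the term $5x$ brings the total down to three adders. This sharing of intermediate results is what the MCM approach exploits.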

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which was designed as part of this research work. Initially, the design of filters is performed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user inputs the HDL file that was automatically generated after retiming for further component optimization. The user can select adders and multipliers for the retimed digital filters from a drop-down menu according to the design requirements. The original HDL is automatically modified with respect to the chosen components; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer optimize and generate digital filter RTL with the chosen adders and multipliers. With this, the designer can conveniently explore the solution space of possible architectures and analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are described below.

[Figure 11 plot: frequency in MHz and number of registers for Design1, Design2, Design3, and Design4; vertical axis from 0 to 80.]

Figure 11: Comparison of operating frequency and number of registers for different filter designs of 4th-order elliptic filter.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing, and innumerable schemes have been proposed for its realization. In this paper, we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The operands and their product $P$ are given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = A \times B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage, partial products $P_i$ are generated, which are added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \times n$ AND gates, $n$ half adders, and $(m - 2) \times n$ full adders.
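The double sum in (15) and the component counts above can be checked with a short behavioural sketch. It models the arithmetic only, not the gate-level array; the function names are assumptions.

```python
def array_multiply(a, b, n):
    """Unsigned n-bit product as the sum of partial products A_i * B_j * 2^(i+j)."""
    product = 0
    for i in range(n):
        for j in range(n):
            a_i = (a >> i) & 1
            b_j = (b >> j) & 1
            product += (a_i & b_j) << (i + j)    # one AND gate per (i, j) pair
    return product

def array_multiplier_cost(m, n):
    """Component counts quoted in the text for an m x n array multiplier."""
    return {"and_gates": m * n, "half_adders": n, "full_adders": (m - 2) * n}

cost_4x4 = array_multiplier_cost(4, 4)
```

For a 4 x 4 multiplier the counts come to 16 AND gates, 4 half adders, and 8 full adders, which shows how the hardware grows roughly with the product of the operand widths.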

Radix-4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with the array multiplier. The radix-4 Booth algorithm scans strings of three bits, which are converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four modified Booth encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple-carry adder. The Booth technique increases speed by halving the number of partial products. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
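The halving of the partial-product count can be seen in a small behavioural sketch of radix-4 Booth recoding. It assumes a two's-complement multiplier with an even bit width and uses hypothetical function names; the encoder, sign-extension, and multiplexer hardware described above is not modelled.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier (n even) into n/2 digits in {-2, ..., 2}."""
    y &= (1 << n) - 1                                  # keep the n-bit pattern
    bits = [0] + [(y >> k) & 1 for k in range(n)]      # bits[0] is the implicit y[-1] = 0
    # Each digit scans three overlapping bits: y[2k-1], y[2k], y[2k+1].
    return [bits[2 * k] + bits[2 * k + 1] - 2 * bits[2 * k + 2] for k in range(n // 2)]

def booth_multiply(x, y, n):
    """Sum one partial product d_k * x * 4^k per Booth digit."""
    digits = booth_radix4_digits(y, n)
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))

digits_32 = booth_radix4_digits(123456789, 32)
```

A 32-bit multiplier is recoded into 16 digits, so only 16 partial products are summed, and each is one of 0, ±x, or ±2x, which needs only a shift and an optional negation.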

Vedic Multiplier. It is used for faster multiplication of operands with larger bit widths. It has less combinational path delay [21] than the others when the bit size is higher. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertical and crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept through which the generation of all partial products can be done concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Owing to its regular structure, its layout can be easily generated.

The different multipliers are designed for different bit sizes, and the results are compared in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading $n$ full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage.
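The cascade can be sketched behaviourally as follows; the function names are assumptions, and the loop order mirrors how the carry ripples from the least significant stage to the most significant one.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit unsigned numbers with n cascaded full adders."""
    carry, total = cin, 0
    for i in range(n):                       # stage i waits for the carry of stage i - 1
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry

result, carry_out = ripple_carry_add(9, 7, 4)
```

Because each stage depends on the previous carry, the worst-case delay grows linearly with the bit width, which is why the ripple adder has the smallest area but the longest critical path among the adders considered.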

Parallel-Prefix Adders. Parallel-prefix adders [22] offer a highly efficient solution to the binary addition problem. Among the parallel-prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. The Ling adder using a Kogge-Stone parallel-prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


Table 1: Comparison of multipliers for delay, power, and area.

Type of multiplier   Delay in ns                 Power in mW                 Number of LUTs
                     32 bit   16 bit   8 bit     32 bit   16 bit   8 bit     32 bit   16 bit   8 bit
Array                761      399      21        21       11       7         1519     375      91
Booth                861      2799     149       25       15       12        1277     317      77
Vedic                707      3902     244       28       18       12        2378     565      126

The basic equations used in parallel-prefix adders are given below. The equations for generate and propagate are

$$G_{0:0} = G_0 = c_{\text{in}}, \qquad P_{0:0} = P_0 = 0,$$

$$G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \qquad P_{i:j} = P_{i:k} \cdot P_{(k-1):j}. \tag{16}$$

The sum generation is given by

$$S_i = P_i \;\text{XOR}\; G_{(i-1):0}. \tag{17}$$
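The prefix recurrences in (16) and the sum rule in (17) can be sketched as follows. For brevity the sketch combines the group signals in a Kogge-Stone style arrangement, which is an assumption for the example and not the Brent-Kung or Ling networks evaluated in this work; the carry-in is treated as position 0, as in (16), and the function name is hypothetical.

```python
def prefix_add(a, b, n, cin=0):
    """Add two n-bit unsigned numbers using group generate/propagate signals."""
    # Position 0 holds the carry-in (G_0 = c_in, P_0 = 0); bits occupy positions 1..n.
    g = [cin] + [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    d = 1
    while d <= n:
        # Descending order keeps the lower-index values of the previous level intact.
        for i in range(n, d - 1, -1):
            G[i] |= P[i] & G[i - d]          # G_{i:j} = G_{i:k} + P_{i:k} . G_{(k-1):j}
            P[i] &= P[i - d]                 # P_{i:j} = P_{i:k} . P_{(k-1):j}
        d *= 2
    total = 0
    for i in range(1, n + 1):
        total |= (p[i] ^ G[i - 1]) << (i - 1)    # S_i = P_i XOR G_{(i-1):0}
    return total, G[n]

result, carry_out = prefix_add(15, 1, 4)
```

The number of combining levels grows with the logarithm of the bit width rather than linearly, which is the source of the speed advantage of parallel-prefix adders over the ripple adder.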

Different adders are designed for different bit sizes, and their VLSI design metrics are compared in Table 2. The delay reported is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is also provided for a multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter with chosen combinational elements that are specific to a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI is shown in Figure 12. An $H_{\text{cub}}$ based algorithm is used for implementing the MCM blocks in multiplierless digital filters when the user selects this option in DiFiDOT. Since all the multipliers can be realized as a single block in transposed IIR and FIR filters, these filters are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of its components, and the depth limit is assumed to be infinity in our work. The DiFiDOT tool automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be reiterated because of a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the computations needed for retiming. The inputs are passed to the FPGA based path solver block by a processor on which the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path and critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device for path solver implementation is the Spartan-6 family XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.
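As a software reference for what the critical path solver computes, the sketch below finds the longest computation time over paths of a DFG whose edges carry no registers. It is a plain Python illustration, not the FPGA architecture of this work; the function name, the edge format `(u, v, registers)`, and the four-node example graph with its retimed edge weights are assumptions, and the graph is assumed to contain no register-free cycle.

```python
def critical_path(node_delay, edges):
    """Longest total node delay over any path made only of zero-register edges."""
    n = len(node_delay)
    neg = float("-inf")
    # longest[u][v]: largest delay of a zero-register path from u to v, both ends included.
    longest = [[neg] * n for _ in range(n)]
    for u in range(n):
        longest[u][u] = node_delay[u]
    for u, v, w in edges:
        if w == 0:
            longest[u][v] = max(longest[u][v], node_delay[u] + node_delay[v])
    for m in range(n):                       # Floyd-Warshall style relaxation (max-plus)
        for a in range(n):
            for b in range(n):
                via = longest[a][m] + longest[m][b] - node_delay[m]
                if via > longest[a][b]:
                    longest[a][b] = via
    return max(max(row) for row in longest)

# Hypothetical DFG: nodes 0 and 1 are adders (1 time unit), nodes 2 and 3 multipliers (2 units).
delays = [1, 1, 2, 2]
original = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]
retimed = [(0, 2, 1), (0, 3, 2), (2, 1, 1), (3, 1, 1), (1, 0, 0)]
before = critical_path(delays, original)
after = critical_path(delays, retimed)
```

In the example, moving registers onto the multiplier outputs shortens the critical path from 3 to 2 time units, which is the quantity that clock period minimization retiming tries to reduce.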


Table 2: Comparison of adders for delay, power, and area.

Type of adder   Delay in ns                 Power in mW                 Number of LUTs
                32 bit   16 bit   8 bit     32 bit   16 bit   8 bit     32 bit   16 bit   8 bit
Ling            8854     1524     2021      6        9        18        23       53       107
Brent-Kung      104      1839     2583      4        6        9         15       30       63
Ripple          1212     2063     376       2        7        14        9        18       36

Table 3: Device utilization and timing summary of path solvers.

Path solver name       Device utilization summary            Timing summary                                            Max. frequency (Hz)
                       Logic utilization            Used     Min. period in ns   Setup time in ns   Hold time in ns
Critical path solver   Number of slices             5804     9068 ns             1572 ns            6141 ns           110277
                       Number of LUTs               10462
                       Number of slice flip-flops   3664
Shortest path solver   Number of slices             4147     14089 ns            10477 ns           4114 ns           70978
                       Number of LUTs               7511
                       Number of slice flip-flops   1496
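As a consistency check on Table 3, the maximum frequency should be the reciprocal of the minimum clock period. The worked equation below assumes that decimal points were lost from the table entries, so that the minimum periods read 9.068 ns and 14.089 ns; that reading is an assumption and is not stated in the table itself.

$$f_{\max} = \frac{1}{T_{\min}}: \qquad \frac{1}{9.068\ \text{ns}} \approx 110.28\ \text{MHz}, \qquad \frac{1}{14.089\ \text{ns}} \approx 70.98\ \text{MHz}.$$

Under this reading, the listed maximum frequency entries 110277 and 70978 would correspond to about 110.277 MHz and 70.978 MHz for the critical path solver and the shortest path solver, respectively.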

Here, various IIR and FIR filters are considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the GPP based path solvers. In Table 4, the time taken by the FPGA path solvers is compared with the time taken by the algorithms run on the GPP in the MATLAB environment. The time overhead for the GPP, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial compared with designs without FPGA based path solvers, which helps speed up the retiming computation.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that the clock period is reduced after retiming. The register count changes depending on the iteration bound of the filter. Here, three models are considered: Model 1 is the filter without retiming, built with adder, subtractor, multiplier, and delay elements; Model 2 is the retimed filter based on the clock period minimization algorithm; and Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit to verify its functionality. It is observed that the clock period minimization retiming algorithm is efficient at reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

[Figure 13 plot: clock period and register count of Model 1, Model 2, and Model 3 for the IIR and FIR filters of orders 2 to 12; vertical axis from 0 to 40.]

Figure 13: Clock period and register count before and after retiming for various digital filter blocks.

Table 4: Computation time comparison.

               Critical path solver algorithm                        Shortest path solver algorithm
               IIR filter                FIR filter                  IIR filter                FIR filter
Filter order   FPGA based   GPP based    FPGA based   GPP based      FPGA based   GPP based    FPGA based   GPP based
               (ns)         (ms)         (ns)         (ms)           (ns)         (ms)         (ns)         (ms)
2              460          138          906          1283           278          305          305          1280
4              1571         1578         1631         1446           368          1391         1391         1319
6              2998         1918         1923         1547           398          1542         1542         1731
8              3162         2190         2971         1642           452          2523         2523         3294
10             3981         2627         3653         1861           536          4293         4293         4534
12             4672         3142         4328         2352           671          5534         5534         5161

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results for the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. If the user does not want to use these built-in combinations, any of the available components can be chosen for FIR/IIR digital filter HDL generation with specific combinational components.

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

Filter block   Adder/multiplier combination               Before retiming                                         After retiming
                                                          Number    Max. operating   Power    Number    Max. operating   Power
                                                          of LUTs   freq. in MHz     in mW    of LUTs   freq. in MHz     in mW
IIR-10         Brent-Kung adder/array multiplier          2222      62526            99       2411      76977            89
               Ling adder/Vedic multiplier                2214      69702            112      2193      95381            94
               Ripple-carry adder/Booth multiplier        2146      50861            114      1809      65248            95
FIR-10         Brent-Kung adder/array multiplier          1736      62526            94       1811      9943             85
               Ling adder/Vedic multiplier                2162      72493            111      2271      10072            95
               Ripple-carry adder/Booth multiplier        1637      52302            105      1615      71345            87

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results for filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is the same before and after retiming. The frequency improvement seen for different filters across these models is given in Figure 14. The operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. [Bar chart; x-axis: filter type (FIR-2 to FIR-12 and IIR-2 to IIR-12); y-axis: frequency improvement, 0 to 90; series: frequency improvement from Model 1 to Model 2 and from Model 1 to Model 3.]

Table 6: Comparison of area, delay, and power for different models of various digital filters.

| Filter block | Adders, multipliers, flip-flops: Model 1 | Adders, multipliers, flip-flops: Model 2 | Adders, multipliers, flip-flops: Model 3 | Delay, max. freq. (MHz): Model 1 | Delay, max. freq. (MHz): Model 2 | Delay, max. freq. (MHz): Model 3 | Power (W): Model 1 | Power (W): Model 2 | Power (W): Model 3 |
|---|---|---|---|---|---|---|---|---|---|
| FIR-2 | 523 | 503 | 504 | 5154 | 19214 | 34062 | 0056 | 0063 | 0065 |
| FIR-4 | 1035 | 1105 | 1108 | 5941 | 10841 | 22204 | 0047 | 0057 | 0060 |
| FIR-6 | 727 | 1707 | 17014 | 6291 | 6764 | 25947 | 0051 | 0062 | 0064 |
| FIR-8 | 1559 | 2209 | 22016 | 5482 | 6592 | 11791 | 0054 | 0058 | 0065 |
| FIR-10 | 18611 | 25011 | 25011 | 4822 | 5637 | 10072 | 0058 | 0061 | 0063 |
| FIR-12 | 20713 | 29013 | 29013 | 4634 | 5486 | 19340 | 0060 | 0063 | 0067 |
| IIR-2 | 943 | 1103 | 1103 | 5503 | 7553 | 8910 | 0047 | 0050 | 0050 |
| IIR-4 | 1675 | 2005 | 1906 | 2278 | 11388 | 15165 | 0059 | 0062 | 0063 |
| IIR-6 | 24117 | 3507 | 3508 | 3871 | 4254 | 53142 | 0051 | 0059 | 0058 |
| IIR-8 | 301110 | 33010 | 3307 | 2946 | 7014 | 11021 | 0044 | 0064 | 0081 |
| IIR-10 | 371613 | 54013 | 54014 | 3643 | 4885 | 95381 | 0051 | 0067 | 0085 |
| IIR-12 | 422017 | 63017 | 63019 | 3973 | 5074 | 10152 | 0063 | 0071 | 0088 |

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is used by physicians to interpret and identify physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. It uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter removes the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded. [Blocks shown: ECG source, noise source, adder, filter, and scope.]
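As an illustration of the power-line noise removal described above, the following Python sketch applies a second-order IIR notch filter to a sampled signal. It is only a sketch under stated assumptions: the paper does not specify the filter type or its coefficients, so the notch structure, the 500 Hz sampling rate, the pole radius `r`, and the names `notch_coefficients` and `iir_filter` are all hypothetical. The code also uses ordinary floating-point multiplications rather than the retimed multiplierless MCM hardware.

```python
import math


def notch_coefficients(f0, fs, r=0.95):
    """Second-order notch: zeros on the unit circle at f0, poles at radius r.

    f0 is the frequency to remove (Hz), fs the sampling rate (Hz).
    The pole radius r (assumed value) sets how narrow the notch is.
    """
    w0 = 2.0 * math.pi * f0 / fs
    b = [1.0, -2.0 * math.cos(w0), 1.0]        # numerator (zeros)
    a = [1.0, -2.0 * r * math.cos(w0), r * r]  # denominator (poles)
    return b, a


def iir_filter(b, a, x):
    """Direct-form I biquad: y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2]
    - a1 y[n-1] - a2 y[n-2]."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


if __name__ == "__main__":
    fs = 500.0  # assumed sampling rate in Hz
    b, a = notch_coefficients(50.0, fs)
    # Stand-in for an ECG record: a slow 5 Hz component plus 50 Hz hum.
    noisy = [math.sin(2 * math.pi * 5.0 * n / fs)
             + 0.5 * math.sin(2 * math.pi * 50.0 * n / fs)
             for n in range(1000)]
    clean = iir_filter(b, a, noisy)
    print(max(abs(v) for v in clean[-100:]))
```

For 60 Hz mains, only the first argument of `notch_coefficients` changes. In a multiplierless realization the three distinct coefficient values would be quantized to integers and implemented with shifts and adds.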

6. Conclusions

In this paper, we introduced a retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. The tool directly produces synthesizable filter RTL, which saves considerable designer time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. Realizing the critical path and shortest path solvers as FPGA architectures, and communicating their results to the processor on which the retiming algorithm runs, yields a significant computation time gain compared to designs in which the path solver algorithms run on the processor as part of the retiming algorithm. With this flow, a designer can find the synthesizable digital filter RTL that best fits an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.



Figure 10: Multiplierless MCM based 4th-order elliptic filter. [Block diagram between input and output built from adders, subtractors, unit delays (z^-1), and left-shift blocks with shifts of 1, 2, 3, 5, and 6 bits.]

without changing the functionality, which further increases the clock speed. The 4th-order lattice filter built with the multiplierless MCM concept using the Hcub algorithm is shown in Figure 10. Synthesis shows that the combinational delay is reduced. The filter is then retimed for either clock period minimization or register minimization. This requires solving a set of linear inequalities using the Floyd-Warshall algorithm, which has a computational complexity of $O(n^3)$, where $n$ is the number of nodes [8]. The clock period minimization and register minimization retiming algorithms are designed and implemented with FPGA based path solvers, which reduce computation time compared to previous methods [8, 16] for designing multiplierless digital filters.

The algorithm starts by building a new graph from the original DFG. The new graph yields a set of inequalities called the critical path constraints. The original DFG also yields a set of inequalities called the feasibility constraints. A constraint graph can be built from the critical path constraints and the feasibility constraints. The retiming value of each node can be derived by applying the Floyd-Warshall shortest path algorithm to the constraint graph. The weight of each edge in the retimed DFG can be calculated from the original weight and the retiming values of the two nodes connected by this edge. The improvement in the clock frequency is shown in Figure 11. Here a 4th-order lattice filter is considered: Design 1 is the filter with multipliers and without retiming, Design 2 is the multiplierless MCM based filter without retiming, Design 3 is the filter with multipliers and with retiming, and Design 4 is the multiplierless MCM based lattice filter with retiming. The maximum operating frequency of the filter increases by 196 with the multiplierless MCM approach, because the multipliers are eliminated and replaced by adders, which have much less computation delay. Combining this approach with retiming increases the operating frequency by 354, which is a significant increase. However, with this technique the number of registers increases from 9 to 11.

Hence, when the filter is designed without multipliers (that is, using only adders/subtractors and shifters) and retiming is applied, the operating clock speed increases, which gives a greater speed advantage for the design under consideration.
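The constraint-graph procedure described above can be sketched in Python as follows. This is a minimal software illustration of retiming for a target clock period, following the textbook formulation in [8]; it is not the authors' FPGA path solver or the DiFiDOT implementation. The function names, the data layout (edges as `(u, v, delays)` tuples), and the four-node example DFG are assumptions chosen for illustration.

```python
import math

INF = float("inf")


def floyd_warshall(n, weights):
    """All-pairs shortest paths over nodes 0..n-1.

    weights maps (u, v) -> edge weight. Returns the distance matrix,
    or None when the graph contains a negative cycle.
    """
    d = [[INF] * n for _ in range(n)]
    for (u, v), w in weights.items():
        d[u][v] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return None if any(d[i][i] < 0 for i in range(n)) else d


def retime_for_period(n, t, edges, c):
    """Return retiming values r (one per node) for clock period c, or None.

    t[v] is the computation time of node v; edges is a list of (u, v, w)
    where w is the number of delays on the edge u -> v.
    """
    if max(t) > c:
        return None
    # Path solver step: W(u, v) is the minimum number of delays on any
    # path u -> v, D(u, v) the longest computation time among such paths.
    m = max(t) * n
    scaled = {}
    for u, v, w in edges:
        scaled[(u, v)] = min(scaled.get((u, v), INF), m * w - t[u])
    s = floyd_warshall(n, scaled)
    if s is None:
        return None  # delay-free loop: not a valid DFG
    cons = {}  # r(u) - r(v) <= k is stored as edge v -> u with weight k

    def add(u, v, k):
        cons[(v, u)] = min(cons.get((v, u), INF), k)

    for u, v, w in edges:  # feasibility constraints
        add(u, v, w)
    for u in range(n):  # critical path constraints
        for v in range(n):
            if u != v and s[u][v] < INF:
                w_uv = math.ceil(s[u][v] / m)
                d_uv = m * w_uv - s[u][v] + t[v]
                if d_uv > c:
                    add(u, v, w_uv - 1)
    for v in range(n):  # extra source node n reaches every node
        cons[(n, v)] = 0
    dist = floyd_warshall(n + 1, cons)
    return None if dist is None else dist[n][:n]


def retimed_edges(edges, r):
    """Edge u -> v gets weight w + r(v) - r(u) after retiming."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


def clock_period(n, t, edges):
    """Longest computation time along any path of zero-delay edges."""
    arrive = list(t)
    for _ in range(n):
        for u, v, w in edges:
            if w == 0:
                arrive[v] = max(arrive[v], arrive[u] + t[v])
    return max(arrive)


if __name__ == "__main__":
    # Hypothetical DFG: two adders (1 time unit), two multipliers (2 units).
    t = [1, 1, 2, 2]
    edges = [(0, 2, 1), (0, 3, 2), (2, 1, 0), (3, 1, 0), (1, 0, 1)]
    r = retime_for_period(4, t, edges, c=2)
    new_edges = retimed_edges(edges, r)
    print(clock_period(4, t, edges), clock_period(4, t, new_edges))
```

In this example the total delay count grows from 4 to 5 after retiming, which mirrors the observation above that clock period minimization can increase the register count. The two calls to `floyd_warshall` are the steps that the paper offloads to FPGA based path solvers.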

Figure 11: Comparison of operating frequency and number of registers for different filter designs of the 4th-order elliptic filter. [Bar chart; x-axis: Design 1 to Design 4; y-axis: 0 to 80; series: frequency in MHz and number of registers.]

3.4. Computer Aided Design Tool. This section presents the DiFiDOT tool, which was designed as part of this research work. Initially, the filters are designed using a retimed architecture, where the user can choose either clock period minimization or register minimization retiming as needed. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for it. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user inputs the HDL file that was automatically generated after retiming, for further component optimization. Using a drop-down menu, the user can choose adders and multipliers for the retimed digital filters according to the design requirements. The original HDL is automatically modified for the chosen components; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer optimize and generate digital filter RTL with the adders and multipliers of choice. With this, the designer can conveniently explore the solution space of possible architectures and analyze the trade-offs in the energy-area-performance space [20]. The adders and multipliers considered in the tool are described below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realizing the operation. In this paper, we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The multiplication is given as

$$A = \sum_{i=0}^{n-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j, \qquad P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}$$

In each stage, the partial products $P_i$ are generated, and these are added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \cdot n$ AND gates, $n$ half adders, and $(m-2) \cdot n$ full adders.
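The partial-product view of equation (15) can be checked with a short Python sketch. It models only the arithmetic (AND-gate partial products, shifted and summed), not the half adder and full adder array of the hardware; the function name `array_multiply` is hypothetical.

```python
def array_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers by summing the partial
    products A_i * B_j * 2^(i+j) of equation (15)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    product = 0
    for j in range(n):
        # One row of the array: A ANDed with bit B_j, shifted left by j.
        row = sum((a_bits[i] & b_bits[j]) << (i + j) for i in range(n))
        product += row
    return product


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))  # 13 * 11
```

Each of the n rows corresponds to one row of AND gates in the array; the hardware replaces the running `+=` with a chain of adders, which is why the delay grows with the operand width.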

Radix-4 Booth Multiplier. It has the advantage of smaller area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans strings of three bits, and each string is converted according to the modified Booth encoder table. The Booth multiplier in this project consists of four modified Booth encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple carry adder. The Booth technique increases speed by halving the number of partial products. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
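The halving of partial products can be illustrated with a small Python sketch of radix-4 (modified) Booth recoding. It shows the standard encoding table only and is not the authors' MBE hardware; the names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the multiplier is assumed to be an n-bit two's-complement value with n even.

```python
# Modified Booth table: bits (b[i+1], b[i], b[i-1]) -> digit in {-2..2}.
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(multiplier: int, n: int) -> list:
    """Recode an n-bit two's-complement multiplier into n/2 signed digits,
    least significant digit first."""
    # Append the implicit 0 below the LSB so every group has three bits.
    m = (multiplier & ((1 << n) - 1)) << 1
    return [BOOTH_TABLE[(m >> i) & 0b111] for i in range(0, n, 2)]


def booth_multiply(a: int, b: int, n: int) -> int:
    """Sum one partial product (digit * a, weighted by 4^k) per digit."""
    digits = booth_radix4_digits(b, n)
    return sum((d * a) << (2 * k) for k, d in enumerate(digits))


if __name__ == "__main__":
    print(len(booth_radix4_digits(12345, 32)))  # number of partial products
    print(booth_multiply(-7, 5, 4))
```

Each digit selects one of 0, ±a, or ±2a, which in hardware needs only a shift and an optional negation, consistent with the multiplexer-based partial product generators mentioned above.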

Vedic Multiplier. It is used for faster multiplication at larger bit sizes. It has less combinational path delay [21] than the others when the bit size is large. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam Sutra, a general multiplication formula applicable to all cases of multiplication; the name means "vertically and crosswise". It is based on a novel concept in which all partial products are generated concurrently with their addition. The speed advantage comes at the cost of increased power dissipation and area. Owing to its regular structure, the layout can be generated easily.

The multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. The ripple carry adder, Brent-Kung adder, and Ling adder are considered to highlight their performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading full adder blocks in series. The carry-out of one stage is fed directly to the carry-in of the next stage. An $n$-bit parallel adder requires $n$ full adders.
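The cascade of full adders described above can be modelled bit by bit in Python. This is a behavioural sketch only; the function names are hypothetical.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a: int, b: int, n: int, cin: int = 0):
    """Add two unsigned n-bit integers with n cascaded full adders.

    Returns (n-bit sum, carry_out). The carry of stage i feeds stage i+1,
    so the worst-case delay grows linearly with n.
    """
    carry = cin
    total = 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(ripple_carry_add(9, 8, 4))
```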

Parallel-Prefix Adders. Parallel prefix adders [22] offer a highly efficient solution to the binary addition problem. Among the parallel prefix adders, the Brent-Kung adder offers a good balance between area, power, and performance. The Ling adder built on a Kogge-Stone parallel prefix structure also offers faster addition [22], but it consumes more power than the Brent-Kung adder.

Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay (ns), 32 bit | Delay (ns), 16 bit | Delay (ns), 8 bit | Power (mW), 32 bit | Power (mW), 16 bit | Power (mW), 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |

The basic equations used in parallel prefix adders are given below. The equations for the group generate and propagate signals are

$$\begin{aligned}
G_{0:0} &= G_0 = c_{\text{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned} \tag{16}$$

The sum generation is given by

$$S_i = P_i \oplus G_{(i-1):0}. \tag{17}$$
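Equations (16) and (17) can be exercised with a small Python model of a prefix adder. The sketch below uses a Kogge-Stone style prefix schedule purely because it is compact to write; it is not the Brent-Kung or Ling structure evaluated in the paper, and the function name `prefix_add` is hypothetical. Following equation (16), position 0 carries the carry-in with $G_0 = c_{\text{in}}$ and $P_0 = 0$, and operand bit $i-1$ sits at position $i$.

```python
def prefix_add(a: int, b: int, n: int, cin: int = 0):
    """Add two unsigned n-bit integers with a parallel-prefix carry network.

    Returns (n-bit sum, carry_out).
    """
    # Bit-level propagate/generate; index 0 is the carry-in position.
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [cin] + [((a >> i) & (b >> i)) & 1 for i in range(n)]

    # After the loop, G[i] is the group generate G_{i:0}.
    G, P = g[:], p[:]
    dist = 1
    while dist <= n:
        prev_g, prev_p = G[:], P[:]
        for i in range(dist, n + 1):
            # Prefix operator of equation (16).
            G[i] = prev_g[i] | (prev_p[i] & prev_g[i - dist])
            P[i] = prev_p[i] & prev_p[i - dist]
        dist *= 2

    # Equation (17): S_i = P_i XOR G_{(i-1):0}.
    total = sum((p[i] ^ G[i - 1]) << (i - 1) for i in range(1, n + 1))
    return total, G[n]


if __name__ == "__main__":
    print(prefix_add(9, 8, 4))
```

The loop runs about log2(n) times, which is the source of the speed advantage over the ripple adder; the different prefix adders trade the number of operator cells (area, power) against this depth.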

Different adders are designed for different bit sizes, and their VLSI design metrics are compared in Table 2. The reported delay is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created that suggests a particular adder and multiplier combination depending on whether the target performance parameter is speed, power, or area, and on the bit size. For example, if the design constraint chosen by the user is power, the Brent-Kung adder and array multiplier pair is considered the best combination for implementing the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. In addition, an option is provided for multiplierless filter design description based on the MCM approach. The retimed MCM circuits are seen to outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter with combinational elements of choice, specific to a particular design constraint, and generate the RTL for it. The resulting RTL can be synthesized with any commercially available synthesis tool. The GUI is shown in Figure 12. An Hcub based algorithm is used to implement the MCM blocks in multiplierless digital filters when the user selects that option in DiFiDOT. Since all the multipliers in transposed IIR and FIR filters can be realised as a single block, these filters are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced, in the multiplierless design approach, by a block constructed from adders/subtractors, negation operations, and shifters. The generated MCM block has a tree depth in terms of different components, and in our work this depth is assumed to be unbounded (infinity). DiFiDOT automatically generates directly synthesizable HDL for the retimed digital filter under consideration. With this tool and automation, even if the design cycle must be reiterated because of a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.
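To make the idea of replacing a constant multiplier with adders/subtractors and shifters concrete, the following Python sketch multiplies by a constant using its canonical signed digit (CSD) form. This is a swapped-in, simpler technique chosen for illustration: it handles one constant at a time and does not share intermediate terms across constants, which is what the Hcub algorithm [18] used in the paper does. The function names are hypothetical.

```python
def csd_digits(c: int) -> list:
    """Canonical signed digit form of a non-negative integer,
    least significant digit first, digits in {-1, 0, 1}."""
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c % 4 == 1, -1 if c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x: int, c: int) -> int:
    """Compute c * x using only shifts, additions, and subtractions."""
    acc = 0
    for k, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << k
        elif d == -1:
            acc -= x << k
    return acc


def adder_count(c: int) -> int:
    """Adders/subtractors needed for one constant: nonzero digits minus 1."""
    nonzero = sum(1 for d in csd_digits(c) if d != 0)
    return max(nonzero - 1, 0)


if __name__ == "__main__":
    # 7 * x becomes (x << 3) - x: one subtractor instead of a multiplier.
    print(csd_digits(7), shift_add_multiply(5, 7), adder_count(7))
```

In a transposed-form filter the same input sample is multiplied by every coefficient, so an MCM algorithm can reuse partial terms such as `(x << 3) - x` across several coefficients and reduce the adder count below what this per-constant sketch achieves.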

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part compares various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing the path solver algorithms on FPGA is to speed up the path computations needed for retiming. The inputs are passed to the FPGA based path solver block by the processor on which the retiming algorithm is implemented. The FPGA block computes the shortest path and the critical path and communicates them back to the processor, where retiming is performed. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, including recursive and nonrecursive filter structures. The target device for the path solver implementation is the Spartan-6 family based XC6SLX16. Simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the synthesis and timing results are shown in Table 3. Offloading the critical path and shortest path computations to the FPGA reduces the burden on the main processor.

Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay (ns), 32 bit | Delay (ns), 16 bit | Delay (ns), 8 bit | Power (mW), 32 bit | Power (mW), 16 bit | Power (mW), 8 bit | Number of LUTs, 32 bit | Number of LUTs, 16 bit | Number of LUTs, 8 bit |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| Brent-Kung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Number of slices | Number of LUTs | Number of slice flip-flops | Min period (ns) | Setup time (ns) | Hold time (ns) | Max frequency (Hz) |
|---|---|---|---|---|---|---|---|
| Critical path solver | 5804 | 10462 | 3664 | 9068 | 1572 | 6141 | 110277 |
| Shortest path solver | 4147 | 7511 | 1496 | 14089 | 10477 | 4114 | 70978 |

Various IIR and FIR filters are considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and the shortest path in much less time than the general purpose processor based path solvers. Table 4 compares the time taken by the FPGA path solvers with the time taken by the same algorithms run on the general purpose processor in the MATLAB environment. The time overhead for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even with this overhead included, the time gain is substantial compared to designs without FPGA based path solvers, and it helps speed up the path computations that dominate retiming.


VLSI Design 17

Table 6 Comparison of area delay and power for different models of various digital filters

Filter block Adder multipliers Flipflops DelayMax Freq in MHz Power in WattsModel 1 Model 2 Model 3 Model 1 Model 2 Model 3 Model 1 Model 2 Model 3

FIR-2 523 503 504 5154 19214 34062 0056 0063 0065FIR-4 1035 1105 1108 5941 10841 22204 0047 0057 0060FIR-6 727 1707 17014 6291 6764 25947 0051 0062 0064FIR-8 1559 2209 22016 5482 6592 11791 0054 0058 0065FIR-10 18611 25011 25011 4822 5637 10072 0058 0061 0063FIR-12 20713 29013 29013 4634 5486 19340 0060 0063 0067IIR-2 943 1103 1103 5503 7553 8910 0047 0050 0050IIR-4 1675 2005 1906 2278 11388 15165 0059 0062 0063IIR-6 24117 3507 3508 3871 4254 53142 0051 0059 0058IIR-8 301110 33010 3307 2946 7014 11021 0044 0064 0081IIR-10 371613 54013 54014 3643 4885 95381 0051 0067 0085IIR-12 422017 63017 63019 3973 5074 10152 0063 0071 0088

Scope 3

Noise 1

DSP

Filter 2

In 1 Out 8

ECG 1

DSP

Add 1

+

+

(a) (b)

Figure 15 Structure of ECG block for power noise removal (a) block diagram (b) filter block expanded

are often corrupted by high-frequency noises such as power-line interference electromyography (EMG) noise and instru-mentation noise An ECG is usually affected by the 5060Hznoise in the power supply lines This noise can be eliminatedby using a digital filter The model is constructed in matlaband tested for ECG signals for removing the noise Theconstructed model uses retimed multiplierless MCM filterwhich is implemented on FPGA and tested for ECG signalwhich is corrupted by power-line noise The filter efficientlyfilters out the noise and outputs the clean ECG signal TheECG noise removal block using the optimized filter structureis shown in Figure 15

6 Conclusions

In this paper we introduced the retiming approach fordesigning multiplierless MCM based digital filters withspeed and area as the constraint The implementation costat the gate level is reduced by using addition subtrac-tion and shift operations instead of multiplication and byusing register sharing and register minimization retimingalgorithm approach Since there are still instances withwhich multiplierless designs can not cope we also proposed

the combination of adder and multiplier blocks which canbe used in retimed filter design which is applicable forspecific VLSI design constraint such as power area andtiming This yields the optimal clock speed and gate-levelarea in design and implementation of digital filters Thispaper also introduced the design architectures for the digitalfilter and a CAD tool for the realization of retimed digitalfilters which can be either multiplierless MCM based orwith addersubtractor multiplier and delay elements Thistool directly gives the synthesizable filter RTL which reduceslot of designersrsquo time and effort in the design cycle Theexperimental results indicate that the retiming algorithmefficiency can be further increased by using FPGA basedpath solver algorithms proposed in this paper It was shownthat the realization of path solver architectures for solvingcritical path and shortest path in retiming computation andcommunicating the results to the processor where retimingalgorithm is implemented yields significant increase in com-putation time gain when compared to the filter designs forwhich path solver algorithms are implemented as a part ofretiming algorithm in the processor It is observed that adesigner can find the synthesizable digital filter RTL that fitsbest in an application

18 VLSI Design

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient indian vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "Fir filter design based on retiming and automation using vlsi design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.


VLSI Design 13

Figure 11: Comparison of operating frequency and number of registers for different filter designs (Design 1 to Design 4) of a 4th-order elliptic filter. Series: frequency in MHz; number of registers.

minimization or register minimization retiming as per his need. The tool retimes the digital filter by optimizing the critical path and generates Verilog/VHDL based filter RTL for the same. The performance of a filter can also be increased by varying the choice of combinational adder and multiplier elements in the RTL filter description. A graphical user interface (GUI) is created in DiFiDOT using Nokia Qt 4.8.0 for component selection and optimization of digital filters. Here the user has to input the HDL file that was automatically generated after retiming, for further component optimization. The user can choose adders and multipliers according to the design requirements for the retimed digital filters using a drop-down menu. The original HDL is automatically modified with respect to the components chosen; the result is again synthesizable and is given as the output to the user. This easy-to-use GUI helps the designer to optimize and generate digital filter RTL with the adders and multipliers of his choice. With this, the designer can conveniently explore the solution space of possible architectures and also analyze the trade-offs in the energy-area-performance space [20]. The different adders and multipliers considered in the tool are described below.

Multiplier Architecture. The most critical function carried out by any filter is multiplication. Digital multiplication [19] is the most extensively used operation in signal processing. Innumerable schemes have been proposed for realization of the operation. In this paper, we consider three types of multipliers.

Array Multiplier. It is the basic type of multiplier. Consider two binary numbers $A$ and $B$ of $n$ bits each. The operands and their product are given as

\[
A = \sum_{i=0}^{n-1} A_i 2^{i}, \qquad
B = \sum_{j=0}^{n-1} B_j 2^{j}, \qquad
P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}. \tag{15}
\]

In each stage, the partial products $P_i$ are generated and added to obtain the final product $P$. In general, an $m \times n$ array multiplier needs $m \cdot n$ AND gates, $n$ half adders, and $(m-2) \cdot n$ full adders.
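As a minimal sketch of the partial-product sum in (15), the Python below forms each bit product $A_i B_j$ with an AND, weights it by $2^{i+j}$, and accumulates the result. The function name `array_multiply` and the 8-bit default width are illustrative assumptions; this is a behavioural model, not the HDL generated by the tool.

```python
def array_multiply(a, b, n=8):
    """Multiply two unsigned n-bit integers by summing AND-gate partial products."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must fit in n unsigned bits")
    product = 0
    for j in range(n):                  # one row of the array per multiplier bit B_j
        b_j = (b >> j) & 1
        for i in range(n):              # one AND gate per multiplicand bit A_i
            a_i = (a >> i) & 1
            product += (a_i & b_j) << (i + j)   # term A_i * B_j * 2^(i+j)
    return product
```

In hardware the inner additions are carried out by the half and full adders of the array; the software loop only mirrors the arithmetic, not the gate-level structure.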

Radix-4 Booth Multiplier. It has the advantage of lesser area and faster multiplication compared with array multiplication. The radix-4 Booth algorithm scans overlapping strings of three bits, and each string is converted according to the modified Booth encoder table. The design of the Booth multiplier in this project consists of four modified Booth encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer), and finally a ripple-carry adder. The Booth technique increases speed by reducing the number of partial products by half. Since a 32-bit Booth multiplier is used in this project, only sixteen partial products need to be added instead of the 32 partial products generated by a conventional multiplier.
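The halving of the partial-product count can be illustrated with a short sketch of radix-4 (modified Booth) recoding. It uses the standard digit rule $d_k = y_{2k-1} + y_{2k} - 2y_{2k+1}$ with $y_{-1} = 0$; the function names are hypothetical, and the code models only the arithmetic, not the encoder and multiplexer hardware described above.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier into n/2 digits in {-2,...,2}."""
    if n % 2:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y must fit in n signed bits")
    bits = y & ((1 << n) - 1)           # two's-complement bit pattern of y
    digits = []
    prev = 0                            # implicit bit y_{-1} = 0
    for k in range(0, n, 2):
        b0 = (bits >> k) & 1            # y_{2k}
        b1 = (bits >> (k + 1)) & 1      # y_{2k+1}
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, n):
    """Sum one partial product per Booth digit (n/2 terms instead of n)."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))
```

For a 32-bit multiplier, `booth_radix4_digits` returns sixteen digits, one per partial product, which matches the count stated in the text.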

Vedic Multiplier. It is used for faster multiplication of operands with higher bit widths. It has less combinational path delay [21] than the others when the bit size is large. However, it consumes more area than the Booth multiplier and the array multiplier. The multiplier is based on the Urdhva Tiryakbhyam (vertically and crosswise) Sutra, which is a general multiplication formula applicable to all cases of multiplication. It is based on a novel concept in which all partial products are generated concurrently with the addition of these partial products. The speed advantage comes at the cost of increased power dissipation and area. Due to its regular structure, its layout can be easily generated.

The multipliers are designed for different bit sizes and the results are compared, as shown in Table 1.

3.5. Adders. In this paper, qualitative evaluations of the classified binary adder architectures are performed, since the adder is another basic component of an FIR filter. Here the ripple-carry adder, Brent-Kung adder, and Ling adder are considered to emphasize the performance properties. Adders affect the critical path delay and area.

Ripple Adder. It is the basic adder type. An $n$-bit ripple adder is constructed by cascading $n$ full adder blocks in series; the carry-out of one stage is fed directly to the carry-in of the next stage.

Parallel-Prefix Adders. Parallel-prefix adders [22] offer a highly efficient solution to the binary addition problem. Among the parallel-prefix adders, the Brent-Kung adder has a good balance between area, power, and performance. It is found that the Ling adder using a Kogge-Stone parallel-prefix structure also has the advantage of faster addition [22], but it consumes more power than the Brent-Kung adder.


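Table 1: Comparison of multipliers for delay, power, and area.

| Type of multiplier | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Array | 761 | 399 | 21 | 21 | 11 | 7 | 1519 | 375 | 91 |
| Booth | 861 | 2799 | 149 | 25 | 15 | 12 | 1277 | 317 | 77 |
| Vedic | 707 | 3902 | 244 | 28 | 18 | 12 | 2378 | 565 | 126 |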

The basic equations used in parallel-prefix adders are given below. The equations of bit generate and propagate are

\[
\begin{aligned}
G_{0:0} &= G_0 = c_{\mathrm{in}}, \\
P_{0:0} &= P_0 = 0, \\
G_{i:j} &= G_{i:k} + P_{i:k} \cdot G_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \cdot P_{(k-1):j}.
\end{aligned} \tag{16}
\]

The sum generation is given by

\[
S_i = P_i \oplus G_{(i-1):0}. \tag{17}
\]
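The recurrences (16) and (17) can be exercised with a small behavioural sketch. The code below is a Kogge-Stone style prefix scan (the combining span doubles at each level), chosen here only because it is compact; it is not the Brent-Kung or Ling netlist evaluated in this paper. Position 0 holds the carry-in as in (16), and the function name and default width are assumptions made for illustration.

```python
def prefix_add(a, b, n=8, c_in=0):
    """Add two unsigned n-bit integers with a parallel-prefix carry scan.

    Returns (sum mod 2**n, carry_out).
    """
    # Bit-level signals; index 0 is the carry-in position: G_0 = c_in, P_0 = 0.
    g = [c_in] + [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    G, P = g[:], p[:]                   # group signals, initially one bit wide
    d = 1
    while d <= n:                       # each level doubles the covered span
        G_next, P_next = G[:], P[:]
        for i in range(d, n + 1):
            G_next[i] = G[i] | (P[i] & G[i - d])   # G_{i:j} = G_{i:k} + P_{i:k} G_{(k-1):j}
            P_next[i] = P[i] & P[i - d]            # P_{i:j} = P_{i:k} P_{(k-1):j}
        G, P = G_next, P_next
        d *= 2

    total = 0
    for i in range(1, n + 1):           # S_i = P_i XOR G_{(i-1):0}
        total |= (p[i] ^ G[i - 1]) << (i - 1)
    return total, G[n]
```

The different prefix topologies (Brent-Kung, Kogge-Stone, and others) compute the same group signals and differ in how many combining cells and logic levels they use, which is where the area, power, and delay trade-offs in Table 2 come from.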

Different adders are designed for different bit sizes and their VLSI design metrics are compared, as shown in Table 2. The reported delay is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area, and also on the bit size. For example, if the design constraint that the user chooses is power, then the Brent-Kung adder and array multiplier pair is considered the best combination to implement the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is created for multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are of his choice and specific to a particular design constraint, and generate the RTL for the same. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI designed is shown in Figure 12. An $H_{\mathrm{cub}}$ based algorithm is considered for implementing MCM blocks in multiplierless digital filters for a specific user-defined option in DiFiDOT. Since all the multipliers can be realised as a single block in transposed IIR and FIR filters, these structures are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced by a block constructed from adders/subtractors, negation operations, and shifters in the multiplierless design approach. The generated MCM block has a tree depth in terms of its components, and in our work this depth is assumed to be infinity, that is, unconstrained. The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle has to be reiterated due to a specification change, the time taken to reiterate is very small.
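To make the multiplierless idea concrete, the sketch below multiplies one input by three constants using only shifts, additions, and subtractions, and reuses the intermediate term $7x$ for two of the outputs. The constants 7, 23, and 39 and the decomposition are hand-picked for illustration; they are not coefficients from the filters in this paper, and the decomposition is not the output of the $H_{\mathrm{cub}}$ algorithm.

```python
def mcm_7_23_39(x):
    """Return (7x, 23x, 39x) using three adders/subtractors and no multiplier."""
    t7 = (x << 3) - x        # 7x  = 8x - x
    t23 = (x << 4) + t7      # 23x = 16x + 7x   (reuses 7x)
    t39 = (x << 5) + t7      # 39x = 32x + 7x   (reuses 7x)
    return t7, t23, t39
```

Sharing `t7` is the point of treating the constants jointly: three separate shift-and-add multipliers would need more adders than the single shared block. In a transposed-form filter every coefficient multiplies the same input sample, which is why such a block can replace all the multipliers at once.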

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block; the shortest path and the critical path are computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device considered for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.
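For readers unfamiliar with the two path problems, the following software sketch shows what a critical path solver and a shortest path solver compute on a data flow graph (DFG). It is a plain Python model under stated assumptions: the four-node graph, its node computation times, and its register counts are hypothetical, the shortest path routine is textbook Bellman-Ford, and nothing here reflects the FPGA architectures or their timing.

```python
from functools import lru_cache


def critical_path(node_time, edges):
    """Longest total computation time along any path that crosses no register.

    Assumes the zero-register edges form no cycle (a valid synchronous circuit).
    """
    zero_succ = {v: [] for v in node_time}
    for u, v, w in edges:
        if w == 0:
            zero_succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        tail = max((longest_from(v) for v in zero_succ[u]), default=0)
        return node_time[u] + tail

    return max(longest_from(u) for u in node_time)


def shortest_paths(nodes, edges, source):
    """Bellman-Ford distances from source; returns None if a negative cycle exists."""
    dist = {v: float("inf") for v in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None
    return dist


# Hypothetical DFG: edge = (from, to, number of registers on the edge).
NODE_TIME = {"A": 1, "B": 1, "C": 2, "D": 2}
EDGES = [("A", "B", 1), ("B", "C", 1), ("B", "D", 2), ("C", "A", 0), ("D", "A", 0)]
```

With these definitions, `critical_path(NODE_TIME, EDGES)` evaluates the register-free paths C to A and D to A, and `shortest_paths(NODE_TIME, EDGES, "A")` gives the minimum register count from A to every node. A retiming algorithm calls solvers of this kind repeatedly, which is the motivation for offloading them.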


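Table 2: Comparison of adders for delay, power, and area.

| Type of adder | Delay in ns (32 bit) | Delay in ns (16 bit) | Delay in ns (8 bit) | Power in mW (32 bit) | Power in mW (16 bit) | Power in mW (8 bit) | Number of LUTs (32 bit) | Number of LUTs (16 bit) | Number of LUTs (8 bit) |
|---|---|---|---|---|---|---|---|---|---|
| Ling | 8854 | 1524 | 2021 | 6 | 9 | 18 | 23 | 53 | 107 |
| BrentKung | 104 | 1839 | 2583 | 4 | 6 | 9 | 15 | 30 | 63 |
| Ripple | 1212 | 2063 | 376 | 2 | 7 | 14 | 9 | 18 | 36 |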

Table 3: Device utilization and timing summary of path solvers.

| Path solver name | Number of slices | Number of LUTs | Number of slice flip-flops | Min period in ns | Setup time in ns | Hold time in ns | Max frequency (Hz) |
|---|---|---|---|---|---|---|---|
| Critical path solver | 5804 | 10462 | 3664 | 9068 ns | 1572 ns | 6141 ns | 110277 |
| Shortest path solver | 4147 | 7511 | 1496 | 14089 ns | 10477 ns | 4114 ns | 70978 |

Here various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the GPP based path solvers. In Table 4, the time taken by the FPGA path solvers is compared with the time taken by the same algorithms run on the general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers, since the FPGA times are reported in nanoseconds and the GPP times in milliseconds. This speedup matters because the path solvers take up most of the computation time in retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period gets reduced. The register count gets altered depending on the filter's iteration bound. Here three models are considered. Model 1 is the filter without retiming and with adder, subtractor, multiplier, and delay elements. Model 2 is the retimed filter based on the clock period minimization algorithm. Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]: the finite state machine is extracted from the retimed circuit and compared with the original circuit for functionality. The comparison results are shown in Figure 13. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced while compromising the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of orders 2 to 12). Series: clock period and register count for Model 1, Model 2, and Model 3.

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also based on the bit size. If the user does not want to use these built-in combinations, the user can choose any of the available adders and multipliers for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

| Filter order | Critical path solver, IIR filter, FPGA based (ns) | Critical path solver, IIR filter, GPP based (ms) | Critical path solver, FIR filter, FPGA based (ns) | Critical path solver, FIR filter, GPP based (ms) | Shortest path solver, IIR filter, FPGA based (ns) | Shortest path solver, IIR filter, GPP based (ms) | Shortest path solver, FIR filter, FPGA based (ns) | Shortest path solver, FIR filter, GPP based (ms) |
|---|---|---|---|---|---|---|---|---|
| 2 | 460 | 138 | 906 | 1283 | 278 | 305 | 305 | 1280 |
| 4 | 1571 | 1578 | 1631 | 1446 | 368 | 1391 | 1391 | 1319 |
| 6 | 2998 | 1918 | 1923 | 1547 | 398 | 1542 | 1542 | 1731 |
| 8 | 3162 | 2190 | 2971 | 1642 | 452 | 2523 | 2523 | 3294 |
| 10 | 3981 | 2627 | 3653 | 1861 | 536 | 4293 | 4293 | 4534 |
| 12 | 4672 | 3142 | 4328 | 2352 | 671 | 5534 | 5534 | 5161 |

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

| Filter block | Adder/multiplier combination | Before retiming: number of LUTs | Before retiming: max operating freq. in MHz | Before retiming: power in mW | After retiming: number of LUTs | After retiming: max operating freq. in MHz | After retiming: power in mW |
|---|---|---|---|---|---|---|---|
| IIR-10 | Brent-Kung adder / array multiplier | 2222 | 62526 | 99 | 2411 | 76977 | 89 |
| IIR-10 | Ling adder / Vedic multiplier | 2214 | 69702 | 112 | 2193 | 95381 | 94 |
| IIR-10 | Ripple carry adder / Booth multiplier | 2146 | 50861 | 114 | 1809 | 65248 | 95 |
| FIR-10 | Brent-Kung adder / array multiplier | 1736 | 62526 | 94 | 1811 | 9943 | 85 |
| FIR-10 | Ling adder / Vedic multiplier | 2162 | 72493 | 111 | 2271 | 10072 | 95 |
| FIR-10 | Ripple carry adder / Booth multiplier | 1637 | 52302 | 105 | 1615 | 71345 | 87 |

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here 3 models are used:

(i) Model 1: Filter with adder, multiplier, and delay elements.

(ii) Model 2: Filter based on the multiplierless multiple constant multiplication approach.

(iii) Model 3: Retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is retained after retiming. The frequency improvement seen for different filters across the above models is given in Figure 14. It is seen that the operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor. Bar chart over filter type (FIR-2 to FIR-12 and IIR-2 to IIR-12) with a vertical axis of frequency improvement from 0 to 90. Series: frequency improvement from Model 1 to Model 2; frequency improvement from Model 1 to Model 3.

Table 6: Comparison of area, delay, and power for different models of various digital filters.

| Filter block | Adder, multipliers, flip-flops/delay (Model 1) | Adder, multipliers, flip-flops/delay (Model 2) | Adder, multipliers, flip-flops/delay (Model 3) | Max freq. in MHz (Model 1) | Max freq. in MHz (Model 2) | Max freq. in MHz (Model 3) | Power in Watts (Model 1) | Power in Watts (Model 2) | Power in Watts (Model 3) |
|---|---|---|---|---|---|---|---|---|---|
| FIR-2 | 523 | 503 | 504 | 5154 | 19214 | 34062 | 0056 | 0063 | 0065 |
| FIR-4 | 1035 | 1105 | 1108 | 5941 | 10841 | 22204 | 0047 | 0057 | 0060 |
| FIR-6 | 727 | 1707 | 17014 | 6291 | 6764 | 25947 | 0051 | 0062 | 0064 |
| FIR-8 | 1559 | 2209 | 22016 | 5482 | 6592 | 11791 | 0054 | 0058 | 0065 |
| FIR-10 | 18611 | 25011 | 25011 | 4822 | 5637 | 10072 | 0058 | 0061 | 0063 |
| FIR-12 | 20713 | 29013 | 29013 | 4634 | 5486 | 19340 | 0060 | 0063 | 0067 |
| IIR-2 | 943 | 1103 | 1103 | 5503 | 7553 | 8910 | 0047 | 0050 | 0050 |
| IIR-4 | 1675 | 2005 | 1906 | 2278 | 11388 | 15165 | 0059 | 0062 | 0063 |
| IIR-6 | 24117 | 3507 | 3508 | 3871 | 4254 | 53142 | 0051 | 0059 | 0058 |
| IIR-8 | 301110 | 33010 | 3307 | 2946 | 7014 | 11021 | 0044 | 0064 | 0081 |
| IIR-10 | 371613 | 54013 | 54014 | 3643 | 4885 | 95381 | 0051 | 0067 | 0085 |
| IIR-12 | 422017 | 63017 | 63019 | 3973 | 5074 | 10152 | 0063 | 0071 | 0088 |

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. A good quality ECG is utilized by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for removing the noise. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Figure 15: Structure of ECG block for power noise removal: (a) block diagram, in which the ECG source and the noise source are added and passed through the filter to a scope; (b) filter block expanded.
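As a rough software illustration of power-line noise removal, the sketch below runs a second-order IIR notch filter over a synthetic signal. It is a generic textbook notch and not the retimed multiplierless MCM filter used in the model above. The sampling rate of 500 Hz, the pole radius of 0.95, the 50 Hz hum frequency, and the 1 Hz sine used as a stand-in for a slow ECG component are all assumptions made for the example.

```python
import math


def notch_coefficients(f0, fs, r=0.95):
    """Second-order notch at f0 Hz: zeros on the unit circle, poles at radius r."""
    w0 = 2.0 * math.pi * f0 / fs
    b = [1.0, -2.0 * math.cos(w0), 1.0]
    a = [1.0, -2.0 * r * math.cos(w0), r * r]
    gain = sum(a) / sum(b)              # normalise to unity gain at DC
    return [gain * c for c in b], a


def iir_filter(b, a, x):
    """Direct-form difference equation; assumes a[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


FS = 500.0                               # assumed sampling rate in Hz
N = 2000
t = [n / FS for n in range(N)]
slow = [math.sin(2.0 * math.pi * 1.0 * ti) for ti in t]     # stand-in for the wanted signal
hum = [math.sin(2.0 * math.pi * 50.0 * ti) for ti in t]     # power-line interference

b, a = notch_coefficients(50.0, FS)
hum_out = iir_filter(b, a, hum)
slow_out = iir_filter(b, a, slow)
```

After the start-up transient has decayed, `hum_out` is essentially zero while `slow_out` keeps close to its original amplitude. For a 60 Hz mains frequency the same function would be called with `f0=60.0`.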

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication, and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves a lot of the designers' time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing the path solver architectures for the critical path and shortest path on FPGA, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain when compared to designs in which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that best fits an application.

18 VLSI Design

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] C Soviani O Tardieu and S A Edwards ldquoOptimizing sequen-tial cycles through shannon decomposition and retimingrdquo IEEETransactions on Computer-Aided Design of Integrated Circuitsand Systems vol 26 no 3 pp 456ndash467 2007

[2] S Bommu N OrsquoNeill and M Ciesielski ldquoRetiming-based fac-torization for sequential logic optimizationrdquoACMTransactionson Design Automation of Electronic Systems vol 5 no 3 pp373ndash398 2000

[3] K K Parhi ldquoA systematic approach for design of digit-serialsignal processing architecturesrdquo IEEE Transactions on Circuitsand Systems vol 38 no 4 pp 358ndash375 1991

[4] D Yagain A V Krishna and S Chennapnoor ldquoDesign opti-mization platform for synthesizable high speed digital filtersusing retiming techniquerdquo in Proceedings of the 10th IEEEInternational Conference on Semiconductor Electronics (ICSE12) pp 551ndash555 Kuala Lumpur Malaysia September 2012

[5] N Shenoy ldquoRetiming theory and practicerdquo Integration theVLSI Journal vol 22 no 1-2 pp 1ndash21 1997

[6] C E Leiserson and J B Saxe ldquoRetiming synchronous circuitryrdquoAlgorithmica vol 6 no 1ndash6 pp 5ndash35 1991

[7] Y Tsao and K Choi ldquoArea-efficient VLSI implementation forparallel linear-phase FIR digital filters of odd length based onfast FIR algorithmrdquo IEEE Transactions on Circuits and SystemsII Express Briefs vol 59 no 6 pp 371ndash375 2012

[8] K K Parhi VLSI Digital Signal Processing Systems Design andImplementation John Wiley amp Sons 2007

[9] K K Parhi ldquoHierarchical folding and synthesis of iterativedata flow graphsrdquo IEEE Transactions on Circuits and Systems IIExpress Briefs vol 60 no 9 pp 597ndash601 2013

[10] X Zhu T Basten M Geilen and S Stuijk ldquoEfficient retimingof multirate DSP algorithmsrdquo IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems vol 31 no 6 pp831ndash844 2012

[11] N Liveris C Lin J Wang H Zhou and P Banerjee ldquoRetimingfor synchronous data flowgraphsrdquo inProceedings of the Asia andSouth Pacific Design Automation Conference (ASP-DAC 07)vol 7 pp 480ndash485 Yokohama Japan January 2007

[12] N L Passos E H Sha and S C Bass ldquoOptimizing DSP flowgraphs via schedule-based multidimensional retimingrdquo IEEETransactions on Signal Processing vol 44 no 1 pp 150ndash1551996

[13] J R Jiang and R K Brayton ldquoRetiming and resynthesis acomplexity perspectiverdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 25 no 12 pp2674ndash2686 2006

[14] N Maheshwari and S Sapatnekar ldquoEfficient retiming of largecircuitsrdquo IEEE Transactions on Very Large Scale Integration(VLSI) Systems vol 6 no 1 pp 74ndash83 1998

[15] D Yagain and A Vijaya Krishna ldquoHigh speed digital filterdesign using register minimization retiming amp parallel prefixaddersrdquo in Proceedings of the 3rd International Conference onEmerging Applications of Information Technology (EAIT rsquo12) pp449ndash453 Kolkata India December 2012

[16] J Cong and C Wu ldquoAn efficient algorithm for performance-optimal FPGA technologymappingwith retimingrdquo IEEETrans-actions on Computer-Aided Design of Integrated Circuits andSystems vol 17 no 9 pp 738ndash748 1998

[17] D Yagain A Vijayakrishna P Nikhil A Adarsh and SKarthikeyan ldquoFPGA based path solvers for DFGs in high levelsynthesisrdquo in Proceedings of the 2nd International Conference onAdvances in Computational Tools for Engineering Applications(ACTEA rsquo12) pp 273ndash278 IEEE Beirut Lebanon December2012

[18] Y Voronenko andM Puschel ldquoMultiplierless multiple constantmultiplicationrdquo ACM Transactions on Algorithms vol 3 no 2article 11 Article ID 1240234 2007

[19] K Johansson O Gustafsson and L Wanhammar ldquoMultipleconstant multiplication for digit-serial implementation of lowpower FIR filtersrdquoWSEAS Transactions on Circuits and Systemsvol 5 no 7 pp 1001ndash1008 2006

[20] A Baliga ldquoDesign of high-speed adders for efficient digitaldesign blocksrdquo ISRN Electronics vol 2012 Article ID 2537429 pages 2012

[21] H D Tiwari G Gankhuyag C M Kim and Y B CholdquoMultiplier design based on ancient indian vedic mathematicsrdquoin Proceedings of the International SoC Design Conference(ISOCC rsquo08) vol 2 pp II65ndashII68 Busan Republic of KoreaNovember 2008

[22] G Dimitrakopoulos and D Nikolos ldquoHigh-speed parallel-prefix VLSI ling addersrdquo IEEE Transactions on Computers vol54 no 2 pp 225ndash231 2005

[23] L Aksoy E da Costa P Flores and J Monteiro ldquoExact andapproximate algorithms for the optimization of area and delayin multiple constant multiplicationsrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol27 no 6 pp 1013ndash1026 2008

[24] M N Mneimneh K A Sakallah and J Moondanos ldquoPre-serving synchronizing sequences of sequential circuits afterretimingrdquo in Proceedings of the Asia and South Pacifi c DesignAutomation Conference pp 579ndash584 IEEE Press 2004

[25] D Yagain and K A Vijaya ldquoFir filter design based on retimingand automation using vlsi design metricsrdquo in Proceedings of theInternational Conference on Technology Informatics Manage-ment Engineering and Environment (TIME-E rsquo13) pp 17ndash22IEEE 2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of


Table 1: Comparison of multipliers for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
multiplier   32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Array        761     399     21       21      11      7        1519    375     91
Booth        861     2799    149      25      15      12       1277    317     77
Vedic        707     3902    244      28      18      12       2378    565     126

The basic equations used in parallel prefix adders are given below. The equations of bit generate and propagate are

\[
\begin{aligned}
G_{0:0} &= G_{0} = c_{\text{in}}, \\
P_{0:0} &= P_{0} = 0, \\
G_{i:j} &= G_{i:k} + p_{i:k} \ast g_{(k-1):j}, \\
P_{i:j} &= P_{i:k} \ast p_{(k-1):j}.
\end{aligned}
\tag{16}
\]

The sum generation is given by

\[
S_{i} = P_{i} \;\text{XOR}\; G_{(i-1):0}. \tag{17}
\]
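As a rough illustration of how (16) and (17) are used, the following Python sketch adds two unsigned integers from bit-level generate and propagate signals combined with the prefix operator. The function names `prefix_combine` and `prefix_add` are hypothetical, and two simplifications are assumed: the prefixes are evaluated by a serial scan rather than by the tree structure of a Ling or Brent-Kung adder, and the carry-in is handled as a separate term instead of being folded into bit 0 as in (16).

```python
def prefix_combine(hi, lo):
    """Combine (G, P) of an upper bit range with (G, P) of the adjacent lower range."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a, b, n_bits, c_in=0):
    """Add two n-bit unsigned integers using generate/propagate prefixes."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n_bits)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n_bits)]  # bit propagate

    # prefix[i] holds (G, P) for the bit range i..0. A serial scan is used here;
    # a hardware prefix adder evaluates the same operator in a tree.
    prefix = []
    acc = None
    for i in range(n_bits):
        acc = (g[i], p[i]) if acc is None else prefix_combine((g[i], p[i]), acc)
        prefix.append(acc)

    total = 0
    carry = c_in  # carry into bit 0
    for i in range(n_bits):
        total |= (p[i] ^ carry) << i  # sum bit: propagate XOR incoming carry
        # carry into bit i + 1: G_{i:0} + P_{i:0} * c_in
        carry = prefix[i][0] | (prefix[i][1] & c_in)
    return total | (carry << n_bits)  # carry-out becomes bit n_bits
```

The associativity of `prefix_combine` is what allows different adder families to regroup the same computation into trees of different depth and fan-out.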

Different adders are designed for different bit sizes, and their VLSI design metrics are compared in Table 2. The reported delay is the combinational path delay after synthesis, measured in ns.

In the GUI, an option is created for selecting a particular adder and multiplier combination, depending on whether the performance parameter is speed, power, or area, and on the bit size. For example, if the design constraint chosen by the user is power, then the Brent-Kung adder and array multiplier pair is considered the best combination for implementing the filter in the design optimization GUI. The user can also choose any one of the area, power, or speed constraints for digital filter HDL generation. Along with this, an option is provided for a multiplierless filter design description based on the MCM approach. It is seen that the retimed MCM circuits outperform the existing MCM methods [23] in terms of speed. Using this tool, the user can design a retimed digital filter whose combinational elements are chosen for a particular design constraint and generate the RTL for it. The obtained RTL can be synthesized with any of the commercially available synthesis tools. The GUI is shown in Figure 12. An Hcub based algorithm is used for implementing the MCM blocks in multiplierless digital filters when the user selects this option in DiFiDOT. Since all the multipliers can be realised as a single block in transposed IIR and FIR filters, these structures are well suited for MCM implementation. After retiming, the multiplier blocks in the digital filter can be replaced, in the multiplierless design approach, by a block constructed from adders/subtractors, negation operations, and shifters. The generated MCM block has a tree depth in terms of its components, and in our work this depth is assumed to be unbounded (infinity). The tool DiFiDOT automatically generates the HDL of the retimed digital filter under consideration, which is directly synthesizable. With this tool and automation, even if the design cycle must be reiterated because of a specification change, the time taken to reiterate is very small.

Figure 12: GUI for the design optimization environment created to generate synthesizable retimed digital filter HDL optimized for VLSI design metrics.
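To make the multiplierless idea concrete, the sketch below multiplies one input by several constants using only shifts, additions, and subtractions, and then uses that block inside a transposed-form FIR filter. The constants (7, 11, 21), the function names, and the hand-written decomposition are assumptions chosen for illustration; this is not the Hcub algorithm, which searches for such shared decompositions automatically.

```python
def mcm_7_11_21(x):
    """Return (7x, 11x, 21x) using only shifts, additions, and subtractions."""
    x7 = (x << 3) - x      # 7x  = 8x - x
    x11 = x7 + (x << 2)    # 11x = 7x + 4x  (reuses 7x)
    x21 = x7 + (x7 << 1)   # 21x = 7x + 14x (reuses 7x)
    return x7, x11, x21


def fir_transposed(samples):
    """Transposed-form FIR with coefficients (7, 11, 21); registers start at zero."""
    r1 = r2 = 0
    out = []
    for x in samples:
        # In the transposed form every coefficient multiplies the same input
        # sample, so all products come from a single MCM block.
        x7, x11, x21 = mcm_7_11_21(x)
        out.append(x7 + r1)  # y[n] = 7 x[n] + r1[n-1]
        r1 = x11 + r2        # r1[n] = 11 x[n] + r2[n-1]
        r2 = x21             # r2[n] = 21 x[n]
    return out


print(fir_transposed([1, 0, 0, 0]))  # impulse response: [7, 11, 21, 0]
```

Three adders/subtractors produce all three products here because the intermediate term 7x is shared, which is the kind of saving an MCM block is meant to provide.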

4. Experimental Results

This section is divided into three parts: the first part presents the results of retiming with FPGA based path solvers, the second part presents a comparison of various retiming techniques, and the third part presents the timing results of retimed filter structures with MCM blocks.

4.1. Results on Path Solvers for Retiming. The main idea of implementing path solver algorithms on FPGA is to speed up the results for retiming purposes. The inputs are passed to the FPGA based path solver block by a processor where the retiming algorithm is implemented. The computations are performed in the FPGA based block, and the shortest path along with the critical path is computed and communicated back to the processor, where retiming is performed. This reduces the burden on the main processor. For comparison, a set of designs is used to test the path solver algorithms. The designs are a diverse set of DSP functions of varying complexity, which includes recursive and nonrecursive filter structures. The target device for path solver implementation is the Spartan-6 family based XC6SLX16. The simulation and synthesis of the path solvers are performed using the Xilinx ISE tool suite, and the device utilization and timing results after synthesis are shown in Table 3.
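The two quantities that the path solvers return can be sketched in software as follows. This is a minimal Python illustration under stated assumptions, not the FPGA architecture itself: the data flow graph is given as a list of `(u, v, delays)` edges, the critical path is taken as the longest computation time over register-free paths, and the shortest paths are computed over edge delay counts with the Floyd-Warshall algorithm. The function names and the example graph are hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path(node_time, edges):
    """Longest total computation time over paths whose edges carry no delays."""
    preds = {v: set() for v in node_time}
    for u, v, delays in edges:
        if delays == 0:
            preds[v].add(u)
    arrival = {}
    # static_order() yields each node after all of its zero-delay predecessors.
    for v in TopologicalSorter(preds).static_order():
        arrival[v] = node_time[v] + max((arrival[u] for u in preds[v]), default=0)
    return max(arrival.values())


def shortest_paths(nodes, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) with edge delays as weights."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v, delays in edges:
        dist[u][v] = min(dist[u][v], delays)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Hypothetical DFG: nodes 1 and 2 are adders (1 time unit), 3 and 4 multipliers (2).
node_time = {1: 1, 2: 1, 3: 2, 4: 2}
edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]

print(critical_path(node_time, edges))                # 3 (multiplier followed by adder)
print(shortest_paths(list(node_time), edges)[1][2])   # 1 (path 1 -> 3 -> 2)
```

Both routines are dominated by nested loops over nodes and edges, which is the kind of regular computation that benefits from a dedicated hardware block.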


Table 2: Comparison of adders for delay, power, and area.

Type of      Delay in ns              Power in mW              Number of LUTs
adder        32 bit  16 bit  8 bit    32 bit  16 bit  8 bit    32 bit  16 bit  8 bit
Ling         8854    1524    2021     6       9       18       23      53      107
Brent-Kung   104     1839    2583     4       6       9        15      30      63
Ripple       1212    2063    376      2       7       14       9       18      36

Table 3: Device utilization and timing summary of path solvers.

Path solver            Device utilization summary           Timing summary                                            Max frequency
name                   Logic utilization            Used    Min period in ns   Setup time in ns   Hold time in ns     (Hz)
Critical path solver   Number of slices             5804    9068               1572               6141                110277
                       Number of LUTs               10462
                       Number of slice flip-flops   3664
Shortest path solver   Number of slices             4147    14089              10477              4114                70978
                       Number of LUTs               7511
                       Number of slice flip-flops   1496
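The maximum frequency of a synthesized block is the reciprocal of its minimum clock period. As a consistency check on Table 3, and assuming the minimum periods are read as 9.068 ns and 14.089 ns (the decimal placement is an assumption, not something the table states), the relation gives

\[
f_{\max} = \frac{1}{T_{\min}}, \qquad
\frac{1}{9.068 \times 10^{-9}\,\text{s}} \approx 110.28\ \text{MHz}, \qquad
\frac{1}{14.089 \times 10^{-9}\,\text{s}} \approx 70.98\ \text{MHz}.
\]

Under that reading, the tabulated frequency digits 110277 and 70978 correspond to about 110.28 MHz for the critical path solver and 70.98 MHz for the shortest path solver.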

Here, various IIR and FIR filters have been considered to analyze the FPGA based path solvers, and the execution time of the FPGA design is compared with that of the general purpose processor (GPP) based design. GPP denotes the CPU time in milliseconds required by the path solver to find the minimum solution on a PC with an Intel Pentium 5 machine at 2 GHz and 4 GB of memory. The FPGA based design solves for the critical path and shortest path in much less time than the general purpose processor based path solvers. The time taken by the FPGA path solvers is compared in Table 4 with the time taken by the algorithms run on a general purpose processor in the MATLAB environment. The time overhead needed for the general purpose processor, where the retiming algorithm is implemented in MATLAB, to communicate with the FPGA based path solvers is around 210 ns for each computation. Even including this overhead, the time gain achieved is substantial when compared to designs without FPGA based path solvers: Table 4 reports the FPGA times in nanoseconds and the GPP times in milliseconds. This gain matters because the path solvers take up most of the computation time in retiming.

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed, and they are compared with respect to the clock period and register count before and after retiming. It is observed that after retiming the clock period is reduced. The register count changes depending on the filter's iteration bound. Here three models are considered: Model 1 is the filter without retiming and with adder, subtractor, multiplier, and delay elements; Model 2 is the retimed filter based on the clock period minimization algorithm; Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for its functionality. It is observed that the clock period minimization retiming algorithm is efficient in reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the cost of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of orders 2 to 12; series: clock period and register count for Models 1, 2, and 3).
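The trade-off between clock period and register count can be seen on a small example. The Python sketch below applies a retiming vector r to a data flow graph using the standard rule w_r(u -> v) = w(u -> v) + r(v) - r(u) and reports the clock period and register count before and after. The graph, the node times, the chosen vector, and the function names are assumptions for illustration; the sketch applies a given retiming rather than solving for an optimal one, and it counts registers per edge without register sharing.

```python
from graphlib import TopologicalSorter


def retime(edges, r):
    """Return edges with retimed delays w_r(u -> v) = w + r[v] - r[u]."""
    retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
    if any(w < 0 for _, _, w in retimed):
        raise ValueError("infeasible retiming: negative register count on an edge")
    return retimed


def clock_period(node_time, edges):
    """Longest computation time along any register-free path."""
    preds = {v: set() for v in node_time}
    for u, v, w in edges:
        if w == 0:
            preds[v].add(u)
    arrival = {}
    for v in TopologicalSorter(preds).static_order():
        arrival[v] = node_time[v] + max((arrival[u] for u in preds[v]), default=0)
    return max(arrival.values())


def register_count(edges):
    """Total number of registers, counting every edge separately (no sharing)."""
    return sum(w for _, _, w in edges)


# Hypothetical DFG: nodes 1 and 2 are adders (1 time unit), 3 and 4 multipliers (2).
node_time = {1: 1, 2: 1, 3: 2, 4: 2}
edges = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
r = {1: 0, 2: 1, 3: 0, 4: 0}  # one candidate retiming vector

retimed = retime(edges, r)
print(clock_period(node_time, edges), register_count(edges))      # 3 4
print(clock_period(node_time, retimed), register_count(retimed))  # 2 5
```

In this example the clock period drops from 3 to 2 time units while the register count rises from 4 to 5, which mirrors the behaviour described for clock period minimization retiming.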

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results of FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is selected depending on whether the performance parameter is delay, power, or area, and on the bit size. If the user does not want to use these built-in combinations, the user can choose any of the available components for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

         Critical path solver algorithm          Shortest path solver algorithm
         IIR filter         FIR filter           IIR filter         FIR filter
Filter   FPGA      GPP      FPGA      GPP        FPGA      GPP      FPGA      GPP
order    (ns)      (ms)     (ns)      (ms)       (ns)      (ms)     (ns)      (ms)
2        460       138      906       1283       278       305      305       1280
4        1571      1578     1631      1446       368       1391     1391      1319
6        2998      1918     1923      1547       398       1542     1542      1731
8        3162      2190     2971      1642       452       2523     2523      3294
10       3981      2627     3653      1861       536       4293     4293      4534
12       4672      3142     4328      2352       671       5534     5534      5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

                                                 Before retiming                          After retiming
Filter   Adder/multiplier combination            Number    Max operating   Power         Number    Max operating   Power
block                                            of LUTs   freq. in MHz    in mW         of LUTs   freq. in MHz    in mW
IIR-10   Brent-Kung adder/array multiplier       2222      62526           99            2411      76977           89
         Ling adder/Vedic multiplier             2214      69702           112           2193      95381           94
         Ripple carry adder/Booth multiplier     2146      50861           114           1809      65248           95
FIR-10   Brent-Kung adder/array multiplier       1736      62526           94            1811      9943            85
         Ling adder/Vedic multiplier             2162      72493           111           2271      10072           95
         Ripple carry adder/Booth multiplier     1637      52302           105           1615      71345           87

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results of the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Here three models are used:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared for performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits before and after retiming is retained. The frequency improvement seen for different filters under these models is given in Figure 14. The operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.

Figure 14: Frequency improvement in factor (x-axis: filter type, FIR-2 to FIR-12 and IIR-2 to IIR-12; y-axis: frequency improvement; series: improvement from Model 1 to Model 2 and from Model 1 to Model 3).

5. Application Example

Table 6: Comparison of area, delay, and power for different models of various digital filters.

Filter   Adders/multipliers/flip-flops (delay)   Max freq. in MHz                Power in watts
block    Model 1     Model 2    Model 3          Model 1   Model 2   Model 3     Model 1   Model 2   Model 3
FIR-2    5/2/3       5/0/3      5/0/4            5154      19214     34062       0.056     0.063     0.065
FIR-4    10/3/5      11/0/5     11/0/8           5941      10841     22204       0.047     0.057     0.060
FIR-6    7/2/7       17/0/7     17/0/14          6291      6764      25947       0.051     0.062     0.064
FIR-8    15/5/9      22/0/9     22/0/16          5482      6592      11791       0.054     0.058     0.065
FIR-10   18/6/11     25/0/11    25/0/11          4822      5637      10072       0.058     0.061     0.063
FIR-12   20/7/13     29/0/13    29/0/13          4634      5486      19340       0.060     0.063     0.067
IIR-2    9/4/3       11/0/3     11/0/3           5503      7553      8910        0.047     0.050     0.050
IIR-4    16/7/5      20/0/5     19/0/6           2278      11388     15165       0.059     0.062     0.063
IIR-6    24/11/7     35/0/7     35/0/8           3871      4254      53142       0.051     0.059     0.058
IIR-8    30/11/10    33/0/10    33/0/7           2946      7014      11021       0.044     0.064     0.081
IIR-10   37/16/13    54/0/13    54/0/14          3643      4885      95381       0.051     0.067     0.085
IIR-12   42/20/17    63/0/17    63/0/19          3973      5074      10152       0.063     0.071     0.088

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. Good quality ECG is utilized by physicians for interpretation and identification of physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noises such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for removing the noise. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested on an ECG signal corrupted by power-line noise. The filter efficiently filters out the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Figure 15: Structure of ECG block for power noise removal: (a) block diagram; (b) filter block expanded.
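The removal of power-line interference can be illustrated in software with a simple second-order IIR notch filter. This Python sketch is only a stand-in under stated assumptions: it is not the retimed multiplierless MCM filter of the paper, the sampling rate (500 Hz), pole radius (0.95), and synthetic test signal are hypothetical choices, and a 1 Hz sinusoid is used in place of a real ECG waveform.

```python
import math


def notch_coefficients(f0, fs, r=0.95):
    """Second-order IIR notch at f0 Hz for sampling rate fs Hz, unity gain at DC."""
    w0 = 2.0 * math.pi * f0 / fs
    c = math.cos(w0)
    k = (1.0 - 2.0 * r * c + r * r) / (2.0 - 2.0 * c)  # normalizes the DC gain to 1
    b = (k, -2.0 * k * c, k)       # zeros on the unit circle at +/- w0
    a = (1.0, -2.0 * r * c, r * r)  # poles at radius r set the notch width
    return b, a


def iir_filter(b, a, x):
    """Direct-form I filtering of sequence x with zero initial conditions."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


fs = 500.0  # assumed sampling rate in Hz
t = [n / fs for n in range(2000)]
clean = [math.sin(2.0 * math.pi * 1.0 * tn) for tn in t]  # slow stand-in signal
noisy = [c + 0.5 * math.sin(2.0 * math.pi * 50.0 * tn) for c, tn in zip(clean, t)]

b, a = notch_coefficients(50.0, fs)
filtered = iir_filter(b, a, noisy)

# Compare after the start-up transient has decayed.
residual = max(abs(f - c) for f, c in zip(filtered[1000:], clean[1000:]))
print(f"max deviation from clean signal after settling: {residual:.4f}")
```

In a hardware realization, the constant coefficients `b` and `a` would be quantized to fixed point, at which point the multiplications become candidates for a shift-and-add MCM block.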

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for a specific VLSI design constraint such as power, area, or timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. This tool directly gives the synthesizable filter RTL, which saves a lot of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing the path solver architectures for the critical path and shortest path on FPGA, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain when compared to filter designs for which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that best fits an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, “Optimizing sequential cycles through Shannon decomposition and retiming,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O’Neill, and M. Ciesielski, “Retiming-based factorization for sequential logic optimization,” ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, “A systematic approach for design of digit-serial signal processing architectures,” IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, “Design optimization platform for synthesizable high speed digital filters using retiming technique,” in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE ’12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, “Retiming: theory and practice,” Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, “Retiming synchronous circuitry,” Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, “Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, “Hierarchical folding and synthesis of iterative data flow graphs,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, “Efficient retiming of multirate DSP algorithms,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, “Retiming for synchronous data flow graphs,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC ’07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, “Optimizing DSP flow graphs via schedule-based multidimensional retiming,” IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, “Retiming and resynthesis: a complexity perspective,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, “Efficient retiming of large circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, “High speed digital filter design using register minimization retiming & parallel prefix adders,” in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT ’12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, “An efficient algorithm for performance-optimal FPGA technology mapping with retiming,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, “FPGA based path solvers for DFGs in high level synthesis,” in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA ’12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, “Multiplierless multiple constant multiplication,” ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, “Multiple constant multiplication for digit-serial implementation of low power FIR filters,” WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, “Design of high-speed adders for efficient digital design blocks,” ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, “Multiplier design based on ancient Indian Vedic mathematics,” in Proceedings of the International SoC Design Conference (ISOCC ’08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix VLSI Ling adders,” IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, “Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, “Preserving synchronizing sequences of sequential circuits after retiming,” in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, “FIR filter design based on retiming and automation using VLSI design metrics,” in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E ’13), pp. 17–22, IEEE, 2013.


Page 15: Research Article Design of Synthesizable, Retimed Digital ...downloads.hindawi.com/journals/vlsi/2014/280701.pdf · processing (detection, compression, and reconstruction), modems,

VLSI Design 15

Table 2 Comparison of adders for delay power and area

Type of Delay in ns Power in mW Number of LUTsadder 32 bit 16 bit 8 bit 32 bit 16 bit 8 bit 32 bit 16 bit 8 bitLing 8854 1524 2021 6 9 18 23 53 107BrentKung 104 1839 2583 4 6 9 15 30 63Ripple 1212 2063 376 2 7 14 9 18 36

Table 3 Device utilization and timing summary of path solvers

Path solver name Device utilization summery Timing summery Max frequency (Hz)Logic utilization Used Min period in ns Setup time in ns Hold time in ns

Critical path solverNumber of slices 5804

9068 ns 1572 ns 6141 ns 110277Number of LUTs 10462Number of slice Flipops 3664

Shortest path solverNumber of slices 4147

14089 ns 10477 ns 4114 ns 70978Number of LUTs 7511Number of slice Flipops 1496

Here various IIR and FIR filters have been considered toanalyze the FPGA based path solvers and execution time ofFPGAdesign is comparedwith the general purpose processor(GPP) based design Also GPP denotes the required CPUtime in milliseconds of the path solver to find the minimumsolution on a PC with Intel Pentium 5 machine at 2GHzand 4 GB of memory FPGA based design solves for criticalpath and shortest path in very less time when compared tothe general purpose processor based path solvers The timetaken by the FPGA path solvers is compared in Table 4 to thetime taken by the algorithms run using general purpose pro-cessor with Matlab environment The time overhead neededfor general purpose processor where retiming algorithm isimplemented in MATLAB to communicate with the FPGAbased path solvers is around 210 ns for each computationIncluding this the time gain achieved is quite substantialwhen compared to designs without FPGA based path solversThese time gains are good and can really help speed up theresults which is crucial for retiming

4.2. Comparison of Clock Period Minimization and Register Minimization Retiming Technique. Different filter structures are designed and compared with respect to clock period and register count before and after retiming. It is observed that the clock period is reduced after retiming. The register count changes depending on the filter's iteration bound. Three models are considered here: Model 1 is the filter without retiming, built with adder, subtractor, multiplier, and delay elements; Model 2 is the retimed filter based on the clock period minimization algorithm; Model 3 is the retimed filter based on the register minimization algorithm. After retiming, the results are compared with the original circuit [24]. The comparison results are shown in Figure 13. After retiming, the finite state machine is extracted from the retimed circuit and compared with the original circuit for functionality. It is observed that the clock period minimization retiming algorithm is efficient at reducing the critical path and thereby increasing the clock frequency. However, this might increase the register count. In register minimization retiming [18], the number of registers after retiming is reduced at the expense of the clock period.

Figure 13: Clock period and register count before and after retiming for various digital filter blocks (IIR and FIR filters of orders 2 to 12; series: clock period and register count for Models 1, 2, and 3).

4.3. Area, Power, and Timing Results for Digital Filter before and after Retiming for Different Adder and Multiplier Combinations. The FIR and IIR filters are designed with different adder and multiplier combinations. As an application example, IIR and FIR filters [25] of order 10 are considered. Table 5 shows the results for the FIR/IIR filters before and after retiming for particular adder and multiplier combinations. The user can choose any adder and multiplier for the filter circuit depending on the design requirement. In the GUI, a particular adder and multiplier combination is considered depending on whether the performance parameter is delay, power, or area, and also on the bit size. A user who does not want these built-in combinations can choose any of the available adders and multipliers for FIR/IIR digital filter HDL generation with specific combinational components.

Table 4: Computation time comparison.

              Critical path solver algorithm              Shortest path solver algorithm
              IIR filter            FIR filter            IIR filter            FIR filter
Filter order  FPGA (ns)  GPP (ms)   FPGA (ns)  GPP (ms)   FPGA (ns)  GPP (ms)   FPGA (ns)  GPP (ms)
2             460        138        906        1283       278        305        305        1280
4             1571       1578       1631       1446       368        1391       1391       1319
6             2998       1918       1923       1547       398        1542       1542       1731
8             3162       2190       2971       1642       452        2523       2523       3294
10            3981       2627       3653       1861       536        4293       4293       4534
12            4672       3142       4328       2352       671        5534       5534       5161

Table 5: Comparison results of different adder/multiplier combinations for digital filters.

                                                     Before retiming                   After retiming
Filter block  Adder/multiplier combination           LUTs  Max freq (MHz)  Power (mW)  LUTs  Max freq (MHz)  Power (mW)
IIR-10        Brent-Kung adder / array multiplier    2222  62526           99          2411  76977           89
IIR-10        Ling adder / Vedic multiplier          2214  69702           112         2193  95381           94
IIR-10        Ripple-carry adder / Booth multiplier  2146  50861           114         1809  65248           95
FIR-10        Brent-Kung adder / array multiplier    1736  62526           94          1811  9943            85
FIR-10        Ling adder / Vedic multiplier          2162  72493           111         2271  10072           95
FIR-10        Ripple-carry adder / Booth multiplier  1637  52302           105         1615  71345           87
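The selection described in Section 4.3, where a combination is picked according to whether delay, power, or area matters most, can be sketched as a lookup over the IIR-10 after-retiming rows of Table 5. The function name `pick_combination` and the dictionary layout are hypothetical and are not the CAD tool's actual interface. The frequency entries are the digit strings as they appear in Table 5 and are assumed to share one scale so that they can be ranked against each other.

```python
# IIR-10 after-retiming figures copied from Table 5.
# "freq" holds the digits as printed; only their relative order is used here.
COMBINATIONS = {
    "Brent-Kung adder / array multiplier": {"luts": 2411, "freq": 76977, "power_mw": 89},
    "Ling adder / Vedic multiplier": {"luts": 2193, "freq": 95381, "power_mw": 94},
    "Ripple-carry adder / Booth multiplier": {"luts": 1809, "freq": 65248, "power_mw": 95},
}


def pick_combination(goal: str) -> str:
    """Return the combination that is best for 'delay', 'power', or 'area'."""
    if goal == "delay":   # highest maximum operating frequency
        return max(COMBINATIONS, key=lambda name: COMBINATIONS[name]["freq"])
    if goal == "power":   # lowest power
        return min(COMBINATIONS, key=lambda name: COMBINATIONS[name]["power_mw"])
    if goal == "area":    # fewest LUTs
        return min(COMBINATIONS, key=lambda name: COMBINATIONS[name]["luts"])
    raise ValueError("goal must be 'delay', 'power', or 'area'")


for goal in ("delay", "power", "area"):
    print(goal, "->", pick_combination(goal))
```

For these three rows each goal selects a different combination, which is why the choice is left to the design requirement.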

4.4. Results for Optimization of Latency, Multiplier Components, and Power in Multiplierless Multiple Constant Multiplication Based Filter Designs Using Retiming Algorithm. Table 6 presents the results for the filters designed using the multiplierless MCM approach and optimized using the retiming algorithm. Three models are used here:

(i) Model 1: filter with adder, multiplier, and delay elements;

(ii) Model 2: filter based on the multiplierless multiple constant multiplication approach;

(iii) Model 3: retimed multiplierless multiple constant multiplication based filter.

All three models are compared on performance parameters such as area, power, and delay. It is ensured that the functionality of the circuits is the same before and after retiming. The frequency improvement seen for different filters across these models is given in Figure 14. The operating frequency improves when the retiming technique is applied to multiplierless MCM based digital filters.
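To make the multiplierless idea concrete, the sketch below multiplies one input by two constants using only shifts, additions, and subtractions, and shares the intermediate term 5x between both products. The constants 45 and 75 and the function name `mcm_45_75` are chosen for illustration; the adder graph is derived by hand and is not the output of the MCM algorithm used in the paper.

```python
from typing import Tuple


def mcm_45_75(x: int) -> Tuple[int, int]:
    """Return (45*x, 75*x) using three add/subtract operations and no multiplier."""
    t5 = (x << 2) + x      # 5x  = 4x + x    (shared intermediate term)
    t45 = (t5 << 3) + t5   # 45x = 40x + 5x
    t75 = (t5 << 4) - t5   # 75x = 80x - 5x
    return t45, t75


print(mcm_45_75(3))  # (135, 225)
```

In hardware the shifts are fixed wiring, so the cost is the three adder/subtractor blocks; retiming then decides where registers sit between them.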

Figure 14: Frequency improvement in factor, from Model 1 to Model 2 and from Model 1 to Model 3, by filter type (FIR and IIR filters of orders 2 to 12).

5. Application Example

The electrocardiogram (ECG) is the most commonly used diagnostic method for heart diseases. Physicians use a good quality ECG to interpret and identify physiological and pathological phenomena. ECG recordings are often corrupted by high-frequency noise such as power-line interference, electromyography (EMG) noise, and instrumentation noise. An ECG is usually affected by the 50/60 Hz noise in the power supply lines. This noise can be eliminated by using a digital filter. The model is constructed in MATLAB and tested on ECG signals for noise removal. The constructed model uses a retimed multiplierless MCM filter, which is implemented on FPGA and tested with an ECG signal corrupted by power-line noise. The filter efficiently removes the noise and outputs the clean ECG signal. The ECG noise removal block using the optimized filter structure is shown in Figure 15.

Table 6: Comparison of area, delay, and power for different models of various digital filters.

              Adders/multipliers/flip-flops Delay: max freq (MHz)      Power (W)
Filter block  Model 1   Model 2   Model 3   Model 1  Model 2  Model 3  Model 1  Model 2  Model 3
FIR-2         5/2/3     5/0/3     5/0/4     5154     19214    34062    0.056    0.063    0.065
FIR-4         10/3/5    11/0/5    11/0/8    5941     10841    22204    0.047    0.057    0.060
FIR-6         7/2/7     17/0/7    17/0/14   6291     6764     25947    0.051    0.062    0.064
FIR-8         15/5/9    22/0/9    22/0/16   5482     6592     11791    0.054    0.058    0.065
FIR-10        18/6/11   25/0/11   25/0/11   4822     5637     10072    0.058    0.061    0.063
FIR-12        20/7/13   29/0/13   29/0/13   4634     5486     19340    0.060    0.063    0.067
IIR-2         9/4/3     11/0/3    11/0/3    5503     7553     8910     0.047    0.050    0.050
IIR-4         16/7/5    20/0/5    19/0/6    2278     11388    15165    0.059    0.062    0.063
IIR-6         24/11/7   35/0/7    35/0/8    3871     4254     53142    0.051    0.059    0.058
IIR-8         30/11/10  33/0/10   33/0/7    2946     7014     11021    0.044    0.064    0.081
IIR-10        37/16/13  54/0/13   54/0/14   3643     4885     95381    0.051    0.067    0.085
IIR-12        42/20/17  63/0/17   63/0/19   3973     5074     10152    0.063    0.071    0.088

Figure 15: Structure of ECG block for power noise removal: (a) block diagram (ECG source, noise source, adder, filter, and scope); (b) filter block expanded.
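The power-line noise removal described here can be imitated in software with a second-order IIR notch filter. The sketch below is a floating-point stand-in, not the retimed multiplierless MCM filter implemented on the FPGA. The paper does not state the filter type, so the notch structure, the 500 Hz sampling rate, the pole radius `r`, and the 5 Hz sine used in place of a real ECG are all assumptions made for illustration.

```python
import math


def notch_coefficients(f0_hz: float, fs_hz: float, r: float = 0.95):
    """Biquad notch at f0_hz (0 < f0_hz < fs_hz / 2) with unity gain at DC.

    Zeros sit on the unit circle at the notch frequency; poles sit at radius r
    on the same angle, so a larger r gives a narrower notch.
    """
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    c = math.cos(w0)
    g = (1.0 - 2.0 * r * c + r * r) / (2.0 - 2.0 * c)  # normalise DC gain to 1
    b = (g, -2.0 * g * c, g)
    a = (1.0, -2.0 * r * c, r * r)
    return b, a


def biquad(b, a, x):
    """Direct-form I second-order section applied to the sequence x."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


FS = 500.0   # assumed sampling rate in Hz
N = 2000     # four seconds of samples
ecg_like = [math.sin(2.0 * math.pi * 5.0 * n / FS) for n in range(N)]     # stand-in signal
hum = [0.5 * math.sin(2.0 * math.pi * 50.0 * n / FS) for n in range(N)]  # 50 Hz interference
noisy = [s + h for s, h in zip(ecg_like, hum)]

b, a = notch_coefficients(50.0, FS)
clean = biquad(b, a, noisy)
```

After the start-up transient has decayed, the 50 Hz component is removed while the low-frequency content passes with a gain close to one. A 60 Hz mains system would use `notch_coefficients(60.0, FS)` instead.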

6. Conclusions

In this paper, we introduced the retiming approach for designing multiplierless MCM based digital filters with speed and area as the constraints. The implementation cost at the gate level is reduced by using addition, subtraction, and shift operations instead of multiplication, and by using register sharing and the register minimization retiming algorithm. Since there are still instances with which multiplierless designs cannot cope, we also proposed combinations of adder and multiplier blocks that can be used in retimed filter design for specific VLSI design constraints such as power, area, and timing. This yields the optimal clock speed and gate-level area in the design and implementation of digital filters. This paper also introduced the design architectures for the digital filter and a CAD tool for the realization of retimed digital filters, which can be either multiplierless MCM based or built with adder/subtractor, multiplier, and delay elements. The tool directly generates synthesizable filter RTL, which saves a lot of the designer's time and effort in the design cycle. The experimental results indicate that the efficiency of the retiming algorithm can be further increased by using the FPGA based path solver algorithms proposed in this paper. It was shown that realizing path solver architectures for the critical path and shortest path computations in retiming, and communicating the results to the processor where the retiming algorithm is implemented, yields a significant computation time gain compared with filter designs in which the path solver algorithms are implemented as part of the retiming algorithm in the processor. It is observed that a designer can find the synthesizable digital filter RTL that fits best in an application.


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C. Soviani, O. Tardieu, and S. A. Edwards, "Optimizing sequential cycles through Shannon decomposition and retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 456–467, 2007.

[2] S. Bommu, N. O'Neill, and M. Ciesielski, "Retiming-based factorization for sequential logic optimization," ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 3, pp. 373–398, 2000.

[3] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Transactions on Circuits and Systems, vol. 38, no. 4, pp. 358–375, 1991.

[4] D. Yagain, A. V. Krishna, and S. Chennapnoor, "Design optimization platform for synthesizable high speed digital filters using retiming technique," in Proceedings of the 10th IEEE International Conference on Semiconductor Electronics (ICSE '12), pp. 551–555, Kuala Lumpur, Malaysia, September 2012.

[5] N. Shenoy, "Retiming: theory and practice," Integration, the VLSI Journal, vol. 22, no. 1-2, pp. 1–21, 1997.

[6] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1–6, pp. 5–35, 1991.

[7] Y. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 59, no. 6, pp. 371–375, 2012.

[8] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.

[9] K. K. Parhi, "Hierarchical folding and synthesis of iterative data flow graphs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 9, pp. 597–601, 2013.

[10] X. Zhu, T. Basten, M. Geilen, and S. Stuijk, "Efficient retiming of multirate DSP algorithms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 6, pp. 831–844, 2012.

[11] N. Liveris, C. Lin, J. Wang, H. Zhou, and P. Banerjee, "Retiming for synchronous data flow graphs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '07), vol. 7, pp. 480–485, Yokohama, Japan, January 2007.

[12] N. L. Passos, E. H. Sha, and S. C. Bass, "Optimizing DSP flow graphs via schedule-based multidimensional retiming," IEEE Transactions on Signal Processing, vol. 44, no. 1, pp. 150–155, 1996.

[13] J. R. Jiang and R. K. Brayton, "Retiming and resynthesis: a complexity perspective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2674–2686, 2006.

[14] N. Maheshwari and S. Sapatnekar, "Efficient retiming of large circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 74–83, 1998.

[15] D. Yagain and A. Vijaya Krishna, "High speed digital filter design using register minimization retiming & parallel prefix adders," in Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT '12), pp. 449–453, Kolkata, India, December 2012.

[16] J. Cong and C. Wu, "An efficient algorithm for performance-optimal FPGA technology mapping with retiming," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp. 738–748, 1998.

[17] D. Yagain, A. Vijayakrishna, P. Nikhil, A. Adarsh, and S. Karthikeyan, "FPGA based path solvers for DFGs in high level synthesis," in Proceedings of the 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '12), pp. 273–278, IEEE, Beirut, Lebanon, December 2012.

[18] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, article 11, Article ID 1240234, 2007.

[19] K. Johansson, O. Gustafsson, and L. Wanhammar, "Multiple constant multiplication for digit-serial implementation of low power FIR filters," WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.

[20] A. Baliga, "Design of high-speed adders for efficient digital design blocks," ISRN Electronics, vol. 2012, Article ID 253742, 9 pages, 2012.

[21] H. D. Tiwari, G. Gankhuyag, C. M. Kim, and Y. B. Cho, "Multiplier design based on ancient Indian Vedic mathematics," in Proceedings of the International SoC Design Conference (ISOCC '08), vol. 2, pp. II65–II68, Busan, Republic of Korea, November 2008.

[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225–231, 2005.

[23] L. Aksoy, E. da Costa, P. Flores, and J. Monteiro, "Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1013–1026, 2008.

[24] M. N. Mneimneh, K. A. Sakallah, and J. Moondanos, "Preserving synchronizing sequences of sequential circuits after retiming," in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 579–584, IEEE Press, 2004.

[25] D. Yagain and K. A. Vijaya, "FIR filter design based on retiming and automation using VLSI design metrics," in Proceedings of the International Conference on Technology, Informatics, Management, Engineering and Environment (TIME-E '13), pp. 17–22, IEEE, 2013.



Page 17: Research Article Design of Synthesizable, Retimed Digital ...downloads.hindawi.com/journals/vlsi/2014/280701.pdf · processing (detection, compression, and reconstruction), modems,

VLSI Design 17

Table 6 Comparison of area delay and power for different models of various digital filters

Filter block Adder multipliers Flipflops DelayMax Freq in MHz Power in WattsModel 1 Model 2 Model 3 Model 1 Model 2 Model 3 Model 1 Model 2 Model 3

FIR-2 523 503 504 5154 19214 34062 0056 0063 0065FIR-4 1035 1105 1108 5941 10841 22204 0047 0057 0060FIR-6 727 1707 17014 6291 6764 25947 0051 0062 0064FIR-8 1559 2209 22016 5482 6592 11791 0054 0058 0065FIR-10 18611 25011 25011 4822 5637 10072 0058 0061 0063FIR-12 20713 29013 29013 4634 5486 19340 0060 0063 0067IIR-2 943 1103 1103 5503 7553 8910 0047 0050 0050IIR-4 1675 2005 1906 2278 11388 15165 0059 0062 0063IIR-6 24117 3507 3508 3871 4254 53142 0051 0059 0058IIR-8 301110 33010 3307 2946 7014 11021 0044 0064 0081IIR-10 371613 54013 54014 3643 4885 95381 0051 0067 0085IIR-12 422017 63017 63019 3973 5074 10152 0063 0071 0088

Scope 3

Noise 1

DSP

Filter 2

In 1 Out 8

ECG 1

DSP

Add 1

+

+

(a) (b)

Figure 15 Structure of ECG block for power noise removal (a) block diagram (b) filter block expanded

are often corrupted by high-frequency noises such as power-line interference electromyography (EMG) noise and instru-mentation noise An ECG is usually affected by the 5060Hznoise in the power supply lines This noise can be eliminatedby using a digital filter The model is constructed in matlaband tested for ECG signals for removing the noise Theconstructed model uses retimed multiplierless MCM filterwhich is implemented on FPGA and tested for ECG signalwhich is corrupted by power-line noise The filter efficientlyfilters out the noise and outputs the clean ECG signal TheECG noise removal block using the optimized filter structureis shown in Figure 15

6 Conclusions

In this paper we introduced the retiming approach fordesigning multiplierless MCM based digital filters withspeed and area as the constraint The implementation costat the gate level is reduced by using addition subtrac-tion and shift operations instead of multiplication and byusing register sharing and register minimization retimingalgorithm approach Since there are still instances withwhich multiplierless designs can not cope we also proposed

the combination of adder and multiplier blocks which canbe used in retimed filter design which is applicable forspecific VLSI design constraint such as power area andtiming This yields the optimal clock speed and gate-levelarea in design and implementation of digital filters Thispaper also introduced the design architectures for the digitalfilter and a CAD tool for the realization of retimed digitalfilters which can be either multiplierless MCM based orwith addersubtractor multiplier and delay elements Thistool directly gives the synthesizable filter RTL which reduceslot of designersrsquo time and effort in the design cycle Theexperimental results indicate that the retiming algorithmefficiency can be further increased by using FPGA basedpath solver algorithms proposed in this paper It was shownthat the realization of path solver architectures for solvingcritical path and shortest path in retiming computation andcommunicating the results to the processor where retimingalgorithm is implemented yields significant increase in com-putation time gain when compared to the filter designs forwhich path solver algorithms are implemented as a part ofretiming algorithm in the processor It is observed that adesigner can find the synthesizable digital filter RTL that fitsbest in an application


Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] C Soviani O Tardieu and S A Edwards ldquoOptimizing sequen-tial cycles through shannon decomposition and retimingrdquo IEEETransactions on Computer-Aided Design of Integrated Circuitsand Systems vol 26 no 3 pp 456ndash467 2007

[2] S Bommu N OrsquoNeill and M Ciesielski ldquoRetiming-based fac-torization for sequential logic optimizationrdquoACMTransactionson Design Automation of Electronic Systems vol 5 no 3 pp373ndash398 2000

[3] K K Parhi ldquoA systematic approach for design of digit-serialsignal processing architecturesrdquo IEEE Transactions on Circuitsand Systems vol 38 no 4 pp 358ndash375 1991

[4] D Yagain A V Krishna and S Chennapnoor ldquoDesign opti-mization platform for synthesizable high speed digital filtersusing retiming techniquerdquo in Proceedings of the 10th IEEEInternational Conference on Semiconductor Electronics (ICSE12) pp 551ndash555 Kuala Lumpur Malaysia September 2012

[5] N Shenoy ldquoRetiming theory and practicerdquo Integration theVLSI Journal vol 22 no 1-2 pp 1ndash21 1997

[6] C E Leiserson and J B Saxe ldquoRetiming synchronous circuitryrdquoAlgorithmica vol 6 no 1ndash6 pp 5ndash35 1991

[7] Y Tsao and K Choi ldquoArea-efficient VLSI implementation forparallel linear-phase FIR digital filters of odd length based onfast FIR algorithmrdquo IEEE Transactions on Circuits and SystemsII Express Briefs vol 59 no 6 pp 371ndash375 2012

[8] K K Parhi VLSI Digital Signal Processing Systems Design andImplementation John Wiley amp Sons 2007

[9] K K Parhi ldquoHierarchical folding and synthesis of iterativedata flow graphsrdquo IEEE Transactions on Circuits and Systems IIExpress Briefs vol 60 no 9 pp 597ndash601 2013

[10] X Zhu T Basten M Geilen and S Stuijk ldquoEfficient retimingof multirate DSP algorithmsrdquo IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems vol 31 no 6 pp831ndash844 2012

[11] N Liveris C Lin J Wang H Zhou and P Banerjee ldquoRetimingfor synchronous data flowgraphsrdquo inProceedings of the Asia andSouth Pacific Design Automation Conference (ASP-DAC 07)vol 7 pp 480ndash485 Yokohama Japan January 2007

[12] N L Passos E H Sha and S C Bass ldquoOptimizing DSP flowgraphs via schedule-based multidimensional retimingrdquo IEEETransactions on Signal Processing vol 44 no 1 pp 150ndash1551996

[13] J R Jiang and R K Brayton ldquoRetiming and resynthesis acomplexity perspectiverdquo IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems vol 25 no 12 pp2674ndash2686 2006

[14] N Maheshwari and S Sapatnekar ldquoEfficient retiming of largecircuitsrdquo IEEE Transactions on Very Large Scale Integration(VLSI) Systems vol 6 no 1 pp 74ndash83 1998

[15] D Yagain and A Vijaya Krishna ldquoHigh speed digital filterdesign using register minimization retiming amp parallel prefixaddersrdquo in Proceedings of the 3rd International Conference onEmerging Applications of Information Technology (EAIT rsquo12) pp449ndash453 Kolkata India December 2012

[16] J Cong and C Wu ldquoAn efficient algorithm for performance-optimal FPGA technologymappingwith retimingrdquo IEEETrans-actions on Computer-Aided Design of Integrated Circuits andSystems vol 17 no 9 pp 738ndash748 1998

[17] D Yagain A Vijayakrishna P Nikhil A Adarsh and SKarthikeyan ldquoFPGA based path solvers for DFGs in high levelsynthesisrdquo in Proceedings of the 2nd International Conference onAdvances in Computational Tools for Engineering Applications(ACTEA rsquo12) pp 273ndash278 IEEE Beirut Lebanon December2012

[18] Y Voronenko andM Puschel ldquoMultiplierless multiple constantmultiplicationrdquo ACM Transactions on Algorithms vol 3 no 2article 11 Article ID 1240234 2007

[19] K Johansson O Gustafsson and L Wanhammar ldquoMultipleconstant multiplication for digit-serial implementation of lowpower FIR filtersrdquoWSEAS Transactions on Circuits and Systemsvol 5 no 7 pp 1001ndash1008 2006

[20] A Baliga ldquoDesign of high-speed adders for efficient digitaldesign blocksrdquo ISRN Electronics vol 2012 Article ID 2537429 pages 2012

[21] H D Tiwari G Gankhuyag C M Kim and Y B CholdquoMultiplier design based on ancient indian vedic mathematicsrdquoin Proceedings of the International SoC Design Conference(ISOCC rsquo08) vol 2 pp II65ndashII68 Busan Republic of KoreaNovember 2008

[22] G Dimitrakopoulos and D Nikolos ldquoHigh-speed parallel-prefix VLSI ling addersrdquo IEEE Transactions on Computers vol54 no 2 pp 225ndash231 2005

[23] L Aksoy E da Costa P Flores and J Monteiro ldquoExact andapproximate algorithms for the optimization of area and delayin multiple constant multiplicationsrdquo IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems vol27 no 6 pp 1013ndash1026 2008

[24] M N Mneimneh K A Sakallah and J Moondanos ldquoPre-serving synchronizing sequences of sequential circuits afterretimingrdquo in Proceedings of the Asia and South Pacifi c DesignAutomation Conference pp 579ndash584 IEEE Press 2004

[25] D Yagain and K A Vijaya ldquoFir filter design based on retimingand automation using vlsi design metricsrdquo in Proceedings of theInternational Conference on Technology Informatics Manage-ment Engineering and Environment (TIME-E rsquo13) pp 17ndash22IEEE 2013


