IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 7, JULY 1998 821

Exhaustive Scheduling and Retiming of Digital Signal Processing Systems

Tracy C. Denk, Member, IEEE, and Keshab K. Parhi, Fellow, IEEE

Abstract—Scheduling and retiming are important techniques used in the design of hardware and software implementations of digital signal processing algorithms. In this paper, techniques are developed for generating all scheduling and retiming solutions for a strongly connected data-flow graph, allowing a designer to explore the space of possible implementations. Formulations are developed for two scheduling problems. The first scheduling problem assumes a bit-parallel target architecture. The formulation for this problem is general because it considers retiming the data-flow graph as part of scheduling, and this formulation reduces to the retiming formulation as a special case. The second scheduling problem assumes a bit-serial target architecture. Based on these formulations, the conditions for a legal scheduling solution are derived, and a systematic technique is presented for exhaustively generating all legal scheduling solutions for a strongly connected data-flow graph. Since retiming is a special case of scheduling, this systematic technique can also be used for exhaustively generating all legal retiming solutions. A technique is also developed for exhaustively generating only those bit-parallel schedules which satisfy a given set of resource constraints. The techniques for exhaustively generating scheduling and retiming solutions are demonstrated for several filters. For example, we show that a simple filter such as the biquad has 224 possible retiming solutions for a latency of one time unit. We also show that a fifth-order wave digital elliptic filter has 4.7 million and 580 million scheduling solutions for iteration periods of 17 and 18, respectively.

Index Terms—Data flow graphs, high-level synthesis, parallel architectures, retiming, scheduling, signal processing, very large scale integration.

I. INTRODUCTION

TIME scheduling and retiming [1] are important tools used to map behavioral descriptions of algorithms to physical realizations. These tools are used during the design of software for programmable digital signal processors (DSP's), during high-level synthesis of application-specific integrated circuits (ASIC's), and during the design of reconfigurable hardware such as field-programmable gate arrays (FPGA's). Time scheduling and retiming operate directly on a behavioral description of the algorithm, such as a data-flow graph (DFG). Since the decisions made at the algorithmic level tend to have greater impact on the design than those made at lower levels,

Manuscript received May 7, 1996; revised December 11, 1997. This work was supported by the Advanced Research Projects Agency and the Solid State Electronics Directorate, Wright-Patterson AFB, under Contract AF/F33615-93-C-1309. This paper was recommended by Associate Editor B. A. Shenoi.

T. C. Denk was with Bell Laboratories, Lucent Technologies, Holmdel, NJ 07733 USA. He is now with Broadcom Corporation, Irvine, CA 92618 USA.

K. K. Parhi is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA.

Publisher Item Identifier S 1057-7130(98)05058-7.

the importance of time scheduling and retiming cannot be overstated.

This paper presents new formulations of the time scheduling and retiming problems, and, based on these formulations, new techniques are developed to determine the solutions to these problems [2]. (From this point forward, we shall refer to time scheduling as simply scheduling.) These formulations are valid for strongly connected (SC) graphs, where a strongly connected graph has a path from u to v and a path from v to u for every pair of nodes (u, v) in the graph. We focus on SC graphs because these graphs traditionally present the greatest challenges when they are mapped to physical realizations due to the feedback present in the graphs. An example of an SC DFG is the fifth-order wave digital elliptic filter [3] in Fig. 15, which is commonly used as a benchmark for demonstrating high-level synthesis techniques.

Scheduling consists of assigning execution times to the operations in a DFG such that the precedence constraints of the DFG are not violated. A great deal of literature exists on the topic of scheduling in the context of high-level synthesis for ASIC design for DSP applications [4]–[20]; however, none of these works gives a formal definition of scheduling along with systematic techniques for exhaustively generating the solutions to the scheduling problem. Integer linear programming (ILP) techniques use formal definitions of scheduling [10], [11], [17], but these techniques generate only one solution. This paper presents new scheduling formulations and algorithms for exhaustively generating all of the solutions to the scheduling problem. Generating all scheduling solutions is a theoretically interesting result which can also have practical applications. Two scheduling problems are considered in this paper, namely, scheduling for time-multiplexed execution on bit-parallel architectures and scheduling for execution on bit-serial architectures.

Retiming consists of moving delays around in a DFG without changing its functionality. As with scheduling, there is a huge body of literature on retiming, and new applications for retiming are constantly being found. For example, due to the recent demand for low-power digital circuits in portable devices, some recent work has focused on retiming for power minimization [21]. The groundbreaking paper on retiming [1] describes algorithms for tasks such as retiming to minimize the clock period and retiming to minimize the number of registers (states) in the retimed circuit. An approach to retiming which is based on circuit theory can be used to generate all retiming solutions for a DFG [22]. This approach was the motivation for our work on exhaustive scheduling. In this paper, we show

1057–7130/98$10.00 1998 IEEE


that retiming is a special case of scheduling, and consequently, the formulation of the scheduling problem and the techniques for exhaustively generating the scheduling solutions can also be applied to retiming.

The impact of the formulations derived in this paper is as follows.

• The interaction between retiming and scheduling is important [7], and our formulations provide a simple way of observing this interaction.

• We show that retiming is a special case of scheduling.

• We derive mathematical descriptions of the scheduling and retiming problems in a common framework.

• We develop techniques for generating all solutions to a particular scheduling or retiming problem. This gives a developer the ability to search the design space for the best solution, particularly when various parameters are difficult to model and include in a cost function. This has applications to software design, ASIC design, and design for reconfigurable hardware implementations.

• Our formulations provide for a better understanding of scheduling and retiming which can be used to develop new heuristics for these problems.

Many of the results in this paper rely upon graph theory. Section II gives a review of some results from graph theory along with an algorithm for finding the linearly independent loops in an SC directed graph. Our formulations for scheduling to bit-parallel and bit-serial architectures are given in Section III along with an explanation of how retiming can be viewed as a special case of scheduling. Section IV contains the description of a systematic technique used to exhaustively generate the scheduling and retiming solutions. Section V describes two techniques for exhaustively generating the schedules which satisfy a given set of resource constraints for a bit-parallel architecture. Section V also includes the results of scheduling the fifth-order wave-digital elliptic filter in Fig. 15 with and without resource constraints. Our conclusions are given in Section VI.

II. REVIEW OF GRAPH THEORY

This section provides a brief review of graph theory. Most of the definitions and results in this section can be found in [23].

In this paper, we are concerned only with directed graphs. A directed graph is represented as G = (V, E), where the following is true.

• V is the set of vertices (nodes) of G. The vertices represent computations. The number of nodes in G is |V|.

• E is the set of directed edges of G. A directed edge e from node u to node v is denoted as e: u -> v. The edges represent communication between the nodes. The number of edges in G is |E|.

• w(e) is the number of delays on the edge e, also referred to as the weight of the edge.

• t(v) is the computation time of the node v.

A directed path from v_1 to v_k is denoted as v_1 -> v_2 -> ... -> v_k. A simple path is a path with distinct edges, and an elementary path has distinct nodes. A cycle is a closed path (i.e., v_1 = v_k). A simple cycle has distinct edges and an elementary cycle has distinct nodes. An elementary cycle in a directed graph will be referred to as a "loop" in this paper.

A directed graph is strongly connected if, for every pair of vertices (u, v), there exists a path from u to v and a path from v to u. A directed spanning tree T is a subgraph of G which has a root node ρ and a path from ρ to v for all v in V except v = ρ. The directed spanning tree contains no cycles. A directed spanning tree contains exactly |V| nodes and |V| - 1 edges. An edge of a directed spanning tree is called a branch, and the edges of G not included in the tree are called links. Every SC graph contains a directed spanning tree.
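The branch/link partition above can be sketched in Python. The five-edge graph below is a toy example of our own (not the graph of Fig. 1), and a BFS tree from the chosen root stands in for any directed spanning tree rooted at ρ:

```python
from collections import deque

# Toy SC graph (our own example): nodes 0..2, directed edges as (u, v).
edges = [(0, 1), (1, 2), (2, 0), (1, 0), (2, 1)]
n = 3
adj = {}
for idx, (u, v) in enumerate(edges):
    adj.setdefault(u, []).append((idx, v))

# BFS from the root yields a directed spanning tree: one branch per
# non-root node, each reached by a path from the root.
root, parent_edge = 0, {}
seen, q = {root}, deque([root])
while q:
    u = q.popleft()
    for idx, v in adj.get(u, []):
        if v not in seen:
            seen.add(v)
            parent_edge[v] = idx   # the branch that first reaches v
            q.append(v)

branches = sorted(parent_edge.values())
links = [i for i in range(len(edges)) if i not in branches]
assert len(seen) == n              # strong connectivity: every node reached
assert len(branches) == n - 1      # a directed spanning tree has |V| - 1 edges
assert len(links) == len(edges) - n + 1
print(branches, links)  # [0, 1] [2, 3, 4]
```

The link count |E| - |V| + 1 is exactly the number of linearly independent loops discussed below, which is why the branch/link split matters.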

An edge e from u to v is incident with vertices u and v. More specifically, e is incident from u and incident into v.

The set operations such as union, intersection, difference, complement, etc., are operations on the edges of a graph. Let G_1 and G_2 be two subgraphs of a connected graph G. G_1 ∪ G_2 consists of all edges in G_1 or G_2 (or both) and the vertices incident with these edges. G_1 - G_2 is formed by removing all edges in G_2 from G_1, and then removing all vertices with no incident edges.

Graphs can be represented using matrices. In this paper, vectors are represented using bold lowercase letters and matrices are represented using bold uppercase letters. The i-th element of a vector a is denoted as a_i or (a)_i, and the (i, j)-th element of a matrix A is denoted as (A)_{ij}. The m × n matrix full of zeros is denoted as 0_{m×n}, or simply as 0 when its dimensions do not need to be explicitly stated, and similar notation is used for the m × n matrix full of ones, 1_{m×n}.

Let A_c be the oriented incidence matrix of G. This matrix, which has dimensions |V| × |E|, is defined as

  (A_c)_{ij} = 1 if edge j is incident from node i
  (A_c)_{ij} = -1 if edge j is incident into node i
  (A_c)_{ij} = 0 if edge j is not incident with node i

and 1^T A_c = 0^T. The reduced oriented incidence matrix A is defined to be any |V| - 1 rows of A_c. A has dimensions (|V| - 1) × |E| and rank |V| - 1.

Let B be the fundamental loop matrix of the SC graph G. This matrix, which has dimensions (|E| - |V| + 1) × |E|, is defined as

  (B)_{ij} = 1 if edge j is in loop i
  (B)_{ij} = 0 otherwise.

The |E| - |V| + 1 rows of B are linearly independent, and the loops in G which are represented by these rows are referred to as the linearly independent loops in G. The remaining loops in G, which are not represented by the rows of B, are said to be the linearly dependent loops in G because these loops can be represented as linear combinations of the rows of B. An SC graph contains exactly |E| - |V| + 1 linearly independent loops.

Two important relationships between the fundamental loop matrix and the oriented incidence matrix are B A_c^T = 0 and B A^T = 0.

Each of the |E| - |V| + 1 rows of the fundamental loop matrix corresponds to a linearly independent loop. Algorithm Find Fundamental Loops (FFL) below can be used to find the |E| - |V| + 1 linearly independent loops of a strongly connected graph G. Let T be a directed spanning tree of G, where ρ is the root node of T, i.e., there is a path from ρ to v for all v in V except v = ρ.

Algorithm FFL

  G_R(0) <- {ρ};
  FOR k = 1 TO |E| - |V| + 1
    STEP 1: l_k <- a link in G which is incident into a node in G_R(k - 1);
    STEP 2: loop_k <- the loop in l_k ∪ T which contains l_k;
    STEP 3: G_R(k) <- G_R(k - 1) ∪ loop_k;

The loops denoted as loop_k, k = 1, 2, ..., |E| - |V| + 1, form a basis for the loops in the strongly connected graph G.

Algorithm FFL maintains a subgraph G_R which initially consists of the root node ρ of the directed spanning tree T. During iteration k, a link l_k in G which is incident into a node in G_R(k - 1) is chosen in STEP 1. In STEP 2, a loop denoted as loop_k is found in the subgraph consisting of the link l_k and the edges in T. G_R is then updated at the end of the iteration. The most computationally demanding step in this algorithm is finding loop_k in STEP 2, which requires O(|V| + |E|) operations [24]. Therefore, Algorithm FFL can be computed in polynomial time.

We construct the fundamental loop matrix B by letting loop_k from Algorithm FFL be the k-th row of B. The edges in the graph are numbered such that the first |V| - 1 columns of B correspond to the branches of the spanning tree T of G, and the remaining |E| - |V| + 1 columns correspond to the links. The link l_k is assigned to the (|V| - 1 + k)-th column of B. By constructing the fundamental loop matrix in this manner, it has the form

  B = [ B_t  B_l ]   (1)

where B_t is an (|E| - |V| + 1) × (|V| - 1) matrix and B_l is an (|E| - |V| + 1) × (|E| - |V| + 1) lower triangular matrix with ones on the diagonal. Note that the columns of B_l correspond to the links of G while the columns of B_t correspond to the branches of T. Because of its form, B has rank |E| - |V| + 1. Adding more loops of G to B (adding a loop would consist of adding a row to B) does not increase its rank. Therefore, the rows of B form a basis for the loops of G.

An alternative method of constructing B is to find all of the loops in G, using an algorithm such as the one given in [24], and then choosing |E| - |V| + 1 linearly independent loops from these as the rows of B.
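The alternative construction can be sketched in Python. The graph below is a toy example of our own, the cycle enumeration is a simple DFS rather than the algorithm of [24], and exact arithmetic is used to select independent rows:

```python
from fractions import Fraction

# Toy strongly connected graph (our own, not Fig. 1): nodes 0..2.
edges = [(0, 1), (1, 2), (2, 0), (1, 0), (2, 1)]
n, m = 3, len(edges)

def elementary_cycles(edges, n):
    """Enumerate elementary (node-distinct) directed cycles as edge sets."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((idx, v))
    cycles = []
    def dfs(start, node, path_nodes, path_edges):
        for idx, nxt in adj.get(node, []):
            if nxt == start:
                cycles.append(frozenset(path_edges + [idx]))
            elif nxt > start and nxt not in path_nodes:
                dfs(start, nxt, path_nodes | {nxt}, path_edges + [idx])
    for s in range(n):
        dfs(s, s, {s}, [])
    return sorted(set(cycles), key=sorted)

def independent_rows(rows, want):
    """Incremental Gaussian elimination: keep linearly independent rows."""
    basis, kept = [], []
    for r in rows:
        v = [Fraction(x) for x in r]
        for b in basis:
            piv = next(i for i, x in enumerate(b) if x != 0)
            if v[piv] != 0:
                f = v[piv] / b[piv]
                v = [x - f * y for x, y in zip(v, b)]
        if any(x != 0 for x in v):
            basis.append(v)
            kept.append(r)
        if len(kept) == want:
            break
    return kept

loops = elementary_cycles(edges, n)
rows = [[1 if j in c else 0 for j in range(m)] for c in loops]
B = independent_rows(rows, m - n + 1)
assert len(B) == m - n + 1         # exactly |E| - |V| + 1 independent loops

# Oriented incidence matrix: +1 where edge j leaves node i, -1 where it enters.
A_c = [[0] * m for _ in range(n)]
for j, (u, v) in enumerate(edges):
    A_c[u][j] += 1
    A_c[v][j] -= 1
for row in B:                       # verify B A_c^T = 0
    assert all(sum(r * a for r, a in zip(row, A_c[i])) == 0 for i in range(n))
```

Each directed loop passes through every one of its nodes exactly once on the way in and once on the way out, which is why each 0/1 loop row annihilates every row of A_c.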

Example 2.1: This example uses Algorithm FFL to form the fundamental loop matrix B for the graph in Fig. 1. The spanning tree T with node 1 as the root node is shown in Fig. 2(a). At the start of Algorithm FFL, G_R(0) is node 1.

Fig. 1. An SC graph. The branches of a spanning tree are shown with solid lines, while the links of the corresponding cotree are shown with dashed lines.

During iteration k = 1, the only possibility for link l_1 is edge 6, and loop_1 is the loop containing edge 6; G_R(1) is circled in Fig. 2(b). During iteration k = 2, there are two possibilities for link l_2, namely, edges 7 and 8. Choosing edge 7 as l_2 results in loop_2; G_R(2) is circled in Fig. 2(c). During iteration k = 3, the two possibilities for link l_3 are edges 8 and 9. Choosing edge 8 as l_3 results in loop_3; G_R(3) is circled in Fig. 2(d). During iteration k = 4, link l_4 is edge 9, which determines loop_4. The resulting fundamental loop matrix B has the desired form as given in (1); row k corresponds to loop_k from Algorithm FFL and column j corresponds to edge j of G.

III. SCHEDULING AND RETIMING FORMULATIONS

Time scheduling (or simply scheduling) consists of assigning execution times to the operations in a DFG such that the precedence constraints of the DFG are not violated. This section considers two scheduling problems, namely, scheduling to a time-multiplexed bit-parallel target architecture (we call this bit-parallel scheduling) and scheduling to a bit-serial target architecture (we call this bit-serial scheduling). It turns out that the bit-parallel and bit-serial scheduling formulations are quite similar, and the retiming formulation is a special case of bit-parallel scheduling.

A. Bit-Parallel Scheduling

In bit-parallel scheduling, a DFG is statically scheduled to a bit-parallel target architecture. The scheduling formulation presented in this section is based on the folding equation developed in [25]. Folding is the process of executing several algorithm operations on a single hardware module. Scheduling is the process of determining at which time units a given algorithm operation is to be executed in hardware.

Before the scheduling formulation is developed, we need a brief description of retiming. The basic retiming equation for the edge e: u -> v is [1]

  w_r(e) = w(e) + r(v) - r(u)   (2)


Fig. 2. The four steps of Algorithm FFL which finds the four fundamental loops of the graph shown in Fig. 1. For each iteration k, the subgraph G_R(k) is circled.

where w(e) is the number of delays on the edge before retiming, w_r(e) is the number of delays on the edge after retiming, and r(u) and r(v) are the retiming values of nodes u and v, respectively.

The notions of an iteration and an iteration period are used in this section. An iteration is defined as the execution of each node in the DFG exactly once. The iteration period is defined as the number of clock cycles used to execute one iteration of the DFG in hardware.

Consider an edge from node u to node v, denoted as e: u -> v. The operations (nodes) in the DFG are scheduled to be executed in the folded architecture once every N clock cycles, where N is the iteration period. Let the l-th iteration of nodes u and v be executed in hardware at time units Nl + T_u and Nl + T_v, respectively, where T_u and T_v are the time partitions to which the nodes are scheduled to execute such that 0 <= T_u, T_v <= N - 1. Let edge e have w_r(e) delays, which means that the result of the l-th iteration of node u is used by the (l + w_r(e))-th iteration of node v. The hardware modules which execute nodes u and v are denoted as H_u and H_v, respectively. If H_u is pipelined by P_u stages, then the result of the l-th iteration of node u is available at Nl + T_u + P_u. This sample is used by the (l + w_r(e))-th iteration of node v, which is executed by H_v at N(l + w_r(e)) + T_v, so the sample must be stored for

  D_F(e) = N w_r(e) + T_v - T_u - P_u

clock cycles. Substituting for w_r(e) using (2) gives

  D_F(e) = N w(e) + (N r(v) + T_v) - (N r(u) + T_u) - P_u.   (3)

The edge e: u -> v with w(e) delays in the DFG maps to an edge from H_u to H_v with D_F(e) delays in the architecture, and the data on this edge are switched into H_v at time units Nl + T_v.

Note that we assume that the hardware module H_u is pipelined by P_u = t(u) delays, where t(u) is the computation time of the node u in the DFG. If we define an |E| × 1 vector p whose e-th element is the computation time of the source node of edge e (the source node of an edge is the node that the edge is incident from), then the folding equation can be written for all edges of the DFG simultaneously using

  d_F = N w - A_c^T (N r + t) - p   (4)

where A_c is the incidence matrix for the graph (see Section II), t is the time partition vector which assigns node u to the time partition T_u, r is the retiming vector with the retiming values of the nodes in G, w contains the number of delays on each edge of G, d_F is the folding vector which contains the number of delays on each edge of the folded architecture, and p is the delay vector as previously described. This formulation of folding is general because it relies upon the retiming solution r and the time partition vector t. One way to view this is that the DFG is preprocessed using retiming (hence the r vector) and then scheduling is performed on the retimed DFG (hence, the t vector). Combining r and t using


s = N r + t results in the schedule vector s. Using s, the scheduling problem can be written as

  A_c^T s = N w - p - d_F.   (5)

The rank of the incidence matrix A_c is |V| - 1. Therefore, the left nullspace of A_c must consist of a vector q which satisfies q^T A_c = 0^T. We can see that q = 1 because each column of A_c contains exactly one entry which is a 1, one entry which is a -1, and the remaining entries of the column are zero.

Using the relationship 1^T A_c = 0^T, we can write

  A_c^T (s + c 1) = A_c^T s

which means that adding the constant c to each element of the schedule vector does not change the number of delays on the edges of the folded architecture.
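A quick numerical check of this invariance on a two-node toy graph (our own example, not from the paper):

```python
# Oriented incidence matrix of a two-node loop: e0: 0->1, e1: 1->0.
A_c = [[1, -1], [-1, 1]]

def At_s(s):
    """Compute A_c^T s: entry j is s(source of edge j) - s(sink of edge j)."""
    return [sum(A_c[i][j] * s[i] for i in range(2)) for j in range(2)]

s, c = [3, 5], 7
# Shifting every schedule value by a constant leaves A_c^T s unchanged,
# because each column of A_c sums to zero.
assert At_s(s) == At_s([x + c for x in s])
print(At_s(s))  # [-2, 2]
```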

The incidence matrix can be written as

  A_c = [ a_1 ; a_2 ; ... ; a_{|V|} ]

where a_i is the i-th row of A_c. The reduced incidence matrix A consists of any |V| - 1 rows of A_c. Removing row |V| of A_c results in

  A = [ a_1 ; a_2 ; ... ; a_{|V|-1} ].   (6)

The reduced incidence matrix A has dimensions (|V| - 1) × |E| and rank |V| - 1. The reduced scheduling vector is defined as

  s_r = [ s_1  s_2  ...  s_{|V|-1} ]^T   (7)

which can be written as s_r = N r_r + t_r, where t_r and r_r are the time partition vector and the retiming vector with the |V|-th elements removed. Using 1^T A_c = 0^T and a_{|V|} = -1^T A, we can write

  A_c^T s = A^T s_r + a_{|V|}^T s_{|V|} = A^T (s_r - s_{|V|} 1).

Substituting this into (5) results in

  A^T (s_r - s_{|V|} 1) = N w - p - d_F.   (8)

Node |V| is called the reference node. Since replacing s by s + c 1 does not alter the resulting folded architecture, we can choose c = -s_{|V|} so s_{|V|} = 0. After replacing s with s - s_{|V|} 1, (8) becomes A^T s_r = N w - p - d_F.

Throughout the remainder of this paper, we will assume that s_{|V|} = 0 so s = [ s_r^T 0 ]^T. In an abuse of notation, we will refer to s_r simply as s so that (5) can be written as

  A^T s = N w - p - d_F.   (9)

Lemma 3.1: Equation (9) can be solved for s if and only if B (N w - p - d_F) = 0.

Proof: Equation (9) has a solution if and only if N w - p - d_F is in the (|V| - 1)-dimensional row space of A. Equivalently, (9) has a solution if and only if N w - p - d_F is perpendicular to the (|E| - |V| + 1)-dimensional nullspace of A because the nullspace is the orthogonal complement of the row space in the space of all vectors with |E| real components. Since B A^T = 0, the |E| - |V| + 1 rows of the fundamental loop matrix B form a basis for the nullspace of A. Therefore, (9) has a solution if and only if B (N w - p - d_F) = 0.

To understand the meaning of B (N w - p - d_F) = 0, we begin by writing B as

  B = [ b_1^T ; b_2^T ; ... ; b_{|E|-|V|+1}^T ]

such that b_i^T is the i-th row of B. Using this, B (N w - p - d_F) = 0 implies b_i^T (N w - p - d_F) = 0 for i = 1, 2, ..., |E| - |V| + 1. Recall that (B)_{ij} = 1 if edge j is in loop i and (B)_{ij} = 0 otherwise. Therefore, b_i^T d_F is the total number of folded delays on loop i, and b_i^T (N w - p) is a constant that depends on G. The equation b_i^T d_F = b_i^T (N w - p) states that the number of folded delays on loop i is the same for any legal folding vector d_F, and B (N w - p - d_F) = 0 implies that this is true for all |E| - |V| + 1 linearly independent loops of G represented by the rows of B. Furthermore, the sum of the number of folded delays for all edges and pipelining delays associated with all nodes of a loop is the product of the folding factor N and the number of loop delay elements, as noted in [25]. It can also be shown that this holds for the dependent loops of G, i.e., the number of folded delays on each loop of G that is not represented by a row of B is the same for any legal folding vector d_F.

If B (N w - p - d_F) = 0 holds, (9) has exactly one solution for s, which is given by

  s = (A A^T)^{-1} A (N w - p - d_F).   (10)

The above discussion can be summarized by saying that the number of folded delays on each loop in G is the same for any valid schedule s.

In addition to the condition B (N w - p - d_F) = 0, there is also the practical condition that the number of delays on an edge in the folded architecture must be nonnegative. This condition can be written as d_F >= 0. The constraints for a valid schedule are:

1) B (N w - p - d_F) = 0;
2) d_F >= 0.
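These two constraints, together with the unique solution (10), can be checked numerically. The sketch below uses a two-node toy DFG of our own construction (not one of the paper's benchmark filters):

```python
from fractions import Fraction

# Toy DFG (our own): nodes u=0, v=1; edge e0: u->v with w=0 delays,
# e1: v->u with w=1 delay. Computation times t(u)=t(v)=1, so the pipeline
# vector has p_e = t(source(e)).
N = 2                     # iteration period
w = [0, 1]                # delays on e0, e1
p = [1, 1]                # pipelining stages of each edge's source module
B = [[1, 1]]              # the single loop contains both edges
A = [[1, -1]]             # reduced incidence matrix (node v is the reference)

d_F = [0, 0]              # candidate folding vector

# Constraint 1: B (N w - p - d_F) = 0, the loop delay count is fixed.
b = [N * wi - pi - di for wi, pi, di in zip(w, p, d_F)]
assert all(sum(B[i][j] * b[j] for j in range(2)) == 0 for i in range(len(B)))
# Constraint 2: d_F >= 0, no negative storage.
assert all(di >= 0 for di in d_F)

# Unique schedule from (10): s = (A A^T)^{-1} A (N w - p - d_F).
AAt = sum(A[0][j] * A[0][j] for j in range(2))      # A A^T is 1x1 here
s = [Fraction(sum(A[0][j] * b[j] for j in range(2)), AAt)]
assert [A[0][j] * s[0] for j in range(2)] == b      # check A^T s = b
print(s)  # [Fraction(-1, 1)]
```

The schedule value may come out negative; only differences of schedule values matter, since a constant can be added to s without changing d_F.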

B. Retiming

Retiming is the process of moving delays around in a circuit without changing the functionality of the circuit [1]. A brief description of retiming is given in Section III-A. This section describes how retiming can be viewed as a special case of bit-parallel scheduling.

The folding equation for a graph G is given in (4). If each node in G represents a hardware operator, then all operations in the graph are executed in a single clock cycle, resulting in an iteration period of N = 1. The elements of the time partition vector t are all zero because time partition zero is the only


available partition. If we let p = 0, i.e., we do not consider any internal pipelining of the operators, (4) becomes

  d_F = w - A_c^T (r + 0) - 0

which simplifies to

  d_F = w - A_c^T r.   (11)

Since d_F is the number of delays in the folded architecture, d_F is equivalent to w_r for N = 1, so (11) becomes

  w_r = w - A_c^T r   (12)

which is simply the matrix notation for writing (2) simultaneously for all edges of the graph. This demonstrates that retiming is simply scheduling when the iteration period is unity.

Using 1^T A_c = 0^T, (12) can be written as

  w_r = w - A_c^T (r + c 1).

If r is a retiming vector which maps the graph G to the retimed graph G_r, then so is r + c 1 for any integer c.

In the context of retiming (i.e., assuming N = 1, t = 0, p = 0, and d_F = w_r), (9) can be written as

  A^T r = w - w_r.   (13)

Recall that (9) assumes that s_{|V|} = 0. Since s = N r + t and t = 0 is assumed to obtain (13), this implies that r_{|V|} = 0 in (13). In other words, the retiming value of the reference node is 0 in this formulation.

The translation of Lemma 3.1 to the retiming context is that (13) has a solution if and only if B (w - w_r) = 0 holds. This implies that the number of delays on any loop in G remains unchanged during retiming, as noted in [1]. If B (w - w_r) = 0 holds, (13) has exactly one solution for r, which is given by

  r = (A A^T)^{-1} A (w - w_r).   (14)

In addition to the condition B (w - w_r) = 0, there is also the practical condition that the number of delays on an edge in the retimed graph must be nonnegative. This condition can be written as w_r >= 0. The conditions for a valid retiming from G to G_r are:

1) B (w - w_r) = 0;
2) w_r >= 0.
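The two retiming conditions can be illustrated on a toy three-node loop (our own example; the retiming values are chosen by hand):

```python
# Toy DFG (our own): edges e0: 0->1, e1: 1->2, e2: 2->0 with w = [2, 0, 0].
edges = [(0, 1), (1, 2), (2, 0)]
w = [2, 0, 0]
r = [0, -1, -1]           # retiming values; node 0 is the reference (r = 0)

# Edge-by-edge retiming equation (2): w_r(e) = w(e) + r(v) - r(u).
w_r = [w[e] + r[v] - r[u] for e, (u, v) in enumerate(edges)]

B = [[1, 1, 1]]           # the single loop uses all three edges
# Condition 1: B (w - w_r) = 0, loop delay count unchanged by retiming.
assert sum(B[0][e] * (w[e] - w_r[e]) for e in range(3)) == 0
# Condition 2: w_r >= 0, no edge may end up with negative delays.
assert all(x >= 0 for x in w_r)
print(w_r)  # [1, 0, 1]
```

Here retiming spreads the two delays of edge e0 around the loop, which is exactly the kind of move exhaustive retiming enumerates.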

C. Bit-Serial Scheduling

In this section, a scheduling formulation is developed where the target architecture is a bit-serial architecture. This formulation, which is similar to the formulation in [26, Ch. 6], has the same general form as the retiming and the bit-parallel scheduling formulations in Sections III-A and III-B.

A bit-serial operator is often represented using a timing diagram such as the one in Fig. 3. Let the execution of operator A in this figure begin at time T_A. The first bit of each input arrives a fixed number of time units after T_A (the input latency of that input), and the first bit of each output is produced a fixed number of time units after T_A (the output latency of that output). In other words, the timing diagram gives the relative differences between the timing of the input and output samples of the operator.

Fig. 3. The timing diagram for the bit-serial operator A.

Fig. 4. (a) The architecture for a bit-serial adder for wordlength W. (b) The timing diagram for this architecture.

Fig. 5. An edge e: u -> v with w_r(e) delays.

Example 3.1: For the bit-serial adder in Fig. 4(a), which computes the sum of its two inputs, the timing diagram is shown in Fig. 4(b). Note that W is the wordlength.

The constraints for the bit-serial scheduling problem can be derived using the timing diagram. Consider the edge e: u -> v with w_r(e) delays in Fig. 5. The output of iteration l of u is used as the input of iteration l + w_r(e) of v. Let the l-th iteration of nodes u and v begin execution at time units Wl + T_u and Wl + T_v, respectively, where W is the data wordlength and T_u and T_v are the time partitions to which the nodes are scheduled to execute such that 0 <= T_u, T_v <= W - 1. The output of the l-th iteration of u is available at Wl + T_u + o(e), and the output of the l-th iteration of u is consumed at W(l + w_r(e)) + T_v + i(e), so the result must be stored for

  D_S(e) = W w_r(e) + T_v + i(e) - T_u - o(e)

clock cycles.

This equation can be written for all edges of the graph simultaneously according to

  d_S = W w_r - A_c^T t + i - o   (15)


DENK AND PARHI: EXHAUSTIVE SCHEDULING AND RETIMING OF DIGITAL SIGNAL PROCESSING SYSTEMS 827

where:

• C is the incidence matrix for the graph;
• p is the time partition vector, which assigns node v to time partition p(v), where 0 <= p(v) <= W - 1;
• o is defined such that o(e) is the timing-diagram value at the source of edge e in the graph;
• i is defined such that i(e) is the timing-diagram value at the sink of edge e in the graph;
• w_r contains the number of delays on each edge of the retimed DFG;
• D_s contains the number of serial delays on each edge of the hardware implementation.

The bit-serial folding equation (15) operates on the retimed DFG. Substituting (12) into (15) results in

D_s = W w + W C^T r + C^T p + i - o.

Combining W r + p and using s = W r + p results in

D_s = W w + C^T s + i - o.

This equation can be rewritten as

(16)

where the reduced incidence matrix and reduced schedule vector are defined as in (6) and (7), and the scheduling value for the reference node is zero.

Using the same argument as in Lemma 3.1, it can be shown that the bit-serial scheduling equation (16) has a solution if and only if B D_s = B (W w + i - o). This equation states that the sum of the serial delays in any loop of the hardware implementation is the same for any valid serial delay vector. In addition, the sum of the number of serial delay elements on all edges and the latencies associated with all nodes in a loop is the same as the product of the wordlength and the number of loop delay elements.

A second constraint, D_s >= 0, exists because a connection in hardware cannot have a negative number of delays. The constraints for a valid bit-serial schedule are:

1) B D_s = B (W w + i - o);
2) D_s >= 0.

The value of the schedule vector s can be found by solving (16), using

(17)

IV. GENERATING ALL SCHEDULING AND RETIMING SOLUTIONS

A. Generating All Bit-Parallel Scheduling Solutions

Based on the two constraints, namely the loop constraint on the folded weight vector f and the nonnegativity constraint f >= 0, all scheduling solutions for a strongly connected DFG can be generated. A systematic technique for generating these solutions is presented in this section.

Recall that B is the fundamental loop matrix, which can be expressed as B = [B_t | B_l], where B_t is the submatrix whose columns correspond to the branches and B_l is a square lower triangular matrix with ones on the diagonal. The columns of B_t correspond to the branches of the spanning tree T of G, which is chosen before Algorithm FFL is used to find B, and the columns of B_l correspond to the links of G. The rows of B correspond to the linearly independent fundamental loops in G.

The algorithm for generating all scheduling solutions requires an interval to be written for the folded weight of each branch of T and an equality to be written for the folded weight of each link of G. The interval for the folded weight of a branch gives the range of possible values for the number of folded delays for this branch in the folded architecture. The equality for the folded weight of a link gives an expression for the number of delays for the link in the folded architecture. Using these intervals and equalities, code can be constructed to generate all possible scheduling solutions.

To determine these intervals and equalities, the elements of the fundamental loop matrix are examined one-by-one in a row-by-row manner, starting at the top-left of the matrix. Each time a "1" is encountered in the B_t submatrix of B such that this "1" is the first "1" encountered in its column, an interval is specified for this branch. This interval, which represents the range for the number of folded delays for the branch in the folded architecture, takes into account the intervals and equalities previously determined in the row-by-row scan of B.

Assume that the first "1" in column j of B is in row i, i.e., b_ij = 1 and b_kj = 0 for all k < i. Let m denote any row of B such that b_mj = 1, i.e., loop m is a fundamental loop that contains edge j. Since b_ij is the first "1" in column j, m >= i must hold, i.e., b_mj is in row i or in a row which is below row i. From the loop constraint, we get

f_j = c_m - sum of f_k over k != j with b_mk = 1,    (18)

where c_m is the required folded-delay sum for loop m. Let S_j denote the set of edges encountered before reaching the element b_ij in the row-by-row scan of B. Mathematically, S_j is the set of edges k such that there exists an element b_pk = 1 which is scanned before b_ij. Using S_j, we can rewrite (18) as

f_j = c_m - sum of f_k over k in S_j with b_mk = 1 - sum of f_k over k not in S_j, k != j, with b_mk = 1.    (19)

The intervals and equalities for the edges not in S_j have not yet been determined; however, we do know from the nonnegativity constraint that f_k >= 0 for these edges. Using this in (19) results in

f_j <= c_m - sum of f_k over k in S_j with b_mk = 1.


Using this along with f_j >= 0 specifies the interval for f_j:

0 <= f_j <= c_m - sum of f_k over k in S_j with b_mk = 1,    (20)

which must hold for all m such that b_mj = 1 (c_m is the required folded-delay sum for loop m).

Because the matrix B_l in B = [B_t | B_l] is lower triangular with ones on the diagonal, the diagonal element of row i of B_l is always the first "1" encountered in its column during the row-by-row scan of B. When this diagonal element is encountered, an equality is written for the corresponding link weight based on the loop equation for row i. This equality, which uses the fact that the intervals and equalities have already been determined for all edges in loop i except the link itself, is

f_link(i) = c_i - sum of f_k over k with b_ik = 1, k != link(i).    (21)

To summarize the above discussion, the matrix B is scanned in a row-by-row manner starting at the top-left element. When an element b_ij = 1 is encountered that is the first "1" in a column of B_t, the interval in (20) is written for all m such that b_mj = 1. When a diagonal element of B_l is encountered, the equality in (21) is written.

The intervals for the branches of T are denoted as [lo_j, hi_j] for each branch j. An algorithm, written in pseudocode, for determining these intervals for the branches and the equalities for the links is given below. At any point in this algorithm, S is the set of edges in G whose intervals or equalities have previously been determined. Comments in the algorithm are written using the convention of the C programming language. See Algorithm IE at the bottom of the page, where the indicator term used in the algorithm is 1 if its condition holds and 0 otherwise.

Fig. 6. The data-flow graph used in Example 4.2.

From the intervals and equalities, code can be written to enumerate all possible scheduling solutions. The general structure of the code is the following.

1) Write FOR loops for the intervals and write assignment statements for the equalities, in the same order that these intervals and equalities are generated in Algorithm IE.

2) Test the link weights for nonnegativity. If the link weights pass this test, the edge weights represent a valid scheduling solution.

This technique generates all possible scheduling solutions because the FOR loop for each branch assigns every integer value which is legal under the loop constraint and the nonnegativity constraint, while taking into consideration the values of the edge weights which are already contained in a FOR loop or an assignment statement.

Example 4.1: In this example, we find all scheduling solutions for the DFG in Fig. 6, assuming an iteration period of 4 and assuming that the computation time for each node is unity.

Algorithm IE (Intervals and Equalities)

FOR (i = 1 TO number of rows of B) /* for each row */
  FOR (j = 1 TO number of columns of B) /* for each column */
    IF (b_ij == 1 AND edge j is not in S) /* if b_ij is the first "1" in column j */
      IF (j corresponds to a branch) /* if b_ij is in submatrix B_t */
        write the interval (20) for f_j; /* write interval */
        add edge j to S; /* add edge to S */
      ELSE /* if b_ij is in submatrix B_l */
        write the equality (21) for f_j; /* write equality */
        add edge j to S; /* add edge to S */


TABLE I
TWELVE VALID SCHEDULING SOLUTIONS FOR THE DFG IN FIG. 6


Using Algorithm IE gives the intervals and equalities

The code for finding all scheduling solutions is shown at the bottom of the page.

There are 12 scheduling solutions for this DFG. The scheduling vector s can be computed from the folded edge vector using (10). Using node 1 as the reference node, the folded edge weights and the scheduling values for the nodes are listed in Table I.

for (f1 = ...; f1 <= ...; f1++)
 for (f2 = ...; f2 <= ...; f2++)
  f6 = ...;
  for (f3 = ...; f3 <= ...; f3++)
   f7 = ...;
   for (f4 = ...; f4 <= ...; f4++)
    f8 = ...;
    for (f5 = ...; f5 <= ...; f5++)
     f9 = ...;
     if (f6 >= 0 AND f7 >= 0 AND f8 >= 0 AND f9 >= 0)
      print the values of f1 through f9 and s1 through s6


Once all possible folding vectors f have been found and the corresponding s vectors have been computed using (10), the r and p vectors can be found from s (recall that s = N r + p) using r = floor(s / N) and p = s mod N. It can be shown that these expressions for r and p result in the following.

• 0 <= p(v) <= N - 1. This means that p is indeed a time partition vector.

• B w_r = B w and w_r >= 0. This means that r is a valid retiming solution of G.

To summarize, the following four steps can be used to find all valid schedules for an SC DFG.

1) Find all vectors f satisfying the loop constraint and f >= 0.
2) Compute s using (10), with s = 0 for the reference node.
3) r = floor(s / N).
4) p = s mod N.

These four steps give the valid schedules for G. The retiming vector r corresponds to a valid retiming solution for G, and the elements of the partition vector p satisfy 0 <= p(v) <= N - 1.

For each legal folding vector f, the technique in this section finds exactly one schedule, which contains information about the time partitions and the retiming values of the nodes. However, there are other schedules which map the DFG to a folded architecture that has the same f delays on its edges. We call these solutions equivalent schedules, and we call the solution found using Step 2 above the fundamental schedule of the folding vector f. The equivalent schedules are obtained by adding a constant j to every element of s. Replacing s by s + j has two effects. First, the switching instances in the folded architecture are shifted by j. Second, if scheduling is viewed as preprocessing the DFG by retiming (finding r) and then assigning time partitions (finding p), the preprocessed DFG may change because r may change. A nice property of the technique presented in this section is that it finds the fundamental schedule for each folding vector f, and the equivalent schedules are implicitly known to be the uniform shifts s + j.

B. Generating All Retiming Solutions

Since retiming is a special case of scheduling, the techniquesin Section IV-A for generating all scheduling solutions canalso be used to generate all retiming solutions by replacingwith and letting and .

Example 4.2: In this example, we generate the edge intervals and equalities for the graph in Fig. 6. The fundamental loop matrix B for this graph, the weight vector w, and the loop delay sums are determined first. The intervals and equalities are then generated in order using Algorithm IE.

Using these intervals and equalities, the code which generates all retiming solutions for the DFG in Fig. 6 is shown at the bottom of the page. Note that x is used to represent the retimed edge weights.

for (x1 = ...; x1 <= ...; x1++)
 for (x2 = ...; x2 <= ...; x2++)
  x6 = ...;
  for (x3 = ...; x3 <= ...; x3++)
   x7 = ...;
   for (x4 = ...; x4 <= ...; x4++)
    x8 = ...;
    for (x5 = ...; x5 <= ...; x5++)
     x9 = ...;
     if (x6 >= 0 AND x7 >= 0 AND x8 >= 0 AND x9 >= 0)
      print the values of x1 through x9 and r1 through r6


TABLE II
TWELVE VALID RETIMING SOLUTIONS FOR THE DFG IN FIG. 6

Fig. 7. (a) The biquad filter. This graph is not strongly connected. (b) A modified version of the biquad filter. This graph is strongly connected.

There are twelve retiming solutions for the DFG. The retiming vector r is computed from the retimed weight vector w_r using (14), with r = 0 for node 1, which is the reference node. The retimed edge weights and the retiming values for the nodes are listed in Table II.

If a DFG is not strongly connected, it is possible to add edges to the DFG to make it strongly connected so all retiming solutions can be generated. Consider the biquad filter in Fig. 7(a). This graph is not strongly connected because, for example, there is no path from the output node to the input node. To make this graph strongly connected, it can be modified by adding an edge from the output node to the input node, as shown in Fig. 7(b). The modified graph has a new loop which has one delay. This loop forces the latency of the DFG to be one cycle. Using the techniques presented in this section, we find that there are 224 retiming solutions for the DFG in Fig. 7(b).

As another example, consider the correlator in Fig. 8, which is used to demonstrate retiming in [1]. Using the techniques presented in this section, 143 retiming solutions can be found for this DFG. This result was also reported in [22].

Fig. 8. The correlator example which has 143 retiming solutions.

Fig. 9. A third-order all-pole IIR filter.


Fig. 10. The circuits and timing diagrams for the three multipliers (a) -1/4, (b) 1/8, and (c) 1/2 in Fig. 9.

C. Bit-Serial Scheduling

Since the bit-serial scheduling formulation has the same form as the bit-parallel scheduling formulation, the techniques used to generate all bit-parallel scheduling solutions can be used to generate all bit-serial scheduling solutions by replacing the iteration period N with the wordlength W and replacing the folded weight vector f with the serial delay vector D_s. The values of r and p can be computed from s (recall that s = W r + p) using r = floor(s / W) and p = s mod W. It can be shown that these expressions for r and p result in the following.

• 0 <= p(v) <= W - 1. This means that p is indeed a time partition vector.

• B w_r = B w, and w_r >= 0 provided the edge timing condition illustrated in Fig. 5 holds for all edges. This means that r is a valid retiming solution of G when this condition holds for all edges.

Example 4.3: In this example, we generate all possible schedules for the bit-serial implementation of the third-order all-pole filter shown in Fig. 9, assuming two's complement number representation, a data wordlength of 8 (i.e., W = 8), and a coefficient wordlength of 4.

The first step is to determine the timing diagram for each operator. The circuit and timing diagram for an adder are given in Fig. 4. The circuits and timing diagrams for multiplication by -1/4, 1/8, and 1/2 are given in Fig. 10(a)-(c), respectively.

Fig. 11. The timing diagram for the filter in Fig. 9. The edge labels are shown in parentheses to avoid confusion with the timing values.

Using these subcircuits, the timing diagram for the filter is shown in Fig. 11.

The fundamental loop matrix is

In addition, we have the associated edge-weight and timing-value vectors. The equation


Fig. 12. A bit-serial architecture for the third-order all-pole filter. This architecture uses the minimum number of registers (20), not including the registers which are internal to the processing units.

is then written for all loops of the graph.

The intervals and equalities are determined using Algorithm IE.

There are 6103 valid scheduling solutions. To avoid examining all of these solutions, let us examine only those solutions which use the minimum number of serial registers. The number of serial registers is the sum of the serial delays over all edges.

The minimum number of registers over all 6103 valid scheduling solutions is 20, and there are 326 solutions which use 20 registers. One solution that uses 20 registers is selected here.

The complete architecture for this solution is shown in Fig. 12. This architecture uses 20 registers, not including the registers which are internal to the processing units.

V. BIT-PARALLEL SCHEDULING WITH RESOURCE CONSTRAINTS

When all of the schedules are generated for a DFG, this may include many schedules which require more hardware resources than are available for the implementation. In this section, we describe two methods for finding the schedules which satisfy a given set of resource constraints. In the first method (the solution-save method), we generate all scheduling solutions and then save only the solutions which satisfy the resource constraints. In the second method (the solution-generate method), we only generate those scheduling solutions which satisfy the resource constraints.

TABLE III
THE SIX VALID SCHEDULING SOLUTIONS FOR THE BIQUAD FILTER WHICH USE ONE ADDER AND ONE MULTIPLIER FOR AN ITERATION PERIOD OF 4

A. The Solution-Save Method

The number of hardware modules required by a scheduled DFG can be determined from the time partition vector p. For example, let n_m(j) be the number of multiplication operations scheduled to time partition j, and let n_a(j) be the number of addition operations scheduled to time partition j. Then the number of multipliers required by the schedule is the maximum of n_m(j) over all partitions j, and the number of adders is the maximum of n_a(j).

Example 5.1: In this example we find all scheduling solutions which require one multiplier and one adder for the biquad filter in Fig. 7(b), assuming an iteration period of 4 and assuming that addition and multiplication require 1 and 2 units of time, respectively. Nodes 1, 2, 7, and 8 are addition operations, and nodes 3, 4, 5, and 6 are multiplication operations.

The fundamental loop matrix is

The intervals and equalities are

There is a total of 625 valid scheduling solutions for this example; however, only six of these solutions use only one adder and one multiplier. Table III gives the folding vectors and schedule vectors for these solutions, where node 1 has been arbitrarily chosen as the reference node, which forces the scheduling value of node 1 to be zero in all six solutions. Adding 3 to each element of the schedule vector in the fourth solution in Table III results in a new schedule vector. This new schedule vector corresponds to the folding vector for the fourth solution in Table III (recall from Section III-A that adding a constant value to each element of the schedule vector does not change the number of delays on the edges in the folded architecture). This schedule vector is used to fold the biquad filter in [25, Example 11], and the resulting folded architecture is given in [25, Fig. 12(d)].

Example 5.2: Consider the four-stage pipelined eighth-order all-pole lattice filter in Fig. 13. Edge 11 has been added to this filter to make it strongly connected. For the iteration period 2, this filter has 450 scheduling solutions, and 99 of these schedules use two adders and two multipliers, where addition and multiplication are assumed to require one and two units of time, respectively. Of these 99 schedules, the minimum possible number of registers required for the implementation is ten, and only two of these 99 schedules use ten registers. The minimum number of registers is computed using the techniques in [27], with the modification that the results reported here assume that, for a processor that is pipelined by a given number of stages, its pipelining registers cannot be used by output samples from other processors, while the results in [27] allow one pipelining register to be shared by other processors. For the iteration period 4, the filter in Fig. 13 has 910 910 scheduling solutions, and 10 083 of these schedules use one adder and one multiplier. Of these 10 083 schedules, the minimum possible number of registers required for the implementation is 11, and 21 of these 10 083 solutions use 11 registers.

B. The Solution-Generate Method

This section describes a technique for exhaustively generat-ing only the bit-parallel schedules which can be implementedon a given set of hardware resources. Using this technique,we can avoid generating those schedules which use moreresources than are available, and this allows us to generate thedesirable schedules in considerably less time. The followingtheorem, which is stated without proof, is needed so we canconstruct in a manner that allows us to perform exhaustivebit-parallel scheduling with resource constraints.


Fig. 13. The four-stage pipelined eighth-order all-pole lattice filter. The edge labels are in parentheses to avoid confusion with the node labels. One possible spanning tree is shown in solid lines.

Theorem 5.1: In Algorithm FFL, let u be the node that the link of the new fundamental loop is incident from. If u already lies on a previously found loop, then there are no branches in the new loop which are absent from previous loops. Otherwise, the branches in the new loop which have not appeared in previous loops form an elementary directed path.

As described in Section II, we construct the fundamental loop matrix B by letting the i-th loop from Algorithm FFL be the i-th row of B. The edges in the graph are numbered such that the first columns of B correspond to the branches of the spanning tree of G, and the remaining columns correspond to the links. From Theorem 5.1, we know that if loop i contains branches which have not appeared in previous loops, then these branches form an elementary directed path. These branches are assigned to the next available columns of B in the order that they appear in the path. The link of loop i is assigned to the column following all branch columns, in row order. By constructing the fundamental loop matrix in this manner, it still has the form given in (1); however, it now allows us to use Algorithm IE to determine the schedule values of the nodes directly.

The interval for the scheduling problem is found by enforcing (20) for all m such that b_mj = 1. Assume that edge j is incident into node v and incident from node u. From (5), the expression for the j-th folded edge weight is f_j = N w(j) - P_u + s_v - s_u. Substituting this into the interval for f_j gives

lo_j <= N w(j) - P_u + s_v - s_u <= hi_j

for all m such that b_mj = 1. Solving for s_v gives

lo_j + P_u + s_u - N w(j) <= s_v <= hi_j + P_u + s_u - N w(j)

for all m such that b_mj = 1. To avoid confusion with the interval for f_j (recall that we denoted this as [lo_j, hi_j]), the interval for s_v is given its own notation.

Fig. 14. The graph scheduled in Example 5.3.

This notation specifies an interval for the scheduling value of the node that edge j is incident into. Let K_j = P_u + s_u - N w(j). Then the interval for s_v is simply the interval from Algorithm IE with K_j added to the lower and upper bounds.

Using the technique described in this section for constructing the fundamental loop matrix B, Algorithm IE can be used to determine the intervals for the folded edge weights, and the intervals for the scheduling values of the nodes can then be found by shifting these intervals as described above.

Example 5.3: In this example, all possible scheduling solutions are generated for the DFG in Fig. 14 for an iteration period of 4 by generating the solutions for s directly. The computation time for each node is assumed to be unity. Using the technique described in this section for constructing B results in

(22)

Notice that the edge labels in Fig. 14 are different than those used in Fig. 6. The labels have been changed so the column numbers of B in (22) correspond to the edge labels in Fig. 14. The resulting intervals are given in Table IV. Note that in this table, the unit computation time of each node has been used to simplify the upper bounds of the intervals.

The code for this example is shown at the bottom of the next page. The 12 solutions for s generated from this code are the same as those listed in Table I.

Fig. 15. The fifth-order wave digital elliptic filter. The branches of the spanning tree used in Algorithm FFL are shown with solid lines, and the links are shown with dotted lines.

TABLE IV
THE INTERVALS FOR EXAMPLE 5.3

TABLE V
RESULTS OF EXHAUSTIVELY SCHEDULING THE FILTER IN FIG. 15 USING THE TECHNIQUES PRESENTED IN SECTION IV-A

By determining the values of the schedule vector directly, rather than first determining the folding vector and then computing the schedule vector, we can generate only those schedules which can be executed using a limited number of hardware modules. This is done using a programming technique that avoids the solutions which use more resources than are available. For each operation type (e.g., addition or multiplication), an array of data elements is used such that there is one element for each time partition from 0 to N - 1. Each data element contains the number of operations of a given type that is currently scheduled to that time partition. Each data element also keeps track of the next time partition in which the hardware resources for that particular operation type are not fully utilized. By keeping track of this information, when we generate a new schedule by incrementing the schedule value for a node, the node is scheduled to a time partition in which the hardware resources for the operation are not already fully utilized. The end result is that we do not generate the schedules that use more resources than are available, so we can generate all scheduling solutions for a given set of resource constraints much more quickly than if we find all possible schedules and keep only those schedules which satisfy the resource constraints.

for (s2 = ...; s2 <= ...; s2++)
 for (s3 = ...; s3 <= ...; s3++)
  for (s4 = ...; s4 <= ...; s4++)
   for (s5 = ...; s5 <= ...; s5++)
    for (s6 = ...; s6 <= ...; s6++)
     Compute the link weights. If all are nonnegative, print s1 through s6.

TABLE VI
RESULTS OF EXHAUSTIVELY SCHEDULING THE FILTER IN FIG. 15 FOR A GIVEN SET OF RESOURCE CONSTRAINTS USING THE SOLUTION-GENERATE TECHNIQUE PRESENTED IN SECTION V-B. THE LEFT PART OF THE TABLE CONSIDERS SCHEDULING TO THE MINIMUM POSSIBLE NUMBER OF ADDERS AND MULTIPLIERS FOR THE GIVEN ITERATION PERIODS, AND THE RIGHT PART CONSIDERS SCHEDULING TO THE MINIMUM NUMBER OF ADDERS, MULTIPLIERS, AND REGISTERS FOR THE GIVEN ITERATION PERIODS.

The advantages of including the resource constraints are demonstrated using the fifth-order wave digital elliptic filter shown in Fig. 15. We assume that addition and multiplication require one and two units of time, respectively, and that hardware adders and multipliers are pipelined by one and two stages, respectively. The results of exhaustively generating the scheduling solutions without considering resource constraints are shown in Table V. The results of exhaustively generating the scheduling solutions which can be implemented on a given number of hardware adders and multipliers are shown on the left side of Table VI. From these tables, we can see that the time it takes to exhaustively generate only the scheduling solutions which satisfy a given set of resource constraints is orders of magnitude faster than the time it takes to exhaustively generate all scheduling solutions. The expressions in [27] can be used to compute the number of registers required by a given schedule. The results of this are shown on the right side of Table VI. Note that these results assume that internal pipelining registers cannot be shared between processors, while the results in [27] assume that internal pipelining registers can be shared between processors. The CPU times in Tables V and VI result from generating the scheduling solutions from C code running on a Sun Sparcstation 20.

The solution-generate method can exhaustively generate the scheduling solutions for a given number of multipliers and adders much faster than the solution-save method described in Section V-A. This is demonstrated by scheduling the four-stage pipelined eighth-order all-pole lattice filter in Fig. 13 for the iteration period 4 under the constraints that one adder and one multiplier are available, where the adder is pipelined by one stage and the multiplier is pipelined by two stages. As reported in Example 5.2, there are 10 083 scheduling solutions. Finding these solutions requires 21 CPU s using the solution-save technique and only 0.33 CPU s using the solution-generate technique. These CPU times result from C code running on a Sun Sparcstation 20.

VI. CONCLUSION

Formulations have been presented in this paper for the bit-parallel and bit-serial scheduling problems, and we have shown that the retiming formulation introduced in [22] is a special case of the bit-parallel scheduling formulation. Techniques have been developed and demonstrated for exhaustively generating all unique retiming and scheduling solutions for an SC DFG. These techniques allow a circuit designer to explore the space of possible implementations.

In addition to the technique for exhaustively generating all unique bit-parallel scheduling solutions, the solution-generate technique was also developed for exhaustively generating only the bit-parallel scheduling solutions which satisfy a given set of resource constraints. Our results indicate that the solution-generate technique can generate schedules in CPU times that are more than two orders of magnitude faster than generating all solutions.

One advantage of the formulations presented in this paper is that they allow us to understand that retiming and scheduling, which are two problems that have traditionally been viewed as unrelated, are similar. These formulations also show that retiming is an important part of scheduling. Specifically, retiming is shown to be a special case of scheduling, and retiming is included in our scheduling formulations to make them general and to allow the role of retiming to be observed during scheduling.

Table V shows some scheduling results for the fifth-order wave digital elliptic filter. Since this filter is often used to demonstrate scheduling techniques, the numbers in these tables provide some benchmarks for gauging the effectiveness of scheduling algorithms. These numbers indicate that the number of schedules increases dramatically as the difference between the iteration period and the iteration bound becomes larger. While generating all schedules is theoretically interesting, for practical applications our exhaustive scheduling techniques are most useful when the iteration period is at or near the iteration bound. The solution-generate algorithm reduces the computational complexity required to generate the solutions which use a given set of functional units; however, the number of schedules and the CPU times required to generate these schedules can still be quite large, as shown in Table VI for the fifth-order wave digital elliptic filter at the largest iteration period considered. Future research topics include using the formulations developed in this paper as a basis for scheduling algorithms which use shorter CPU times to find a single optimal schedule based on speed, area, and power constraints.

Throughout this paper, we have focused on SC graphs. However, many DSP algorithms, such as finite impulse response (FIR) filters, have DFG's which are not strongly connected. One way to make these DFG's strongly connected is to add an edge from the output to the input, as was demonstrated for the biquad filter in Section IV-B and the four-stage pipelined eighth-order all-pole lattice filter in Section V-A. Another way to make a graph strongly connected is to add a dummy node and add an edge from each output to the dummy node and an edge from the dummy node to each input. Both of these methods add new edges to the DFG, and the designer can place delays on these edges. In general, adding delays to these edges increases the latency of the DFG and increases the number of delays in some of the loops, which in turn increases the number of retiming and scheduling solutions for the DFG. While this provides a qualitative description of how adding delays to the input and output edges affects the number of retiming and scheduling solutions, a quantitative solution to this problem is beyond the scope of this paper and is a topic of future research.

This paper has focused on scheduling and retiming of DSP algorithms which can be represented using data-flow graphs. Some DSP algorithms can only be represented using control-flow graphs. Using our formulations to develop scheduling techniques for these algorithms is a topic of future research. Another topic of future research is to include more diverse resource constraints, such as interconnection costs, into these scheduling techniques.

REFERENCES

[1] C. Leiserson, F. Rose, and J. Saxe, “Optimizing synchronous circuitry by retiming,” in Third Caltech Conf. on VLSI, 1983, pp. 87–116.

[2] T. C. Denk and K. K. Parhi, “A unified framework for characterizing retiming and scheduling solutions,” in Proc. IEEE ISCAS, Atlanta, GA, May 1996, vol. 4, pp. 568–571.

[3] P. Dewilde, E. Deprettere, and R. Nouta, “Parallel and pipelined VLSI implementation of signal processing algorithms,” in VLSI and Modern Signal Processing, S. Y. Kung, H. J. Whitehouse, and T. Kailath, Eds. Englewood Cliffs, NJ: Prentice-Hall, 1985, ch. 15, pp. 257–276.

[4] M. C. McFarland, A. C. Parker, and R. Composano, “The high-level synthesis of digital systems,” Proc. IEEE, vol. 78, pp. 301–318, Feb. 1990.

[5] T.-F. Lee, A. C.-H. Wu, Y.-L. Lin, and D. D. Gajski, “A transformation-based method for loop folding,” IEEE Trans. Computer-Aided Design, vol. 13, pp. 439–450, Apr. 1994.

[6] P. Lippens, J. Van Meerbergen, A. Van der Werf, W. Verhaegh, B. McSweeney, J. Huisken, and O. McArdle, “PHIDEO: A silicon compiler for high speed algorithms,” in Proc. Eur. Conf. on Design Automation, Amsterdam, Feb. 1991, pp. 436–441.

[7] M. Potkonjak and J. Rabaey, “Retiming for scheduling,” in VLSI Signal Processing IV, Nov. 1990, pp. 23–32.

[8] L.-F. Chao, A. LaPaugh, and E. H. Sha, “Rotation scheduling: A loop pipelining algorithm,” in Proc. 30th Design Automation Conf., June 1993, pp. 566–572.

[9] M. Potkonjak and J. Rabaey, “Pipelining: Just another transformation,” in Proc. IEEE Int. Conf. on Application-Specific Array Processors, Oakland, CA, Aug. 1992, pp. 163–177.

[10] C. H. Gebotys and M. I. Elmasry, “Optimal synthesis of high-performance architectures,” IEEE J. Solid-State Circuits, vol. 27, pp. 389–397, Mar. 1992.

[11] C. H. Gebotys, “Synthesizing embedded speed optimized architectures,” IEEE J. Solid-State Circuits, vol. 28, pp. 242–252, Mar. 1993.

[12] S. M. Heemstra de Groot, S. H. Gerez, and O. E. Herrmann, “Range-chart guided iterative data-flow graph scheduling,” IEEE Trans. Circuits Syst. I, vol. 39, pp. 351–364, May 1992.

[13] H. De Man, J. Rabaey, P. Six, and L. Claesen, “Cathedral II: A silicon compiler for digital signal processing,” IEEE Design & Test, vol. 13, pp. 13–25, Dec. 1986.

[14] H. De Man, F. Catthoor, G. Goossens, J. Vanhoof, J. Van Meerbergen, and J. Huisken, “Architecture driven synthesis techniques for VLSI implementation of DSP algorithms,” Proc. IEEE, pp. 319–335, Feb. 1990.

[15] J. Vanhoof, K. Van Rompaey, I. Bolsens, G. Goossens, and H. De Man, High-Level Synthesis for Real-Time Digital Signal Processing. Boston, MA: Kluwer, 1993.

[16] R. I. Hartley and J. R. Jasica, “Behavioral to structural translation in a bit-serial silicon compiler,” IEEE Trans. Computer-Aided Design, vol. 7, pp. 877–886, Aug. 1988.

[17] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, “A formal approach to the scheduling problem in high-level synthesis,” IEEE Trans. Computer-Aided Design, vol. 10, pp. 464–475, Apr. 1991.

[18] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, “Fast prototyping of data-path intensive architectures,” IEEE Design & Test, pp. 40–51, June 1991.

[19] M. Potkonjak and J. Rabaey, “Fast implementation of recursive programs using transformations,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, San Francisco, CA, Mar. 1992, vol. V, pp. 569–572.

[20] C.-Y. Wang and K. K. Parhi, “High-level DSP synthesis using concurrent transformations, scheduling, and allocation,” IEEE Trans. Computer-Aided Design, vol. 14, pp. 274–295, Mar. 1995.

[21] J. Monteiro, S. Devadas, and A. Ghosh, “Retiming sequential circuits for low power,” in Proc. IEEE Int. Conf. on Computer-Aided Design, 1993, pp. 398–402.

[22] S. Simon, E. Bernard, M. Sauer, and J. Nossek, “A new retiming algorithm for circuit design,” in Proc. IEEE ISCAS, London, England, May 1994.

[23] D. Johnson and J. Johnson, Graph Theory with Engineering Applications. New York: Ronald Press, 1972.

[24] E. M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1977.

[25] K. K. Parhi, C.-Y. Wang, and A. P. Brown, “Synthesis of control circuits in folded pipelined DSP architectures,” IEEE J. Solid-State Circuits, vol. 27, pp. 29–43, Jan. 1992.

[26] R. I. Hartley and K. K. Parhi, Digit-Serial Computation. Boston, MA: Kluwer, 1995.

[27] T. C. Denk and K. K. Parhi, “Lower bounds on memory requirements for statically scheduled DSP programs,” J. VLSI Signal Processing, vol. 12, no. 3, pp. 247–264, June 1996.

Tracy C. Denk (S’91–M’96) received the B.S. and Ph.D. degrees from the University of Minnesota, Minneapolis, in 1990 and 1996, respectively, and the M.S. degree from the University of Rochester, Rochester, NY, in 1991, all in electrical engineering.

From 1996 to 1998, he was a Member of Technical Staff at Lucent Technologies, Bell Laboratories, Holmdel, NJ, where he worked on hardware and software implementations of broad-band communication algorithms. He is presently employed at Broadcom Corporation, Irvine, CA, where he is working on integrated circuit design for broad-band digital data transmission. His research interests include algorithm transformations and high-level synthesis of VLSI architectures for single-rate and multirate DSP algorithms.

Dr. Denk was a General Motors Scholar from 1987 to 1990 and received an Air Force Laboratory Graduate Fellowship in 1991.

Keshab K. Parhi (S’85–M’88–SM’91–F’96) received degrees from the Indian Institute of Technology, Kharagpur, India, in 1982, the University of Pennsylvania, Philadelphia, in 1984, and the University of California at Berkeley in 1988.

He is the Edgar F. Johnson Professor of Electrical and Computer Engineering at the University of Minnesota, Minneapolis. His research interests include all aspects of VLSI digital signal and image processing. He is currently working on VLSI digital filters, adaptive digital filters, equalizers and beamformers, error control coder architectures, programmable and custom DSP architectures, design methodologies, digital integrated circuits, low-power digital systems, and computer arithmetic. He has published over 220 papers in these areas. He held the McKnight–Land Grant Professorship at the University of Minnesota from 1992 to 1994. He is a former Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—PART II: ANALOG AND DIGITAL SIGNAL PROCESSING, and is currently an Associate Editor of the IEEE TRANSACTIONS ON VLSI SYSTEMS and IEEE SIGNAL PROCESSING LETTERS, and an Editor of the Journal of VLSI Signal Processing. He is the Guest Editor of a special issue of the IEEE TRANSACTIONS ON SIGNAL PROCESSING and two special issues of the Journal of VLSI Signal Processing. He served as technical program cochair of the 1995 IEEE Workshop on Signal Processing and the 1996 ASAP conference. He is a distinguished lecturer of the IEEE Circuits and Systems Society.

Dr. Parhi received a 1996 Design Automation Conference best paper award, the 1994 Darlington and 1993 Guillemin–Cauer best paper awards from the IEEE Circuits and Systems Society, the 1991 paper award from the IEEE Signal Processing Society, the 1991 Browder Thompson prize paper award from the IEEE, and the 1992 Young Investigator Award of the National Science Foundation.

