24 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 1, JANUARY 1987

Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing

EDWARD ASHFORD LEE, MEMBER, IEEE, AND DAVID G. MESSERSCHMITT, FELLOW, IEEE

Abstract-Large grain data flow (LGDF) programming is natural and convenient for describing digital signal processing (DSP) systems, but its runtime overhead is costly in real time or cost-sensitive applications. In some situations, designers are not willing to squander computing resources for the sake of programmer convenience. This is particularly true when the target machine is a programmable DSP chip. However, the runtime overhead inherent in most LGDF implementations is not required for most signal processing systems because such systems are mostly synchronous (in the DSP sense). Synchronous data flow (SDF) differs from traditional data flow in that the amount of data produced and consumed by a data flow node is specified a priori for each input and output. This is equivalent to specifying the relative sample rates in a signal processing system. This means that the scheduling of SDF nodes need not be done at runtime, but can be done at compile time (statically), so the runtime overhead evaporates. The sample rates can all be different, which is not true of most current data-driven digital signal processing programming methodologies. Synchronous data flow is closely related to computation graphs, a special case of Petri nets. This self-contained paper develops the theory necessary to statically schedule SDF programs on single or multiple processors. A class of static (compile time) scheduling algorithms is proven valid, and specific algorithms are given for scheduling SDF systems onto single or multiple processors.

Index Terms-Block diagram, computation graphs, data flow, digital signal processing, hard real-time systems, multiprocessing, Petri nets, static scheduling, synchronous data flow.

I. INTRODUCTION

To achieve high performance in a processor specialized for signal processing, the need to depart from the simplicity of von Neumann computer architectures is axiomatic. Yet, in the software realm, deviations from von Neumann programming are often viewed with suspicion. For example, in the design of most successful commercial signal processors today [1]-[5], compromises are made to preserve sequential programming. Two notable exceptions are the Bell Labs DSP family [6], [7] and the NEC data flow chip [8], both of which are programmed with concurrency in mind. For the majority, however, preserving von Neumann programming style is

given priority. This practice has a long and distinguished history. Often, a new non-von Neumann architecture has elaborate hardware

Manuscript received August 15, 1985; revised March 17, 1986. This work was supported in part by the National Science Foundation under Grant ECS-8211071, an IBM Fellowship, and a grant from the Shell Development Corporation.
The authors are with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720.
IEEE Log Number 8611442.

and software techniques enabling a programmer to write sequential code irrespective of the parallel nature of the underlying hardware. For example, in machines with multiple function units, such as the CDC 6600 and Cray family, so-called scoreboarding hardware resolves conflicts to ensure the integrity of sequential code. In deeply pipelined machines such as the IBM 360 Model 91, interlocking mechanisms [9] resolve pipeline conflicts. In the M.I.T. Lincoln Labs signal processor [10], specialized associative memories are used to ensure the integrity of data precedences. The affinity for von Neumann programming is not at all surprising, stemming from familiarity and a proven track record, but the cost is high in the design of specialized digital signal processors. Comparing two pipelined chips that differ radically only in programming methodology, the TI TMS32010 [2] and the Bell Labs DSP20, a faster version of the DSP1 [6], we find that they achieve exactly the same performance on the most basic benchmark, the FIR (finite impulse response) filter. But the Bell Labs chip outperforms the TI chip on the next most basic benchmark, the IIR (infinite impulse response) filter. Surprisingly, close examination reveals that the arithmetic hardware (multiplier and ALU) of the Bell Labs chip is half as fast as in the TI chip. The performance gain appears to follow from the departure from conventional sequential programming. However, programming the Bell Labs chip is not easy. The code more closely resembles horizontal microcode than assembly languages. Programmers invariably adhere to the quaint custom of programming these processors in assembler-level languages, for maximum use of hardware resources. Satisfactory compilers have failed to appear. In this paper, we propose programming signal processors using a technique based on large grain data flow (LGDF) languages [11], which should ease the programming task by enhancing the modularity of code and permitting algorithms to

be described more naturally. In addition, concurrency is immediately evident in the program description, so parallel hardware resources can be used more effectively. We begin by reviewing the data flow paradigm and its relationship with previous methods applied to signal processing. Synchronous data flow (SDF) is introduced, with its suitability for describing signal processing systems explained. The advantage of SDF over conventional data flow is that more efficient runtime code can be generated because the data flow nodes can be scheduled at compile time, rather than at runtime. A class of algorithms for constructing sequential (single processor) schedules is proven valid, and a simple

0018-9340/87/0100-0024$01.00 © 1987 IEEE


heuristic for constructing parallel (multiprocessor) schedules is described. Finally, the limitations of the model are considered.

II. THE DATA FLOW PARADIGM

In data flow, a program is divided into pieces (nodes or blocks) which can execute (fire) whenever input data are available [12], [13]. An algorithm is described as a data flow graph, a directed graph where the nodes represent functions

and the arcs represent data paths, as shown in Fig. 1. Signal processing algorithms are usually described in the literature by a combination of mathematical expressions and block diagrams. Block diagrams are large grain data flow (LGDF) graphs [14]-[16], in which the nodes or blocks may be atomic (from the Greek atomos, or indivisible), such as adders or multipliers, or nonatomic (large grain), such as digital filters, FFT units, modulators, or phase locked loops. The arcs connecting blocks show the signal paths, where a signal is simply an infinite stream of data, and each data token is called a sample. The complexity of the functions (the granularity) will determine the amount of parallelism available because, while the blocks can sometimes be executed concurrently, we make no attempt to exploit the concurrency inside a block. The functions within the blocks can be specified using conventional von Neumann programming techniques. If the granularity is at the level of signal processing subsystems (second-order sections, butterfly units, etc.), then the specification of a system will be extremely natural and enough concurrency will be evident to exploit at least small-scale parallel processors. The blocks can themselves represent another data flow graph, so the specification can be hierarchical. This is consistent with the general practice in signal processing where, for example, an adaptive equalizer may be treated as a block in a large system, and may itself be a network of simpler blocks. LGDF is ideally suited for signal processing, and has been adopted in simulators in the past [17]. Other signal processing

systems use a data-driven paradigm to partition a task among cooperating processors [18], and many so-called block diagram languages have been developed to permit programmers to describe signal processing systems more naturally. Some examples are Blodi [19], Patsi [20], Blodib [21], Lotus [22], Dare [23], Mitsyn [24], Circus [25], and Topsim [26]. But these simulators are based on the principle of next state simulation [20], [27] and thus have difficulty with multiple sample rates, not to mention asynchronous systems. (We use the term asynchronous here in the DSP sense to refer to systems with sample rates that are not related by a rational multiplicative factor.) Although true asynchrony is rare in signal processing, multiple sample rates are common, stemming from the frequent use of decimation and interpolation. The technique we propose here handles multiple sample rates easily.

In addition to being natural for DSP, large grain data flow has another significant advantage for signal processing. As long as the integrity of the flow of data is preserved, any implementation of a data flow description will produce the same results. This means that the same software description of


Fig. 1. A three-node data flow graph with one input and two outputs. The nodes represent functions of arbitrary complexity, and the arcs represent paths on which sequences of data (tokens or samples) flow.

a signal processing system can be simulated on a single processor or multiple processors, implemented in specialized hardware, or even, ultimately, compiled into a VLSI chip [28].

III. SYNCHRONOUS DATA FLOW GRAPHS

In this paper we concentrate on synchronous systems. At the risk of being pedantic, we define this precisely. A block is a function that is invoked when there is enough input available to perform a computation (blocks lacking inputs can be invoked at any time). When a block is invoked, it will consume a fixed number of new input samples on each input path. These samples may remain in the system for some time

to be used as old samples [17], but they will never again be considered new samples. A block is said to be synchronous if we can specify a priori the number of input samples consumed on each input and the number of output samples produced on each output each time the block is invoked. Thus, a synchronous block is shown in Fig. 2(a) with a number associated with each input or output specifying the number of inputs consumed or the number of outputs produced. These numbers are part of the block definition. For example, a digital filter block would have one input and one output, and the number of input samples consumed or output samples produced would be one. A 2:1 decimator block would also have one input and one output, but would consume two samples for every sample produced. A synchronous data flow (SDF) graph is a network of synchronous blocks, as in Fig. 2(b). SDF graphs are closely related to computation graphs, introduced in 1966 by Karp and Miller [29] and further explored by Reiter [30]. Computation graphs are slightly more elaborate than SDF graphs, in that each input to a block has two numbers associated with it, a threshold and the number of samples consumed. The threshold specifies the number of samples required to invoke the block, and may be different from the number of samples consumed by the block. It cannot, of course, be smaller than the number of samples consumed. The use of a distinct threshold in the model, however, does not significantly change the results presented in this paper, so for simplicity, we assume these two numbers are the same. Karp and Miller [29] show that computations specified by a computation graph are determinate, meaning that the same computations are performed by any proper execution. This type of theorem, of course, also underlies the validity of data flow descriptions. They also give a test to determine whether a computation terminates, which is potentially useful because in signal processing we are mainly interested in computations that do not terminate. We assume that signal processing



Fig. 2. (a) A synchronous node. (b) A synchronous data flow graph.

systems repetitively apply an algorithm to an infinite sequence of data. To make it easier to describe such applications, we expand the model slightly to allow nodes with no inputs. These can fire at any time. Other results presented in [29] are only applicable to computations that terminate, and therefore are not useful in our application.
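To make the definition of a synchronous block concrete, the fixed production and consumption counts can be captured in a small data structure. A minimal sketch in Python; the class and field names are ours, not the paper's:

```python
from dataclasses import dataclass

# Hypothetical representation (ours, not the paper's): an SDF arc records
# the fixed, a priori known number of samples produced and consumed per
# invocation, which is exactly what makes a block "synchronous."
@dataclass(frozen=True)
class Arc:
    src: int        # producing block
    dst: int        # consuming block
    produced: int   # samples written by src each time it fires
    consumed: int   # samples read by dst each time it fires
    delay: int = 0  # initial samples on the arc (see the delay discussion)

# The paper's 2:1 decimator example: one sample out for every two in.
decimator = Arc(src=0, dst=1, produced=1, consumed=2)
```

A collection of such arcs over numbered blocks is an SDF graph; the topology matrix of Section IV can be read off directly from the `produced` and `consumed` fields.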

Computation graphs have been shown to be a special case of Petri nets [31]-[33] or vector addition systems [34]. These more general models can be used to describe asynchronous systems. There has also been work with models that are special cases of computation graphs. In 1971, Commoner and Holt [35] described marked directed graphs, and reached some conclusions similar to those presented in this paper. However, marked directed graphs are much more restricted than SDF graphs because they constrain the number of samples produced or consumed on any arc to unity. This excessively restricts the sample rates in the system, reducing the utility of the model. In 1968, Reiter [36] simplified the computation graph model in much the same way (with minor variations), and tackled a scheduling problem. However, his scheduling problem assumes that each node in the graph is a processor, and the only unknown is the firing time for the invocation of each associated function. In this paper we preserve the generality of computation graphs and solve a different scheduling problem, relevant to data flow programming, in which nodes represent functions that must be mapped onto processors.

Implementing the signal processing system described by an SDF graph requires buffering the data samples passed between blocks and scheduling blocks so that they are executed when data are available. This could be done dynamically, in which case a runtime supervisor determines when blocks are ready for execution and schedules them onto processors as they become free. This runtime supervisor may be a software routine or specialized hardware, and is the same as the control mechanisms generally associated with data flow. It is a costly approach, however, in that the supervisory overhead can become severe, particularly if relatively little computation is done each time a block is invoked. SDF graphs, however, can be scheduled statically (at compile time), regardless of the number of processors, and the overhead associated with dynamic control evaporates. Specifically, a large grain compiler determines the order in which nodes can be executed and constructs sequential code for each

processor. Communication between nodes and between processors is set up by the compiler, so no runtime control is required beyond the traditional sequential control in the processors. The LGDF paradigm gives the programmer a natural interface for easily constructing well structured signal processing programs, with evident concurrency, and the large grain compiler maps this concurrency onto parallel processors. This paper is dedicated mainly to demonstrating the feasibility of such a large grain compiler.

IV. A SYNCHRONOUS LARGE GRAIN COMPILER

We need a methodology for translating from an SDF graph to a set of sequential programs running on a number of processors. Such a compiler has the two following basic tasks.

• Allocation of shared memory for the passing of data between blocks, if shared memory exists, or setting up communication paths if not.

• Scheduling blocks onto processors in such a way that data is available for a block when that block is invoked.

The first task is not an unfamiliar one. A single processor solution (which also handles asynchronous systems) is given by the buffer management techniques in Blosim [17]. Simplifications of these techniques that use the synchrony of the system are easy to imagine, as are generalizations to multiple processors, so this paper will concentrate on the second task, that of scheduling blocks onto processors so that data are available when a block is invoked.

Some assumptions are necessary.

• The SDF graph is nonterminating (cf. [29], [30]), meaning that it can run forever without deadlock. As mentioned earlier, this assumption is natural for signal processing.

• The SDF graph is connected. If not, the separate graphs can be scheduled separately using subsets of the processors.

Specifically, our ultimate goal is a periodic admissible parallel schedule, designated PAPS. The schedule should be periodic because of the assumption that we are repetitively applying the same program on an infinite stream of data. The desired schedule is admissible, meaning that blocks will be scheduled to run only when data are available, and that a finite amount of memory is required. It is parallel in that more than one processing resource can be used. A special case is a periodic admissible sequential schedule, or PASS, which implements an SDF graph on a single processor. The method for constructing a PASS leads to a simple solution to the problem of constructing a PAPS, so we begin with the sequential schedule.

A. Construction of a PASS

A simple SDF graph is shown in Fig. 3, with each block and each arc labeled with a number. (The connections to the outside world are not considered, and for the remainder of the paper, will not be shown. Thus, a block with one input from the outside will be considered a block with no inputs, which can therefore be scheduled at any time. The limitations of this approximation are discussed in Section V.) An SDF graph can


Fig. 3. An SDF graph showing the numbering of the nodes and arcs. The input and output arcs are ignored for now.

be characterized by a matrix similar to the incidence matrix associated with directed graphs in graph theory. It is constructed by first numbering each node and arc, as in Fig. 3, and assigning a column to each node and a row to each arc. The (i, j)th entry in the matrix is the amount of data produced by node j on arc i each time it is invoked. If node j consumes data from arc i, the number is negative, and if it is not connected to arc i, then the number is zero. For the graph in Fig. 3 we get

This matrix can be called a topology matrix Γ, and need not be square, in general.

If a node has a connection to itself (a self-loop), then only one entry in Γ describes this link. This entry gives the net difference between the amount of data produced on this link and the amount consumed each time the block is invoked. This difference should clearly be zero for a correctly constructed graph, so the Γ entry describing a self-loop should be zero.

We can replace each arc with a FIFO queue (buffer) to pass

data from one block to another. The size of the queue will vary at different times in the execution. Define the vector b(n) to contain the queue sizes of all the buffers at time n. In Blosim [17], buffers are also used to store old samples (samples that have been "consumed"), making implementations of delay lines particularly easy. These past samples are not considered part of the buffer size here. For the sequential schedule, only one block can be invoked at a time, and for the purposes of scheduling it does not matter how long it runs. Thus, the index n can simply be incremented each time a block finishes and a new block is begun. We

specify the block invoked at time n with a vector v(n), which has a one in the position corresponding to the number of the block that is invoked at time n and zeros for each block that is not invoked. For the system in Fig. 3, in a sequential schedule, v(n) can take one of three values,

v(n) = [1, 0, 0]^T, [0, 1, 0]^T, or [0, 0, 1]^T, (2)

depending on which of the three blocks is invoked. Each time a block is invoked, it will consume data from zero or more input arcs and produce data on zero or more output arcs. The change in the size of the buffer queues caused by invoking a node is given by

b(n + 1) = b(n) + Γv(n). (3)
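The recurrence (3) can be simulated directly. The following sketch uses a hypothetical two-arc, three-block topology matrix (the paper's actual matrix for Fig. 3 is not reproduced here) and fires each block once:

```python
# Simulate b(n+1) = b(n) + Gamma v(n) for a hypothetical topology matrix
# (the paper's matrix for Fig. 3 is not reproduced here). Rows are arcs,
# columns are blocks; entry (i, j) is positive if block j produces on arc i
# when it fires, negative if it consumes.
GAMMA = [
    [1, -1,  0],   # arc 0: block 0 -> block 1, one sample each way
    [1,  0, -1],   # arc 1: block 0 -> block 2, one sample each way
]

def fire(b, block):
    """Return the buffer state after invoking one block (equation (3))."""
    return [b[i] + row[block] for i, row in enumerate(GAMMA)]

b = [0, 0]                 # b(0): empty buffers (no delays)
for block in (0, 1, 2):    # one period: fire block 0, then 1, then 2
    b = fire(b, block)
print(b)                   # buffers return to the initial state: [0, 0]
```

A delay of d samples on an arc, as discussed below, would simply change the starting vector `b` from all zeros to one with d in the corresponding position.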

Fig. 4. An example of an SDF graph with delays on the arcs.

The topology matrix Γ characterizes the effect on the buffers of running a node program.

The simple computation model is powerful. First we note that the computation model handles delays. The term delay is used in the signal processing sense, corresponding to a sample offset between the input and the output. We define a unit delay on an arc from node A to node B to mean that the nth sample consumed by B will be the (n - 1)th sample produced by A. This implies that the first sample the destination block consumes is not produced by the source block at all, but is part of the initial state of the arc buffer. Indeed, a delay of d samples on an arc is implemented in our model simply by setting an initial condition for (3). Specifically, the initial buffer state, b(0), should have a d in the position corresponding to the arc with the delay of d units.

To make this idea firm, consider the example system in Fig. 4. The symbol "D" on an arc means a single sample delay, while "2D" means a two-sample delay. The initial condition for the buffers is thus

(4)

Because of these initial conditions, block 2 can be invoked once and block 3 twice before block 1 is invoked at all. Delays, therefore, affect the way the system starts up.

Given this computation model we can

• find necessary and sufficient conditions for the existence of a PASS, and hence a PAPS;
• find practical algorithms that provably find a PASS if one exists;
• find practical algorithms that construct a reasonable (but not necessarily optimal) PAPS, if a PASS exists.

We begin by showing that a necessary condition for the existence of a PASS is

rank(Γ) = s - 1 (5)

where s is the number of blocks in the graph. We need a series of lemmas before we can prove this. The word "node" is used below to refer to the blocks because it is traditional in graph theory.

Lemma 1: All topology matrices for a given SDF graph have the same rank.

Proof: Topology matrices are related by renumbering of nodes and arcs, which translates into row and column permutations in the topology matrix. Such operations preserve the rank. Q.E.D.
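The necessary condition (5) can be checked mechanically with exact rational arithmetic. A minimal sketch; the function is ours, not part of the paper:

```python
from fractions import Fraction

def rank(matrix):
    """Matrix rank by Gaussian elimination over the rationals (exact)."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    r, col = 0, 0
    while r < rows and col < cols:
        pivot = next((i for i in range(r, rows) if m[i][col] != 0), None)
        if pivot is None:
            col += 1            # no pivot in this column
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(rows):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        col += 1
    return r

# A consistent three-block chain (2 arcs, 3 blocks): rank is s - 1 = 2,
# so the necessary condition (5) for a PASS is met.
gamma = [[1, -1, 0], [0, 2, -1]]
print(rank(gamma) == len(gamma[0]) - 1)   # True
```

Using `Fraction` rather than floating point matters here: the rank test separates s from s - 1 exactly, and rounding error in floating-point elimination could blur that distinction.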

Lemma 2: A topology matrix for a tree graph has rank s - 1, where s is the number of nodes (a tree is a connected graph without cycles, where we ignore the directions of the arcs).

Proof: Proof is by induction. The lemma is clearly true for a two-node tree. Assume that for an N-node tree


rank(Γ_N) = N - 1. Adding one node and one link connecting that node to our graph will yield an N + 1 node tree. A topology matrix for the new graph can be written with Γ_N in the upper left, a column vector 0 full of zeros in the upper right, and a row vector ρ below, corresponding to the arc we just added. The last entry in ρ is nonzero because the node we just added corresponds to the last column, and it must be connected to the graph. Hence, the last row is linearly independent from the other rows, so rank(Γ_{N+1}) = rank(Γ_N) + 1. Q.E.D.

Fig. 5. Example of a defective SDF graph with sample rate inconsistencies; its topology matrix has rank(Γ) = s = 3.

Lemma 3: For a connected SDF graph with topology matrix Γ,

rank(Γ) ≥ s - 1

where s is the number of nodes in the graph.

Proof: Consider any spanning tree of the connected SDF graph (a spanning tree is a tree that includes every node in the graph). Now define Γ_T to be the topology matrix for this subgraph. By Lemma 2, rank(Γ_T) = s - 1. Adding arcs to the subgraph simply adds rows to the topology matrix. Adding rows to a matrix can increase the rank, if the rows are linearly independent of existing rows, but cannot decrease it. Q.E.D.

Corollary: rank(Γ) = s - 1 or s.

Proof: Γ has only s columns, so its rank cannot exceed s. Therefore, by Lemma 3, s and s - 1 are the only possibilities. Q.E.D.

Definition 1: An admissible sequential schedule φ is a nonempty ordered list of nodes such that if the nodes are executed in the sequence given by φ, the amount of data in the buffers ("buffer sizes") will remain nonnegative and bounded. Each node must appear in φ at least once.

A periodic admissible sequential schedule (PASS) is a periodic and infinite admissible sequential schedule. It is specified by a list φ that is the list of nodes in one period.

For the example in Fig. 6, φ = {1, 2, 3, 3} is a PASS, but φ = {2, 1, 3, 3} is not because node 2 cannot be run before node 1. The list φ = {1, 2, 3} is not a PASS because the infinite schedule resulting from repetitions of this list will result in an infinite accumulation of data samples on the arcs leading into node 3.

Theorem 1: For a connected SDF graph with s nodes and topology matrix Γ, rank(Γ) = s - 1 is a necessary condition for a PASS to exist.

Proof: We must prove that the existence of a PASS of period p implies rank(Γ) = s - 1. Observe from (3) that we can write


b(p) = b(0) + Γq, where q = Σ_{n=0}^{p-1} v(n).

Since the PASS is periodic, we can write

b(np) = b(0) + nΓq.

Fig. 6. An SDF graph with consistent sample rates has a positive integer vector q in the nullspace of the topology matrix Γ.

Since the PASS is admissible, the buffers must remain bounded, by Definition 1. The buffers remain bounded if and only if

Γq = 0

where 0 is a vector full of zeros. For q ≠ 0 this implies that rank(Γ) < s, where s is the dimension of q. From the corollary of Lemma 3, rank(Γ) is either s or s - 1, and so it must be s - 1. Q.E.D.

This theorem tells us that if we have an SDF graph with a topology matrix of rank s, the graph is somehow defective, because no PASS can be found for it. Fig. 5 illustrates such a graph and its topology matrix. Any schedule for this graph will result either in deadlock or unbounded buffer sizes, as the reader can easily verify. The rank of the topology matrix indicates a sample rate inconsistency in the graph. In Fig. 6, by contrast, a graph without this defect is shown. The topology matrix has rank s - 1 = 2, so we can find a vector q such that Γq = 0. Furthermore, the following theorem shows that we can find a positive integer vector q in the nullspace of Γ. This vector tells us how many times we should invoke each node in one period of a PASS. Referring again to Fig. 6, the reader can easily verify that if we invoke node 1 once, node 2 once, followed by node 3 twice, the buffers will end up once again in their initial state. As before, we prove some lemmas before getting to the theorem.

Lemma 4: Assume a connected SDF graph with topology matrix Γ. Let q be any vector such that Γq = 0. Denote a connected path through the graph by the set B = {b_1, ..., b_L}, where each entry designates a node, and node b_1 is connected to node b_2, node b_2 to node b_3, and so on up to b_L. Then all q_b, b ∈ B, are zero, or all are strictly positive, or all are strictly negative. Furthermore, if any q_b is rational then all q_b are rational.


Proof: By induction. First consider a connected path of two nodes, B_2 = {b_1, b_2}. If the arc connecting these two nodes is the jth arc, then

Γ_{j,b_1} q_{b_1} + Γ_{j,b_2} q_{b_2} = 0

(by definition of the topology matrix, the jth row has only two entries). Also by definition, Γ_{j,b_1} and Γ_{j,b_2} are nonzero integers of opposite sign. The lemma thus follows immediately for B_2. Now assuming the lemma is true for B_n, proving it true for B_{n+1} is trivial, using the same reasoning as in the proof for B_2, and considering the connection between nodes b_n and b_{n+1}. Q.E.D.

Corollary: Given an SDF graph as in Lemma 4, either all q_i are zero, or all are strictly positive, or all are strictly negative. Furthermore, if any one q_i is rational, then all are.

Proof: In a connected SDF graph, a path exists from any node to any other. Thus, the corollary follows immediately from the lemma.

Theorem 2: For a connected SDF graph with s nodes and topology matrix Γ, and with rank(Γ) = s - 1, we can find a positive integer vector q ≠ 0 such that Γq = 0, where 0 is the zero vector.

Proof: Since rank(Γ) = s - 1, a vector u ≠ 0 can be found such that Γu = 0. Furthermore, for any scalar α, Γ(αu) = 0. Let α = 1/u_1 and u′ = αu. Then u′_1 = 1, and by the corollary to Lemma 4, all other elements in u′ are positive rational numbers. Let η be a common multiple of all the denominators of the elements of u′, and let q = ηu′. Then q is a positive integer vector such that Γq = 0. Q.E.D.

It may be desirable to solve for the smallest positive integer vector in the nullspace, in the sense of the sum of the elements. To do this, reduce each rational entry in u′ so that its numerator and denominator are relatively prime. Euclid's algorithm (see for example [37]) will work for this. Now find the least common multiple η of all the denominators, again using Euclid's algorithm. Now ηu′ is the smallest positive integer vector in the nullspace of Γ.

We now have a necessary condition for the existence of a PASS, that the rank of Γ be s - 1. A sufficient condition and an algorithm for finding a PASS would be useful. We now characterize a class of algorithms that will find a PASS if one exists, and will fail clearly if not. Thus, successful completion of such an algorithm is a sufficient condition for the existence of the PASS.

Definition 2: A predecessor to a node x is a node feeding data to x.

Lemma 5: To determine whether a node x in an SDF graph can be scheduled at time i, it is sufficient to know how many times x and its predecessors have been scheduled, and to know b(0), the initial state of the buffers. That is, we need not know in what order the predecessors were scheduled nor what other nodes have been scheduled in between.
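Theorem 2's construction is easy to mechanize: fix q_1 = 1, propagate the rational rate ratios along arcs as in Lemma 4, and clear denominators with a least common multiple. A sketch under our own naming assumptions (this is our illustration, not code from the paper); like the paper, it assumes the graph is connected:

```python
from fractions import Fraction
from math import lcm

def repetitions(num_blocks, arcs):
    """Smallest positive integer q with Gamma q = 0, or None if the
    sample rates are inconsistent (so no PASS can exist).

    arcs: iterable of (src, dst, produced, consumed) tuples.
    Assumes the graph is connected, as the paper does.
    """
    q = [None] * num_blocks
    q[0] = Fraction(1)          # fix q_1 = 1, as in the proof of Theorem 2
    changed = True
    while changed:              # propagate rate ratios along arcs (Lemma 4)
        changed = False
        for src, dst, produced, consumed in arcs:
            if q[src] is not None and q[dst] is None:
                q[dst] = q[src] * produced / consumed
                changed = True
            elif q[dst] is not None and q[src] is None:
                q[src] = q[dst] * consumed / produced
                changed = True
    if any(f is None for f in q):
        raise ValueError("graph must be connected")
    # every arc must balance; otherwise rank(Gamma) = s and no q exists
    for src, dst, produced, consumed in arcs:
        if q[src] * produced != q[dst] * consumed:
            return None
    scale = lcm(*(f.denominator for f in q))   # clear denominators
    return [int(f * scale) for f in q]

# Rates mimicking Fig. 6: block 1 feeds block 2 one-for-one, and block 2
# produces two samples for each one block 3 consumes per firing.
print(repetitions(3, [(0, 1, 1, 1), (1, 2, 2, 1)]))   # [1, 1, 2]
```

The result [1, 1, 2] matches the Fig. 6 discussion: invoke node 1 once, node 2 once, and node 3 twice per period. (`math.lcm` requires Python 3.9 or later.)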

Proof: To schedule node x, each input buffer must have sufficient data. The size of each input buffer j at time i is given by [b(i)]_j, the jth entry in the vector b(i). From (3) we can write

b(i) = b(0) + Γq(i), where q(i) = Σ_{n=0}^{i-1} v(n).

    The vector q ( i ) only contains information about how manytimes each node has been invoked before iteration i Thebuffer sizes [ b i ) ] , clearly depend only on [b(O) ] , and[r q i )] , . Thejth row of I has only two entries. correspond-ing to the two nodes connected to the j t h buffer, so only thetwo corresponding entries of the 4 i ) vector can affect thebuffer size. T hese entries sp ecify the number of times x and itspredecessors have been invoked. so this information and theinitial buffer sizes [b O)], s all that is needed. Q.E.D.Definition 3: (Class S algorithms) Given a positive integervector 4 s. t . l 4 = 0 nd an initial state for the buffers b O),the ith node is runnable at a given time if it has not been run qltimes and running it will not cause a buffer size to go negative.A cluss S algorithm is any algorithm that schedules a node if itis runnable, updates B n)and stops rerminates)only when nomore nodes are runnable. If a class S algorithms terminatesbefore it has scheduled each node the number of timesspecified in the 4 vector, then it is said to be deadlocked.Class S algorithms ( S for Sequential) construct staticschedules by simulating the effects on the buffers o f an actualrun. That is, the node programs are not actually run. But theycould be run, and the algorithm would not change in anysignificant way. Therefore, any dynamic (runtime) schedulingalgorithm becomes a class S algorithm simply by specifying astopping condition, which depends on the vector 4 It isnecessary to prove that the stopping condition is sufficient toconstruct a PASS for any valid graph.Theorem3: Given a S D F graph with topology matrix I andgiven a positive integer vector s.t. 
Γq = 0, if a PASS of period p = 1ᵀq exists, where 1ᵀ is a row vector full of ones, any class S algorithm will find such a PASS.

Proof: It is sufficient to prove that if a PASS φ of any period p exists, a class S algorithm will not deadlock before the termination condition is satisfied.

Assume that a PASS φ of period p exists, and define φ(n) to be its first n entries, for any n such that 1 ≤ n ≤ p. Assume a given class S algorithm iteratively constructs a schedule, and define χ(n) to be the list of the first n nodes scheduled by iteration n. We need to show that as n increases, the algorithm will build χ(n) and not deadlock before n = p, when the termination condition is satisfied. That is, we need to show that for all n ∈ {1, ..., p}, there is a node that is runnable for any χ(n) that the algorithm may have constructed.

If χ(n) is any permutation of φ(n), then the (n + 1)th entry in φ is runnable by Lemma 5, because all necessary predecessors must be in φ(n), and thus in χ(n). Otherwise, the first node α in φ(n) and not in χ(n) is runnable, also by Lemma 5. This is true for all n ∈ {1, ..., p}, so the algorithm will not deadlock before n = p. At n = p, each node i has been scheduled q_i times, because no node can be scheduled more than q_i times (by Definition 3) and p = 1ᵀq. Therefore, the termination condition is satisfied, and χ(p) is a PASS. Q.E.D.
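The nullspace-vector construction described above (set the first entry of u' to 1, keep every entry in lowest terms, and scale by the least common multiple of the denominators) is straightforward with exact rational arithmetic. The sketch below assumes the topology-matrix convention of the text (one row per arc, a positive entry for the samples produced and a negative entry for the samples consumed per invocation) and a connected graph whose arcs each join two distinct nodes; the example matrix is hypothetical, not taken from the paper's figures.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetitions(gamma):
    """Smallest positive integer vector q with (gamma)q = 0, assuming
    rank(gamma) = s - 1.  Fix u[0] = 1, propagate exact rationals along
    the arcs, then scale by the LCM of the denominators."""
    s = len(gamma[0])
    u = [None] * s
    u[0] = Fraction(1)
    changed = True
    while changed:
        changed = False
        for row in gamma:
            nz = [j for j, x in enumerate(row) if x != 0]
            if len(nz) != 2:        # skip self-loop rows (one net entry)
                continue
            j, k = nz
            if u[j] is not None and u[k] is None:
                u[k] = -u[j] * row[j] / row[k]
                changed = True
            elif u[k] is not None and u[j] is None:
                u[j] = -u[k] * row[k] / row[j]
                changed = True
    # Fraction keeps each entry in lowest terms; scale by the LCM of
    # the denominators to get the smallest positive integer vector.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in u))
    q = [int(x * lcm) for x in u]
    assert all(sum(c * v for c, v in zip(row, q)) == 0 for row in gamma)
    return q

# Hypothetical three-node graph: node 0 feeds nodes 1 and 2 (produce 1,
# consume 2), and node 1 feeds node 2 (produce 2, consume 2).
GAMMA = [[1, -2, 0],
         [1, 0, -2],
         [0, 2, -2]]
```

For GAMMA this returns [2, 1, 1]: node 0 must be invoked twice per period, the other nodes once.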

Fig. 7. Two SDF graphs with consistent sample rates but no admissible schedule.

Theorem 3 tells us that if we are given a positive integer vector q in the nullspace of the topology matrix, then class S algorithms will find a PASS with its period equal to the sum of the elements in the vector, if such a PASS exists. It is possible, even if rank(Γ) = s - 1, for no PASS to exist. Two such graphs are shown in Fig. 7. Networks with insufficient delays in directed loops are not computable.

One problem remains. There are an infinite number of vectors in the nullspace of the topology matrix. How do we select one to use in the class S algorithm? We now set out to prove that, given any positive integer vector in the nullspace of the topology matrix, if a class S algorithm fails to find a PASS, then no PASS of any period exists.

Lemma 6: Connecting one more node to a graph increases the rank of the topology matrix by at least one.

The proof of this lemma follows the same kinds of arguments as the proof of Lemma 2. Rows are added to the topology matrix to describe the added connections to the new node, and these rows must be linearly independent of rows already in the topology matrix.

Lemma 7: For any connected SDF graph with s nodes and topology matrix Γ, a connected subgraph L with m nodes has a topology matrix Γ_L for which

    rank(Γ) = s - 1  implies  rank(Γ_L) = m - 1;

i.e., all subgraphs have the right rank.

Proof: By contraposition. We prove that

    rank(Γ_L) ≠ m - 1  implies  rank(Γ) ≠ s - 1.

From the corollary to Lemma 3, if rank(Γ_L) ≠ m - 1 then rank(Γ_L) = m. Then rank(Γ) ≥ m + (s - m) = s, by repeated applications of Lemma 6, so rank(Γ) = s. Q.E.D.

The next lemma shows that, given a nullspace vector q, in order to run any node the number of times specified by this vector, it is not necessary to run any other node more than the number of times specified by the vector.

Lemma 8: Consider the subgraph of an SDF graph formed by any node α and all its immediate predecessors (nodes that feed it data, which may include α itself). Construct a topology matrix Γ for this subgraph. If the original graph has a PASS, then by Theorem 1 and Lemma 7, rank(Γ) = m - 1, where m is the number of nodes in the subgraph. Find any positive integer vector q s.t. Γq = 0. Such a vector exists because of Theorem 2. Then it is never necessary to run any predecessor β more than q_β times in order to run α x times, for any x ≤ q_α.

Proof: From the definition of Γ and q we know that b·q_β = a·q_α, where b and a are the amounts of data produced and consumed, respectively, on the link from β to α. Therefore, running β only q_β times generates enough data on the link to run α q_α times. More runs will not help. Q.E.D.

Theorem 4: Given an SDF graph with topology matrix Γ and a positive integer vector q s.t. Γq = 0, a PASS of period p = 1ᵀq exists if and only if a PASS of period Np exists for any integer N.

Proof: Part 1: It is trivial to prove that the existence of a PASS of period p implies the existence of a PASS of period Np, because the first PASS can be composed N times to produce the second PASS.

Part 2: We now prove that the existence of a PASS ψ of period Np implies the existence of a PASS of period p. Consider the subset θ of ψ containing the first q_α runs of each node α. If θ is the first p elements of ψ, then it is a schedule of period p and we are done. If it is not, then there must be some node β that is executed more than q_β times before all nodes have been executed q times. But by Lemma 8, these "more than q" executions of β cannot be necessary for the later "less than or equal to q" executions of other nodes. Therefore, the "less than or equal to q" executions can be moved up in the list ψ so that they precede all "more than q" executions of β, yielding a new PASS of period Np. If this process is repeated until all "less than or equal to q" executions precede all "more than q" executions, then the first p elements of the resulting schedule will constitute a schedule of period p. Q.E.D.

Corollary: Given any positive integer vector q ∈ η(Γ), the nullspace of Γ, a PASS of period p = 1ᵀq exists if and only if a PASS of period r = 1ᵀu exists for any other positive integer vector u ∈ η(Γ).

Proof: For any PASS at all to exist, it is necessary that rank(Γ) = s - 1, by Theorem 1. So the nullspace of Γ has dimension one, and we can find a scalar c such that

    q = cu.

Furthermore, if both of these vectors are integer vectors, then c is rational and we can write

    c = n/d,

where n and d are both integers. Therefore,

    dq = nu.

By Theorem 4, a PASS of period p = 1ᵀq exists if and only if a PASS of period dp = 1ᵀ(dq) exists. By Theorem 4 again, a PASS of period dp exists if and only if a PASS of period r = 1ᵀu exists. Q.E.D.

    Discussion: The four theorems and their corollaries havegreat practical importance. We have specified a very broad


class of algorithms, designated class S algorithms, which, given a positive integer vector q in the nullspace of the topology matrix, find a PASS with period equal to the sum of the elements in q. Theorem 3 guarantees that these algorithms will find a PASS if one exists. Theorems 1 and 2 guarantee that such a vector q exists if a PASS exists. The corollary to Theorem 4 tells us that it does not matter what positive integer vector we use from the nullspace of the topology matrix, so we can simplify our system by using the smallest such vector, thus obtaining a PASS with minimum period.

Given these theorems, we now give a simple sequential scheduling algorithm that is of class S, and therefore will find a PASS if one exists.

1) Solve for the smallest positive integer vector q ∈ η(Γ).
2) Form an arbitrarily ordered list L of all nodes in the system.
3) For each α ∈ L, schedule α if it is runnable, trying each node once.
4) If each node α has been scheduled q_α times, STOP.
5) If no node in L can be scheduled, indicate a deadlock (an error in the graph).
6) Else, go to 3) and repeat.

Theorem 3 tells us that this algorithm will not deadlock if a PASS exists. Two SDF graphs which cause deadlock and have no PASS are shown in Fig. 7.

Since the runtime is the same for any PASS (the one machine available is always busy), no algorithm will produce a better runtime than this one. However, class S algorithms exist which construct schedules minimizing the memory required to buffer data between nodes. Using dynamic programming or integer programming, such algorithms are easily constructed.
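A minimal implementation of a class S algorithm along these lines, simulating only the buffer states (the node programs themselves are never run), might look as follows. The graph encoding is an assumption for illustration: gamma[j][i] is the gain of buffer j per invocation of node i, positive for production and negative for consumption.

```python
def class_s_schedule(gamma, q, b0):
    """Steps 1)-6) above: sweep an arbitrarily ordered node list,
    scheduling any runnable node, until every node i has been scheduled
    q[i] times, or no node can be scheduled (deadlock).  b0 is the
    initial buffer state (the delays)."""
    b, runs, schedule = list(b0), [0] * len(q), []
    while sum(runs) < sum(q):
        progressed = False
        for i in range(len(q)):                  # step 3): try each node once
            if runs[i] >= q[i]:
                continue
            nb = [bj + row[i] for bj, row in zip(b, gamma)]
            if all(x >= 0 for x in nb):          # runnable: no buffer goes negative
                b, runs[i] = nb, runs[i] + 1
                schedule.append(i)
                progressed = True
        if not progressed:                       # step 5): deadlock
            raise RuntimeError("deadlock: no PASS exists")
    return schedule

# Hypothetical three-node graph (node 0 runs twice per period):
GAMMA_EX = [[1, -2, 0], [1, 0, -2], [0, 2, -2]]
# class_s_schedule(GAMMA_EX, [2, 1, 1], [0, 0, 0]) -> [0, 0, 1, 2]
```

A directed loop without enough delay, as in the graphs of Fig. 7, makes every node unrunnable at some sweep and raises the deadlock error instead of returning a schedule.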

A large grain data flow programming methodology offers concrete advantages for single processor implementations. The ability to interconnect modular blocks of code in a natural way could considerably ease the task of programming high-performance signal processors, even if the blocks of code themselves are programmed in assembly language. The gain is somewhat analogous to that experienced in VLSI design through the use of standard cells. For synchronous systems, the penalty in runtime overhead is minimal. But a single processor implementation cannot take advantage of the concurrency in an LGDF description. The remainder of this paper is dedicated to explaining how the concurrency in the description can be used to improve the throughput of a multiprocessor implementation.

B. Constructing a PAPS

Clearly, if a workable schedule for a single processor can be generated, then a workable schedule for a multiprocessor system can also be generated: trivially, all the computation could be scheduled onto only one of the processors. However, in general, the runtime can be reduced substantially by distributing the load more evenly. We show in this section how the multiprocessor scheduling problem can be reduced to a familiar problem in operations research for which good heuristic methods are available.

We assume a tightly coupled parallel architecture, so that

communication costs are not the overriding concern. Furthermore, we assume homogeneity: all processors are the same, so they process a node in an SDF graph in the same amount of time. It is not necessary that the processors be synchronous, although the implementation will be simpler if they are.

A periodic admissible parallel schedule (PAPS) is a set of lists {ψ_i; i = 1, ..., M}, where M is the number of processors, and ψ_i specifies a periodic schedule for processor i. If φ is the corresponding PASS with the smallest possible period P_φ, then it follows that the total number P_ψ of block invocations in the PAPS should be some integer multiple J of P_φ. We could, of course, choose J = 1, but as we will show below, schedules that run faster might result if a larger J is used. If the best integer J is known, then construction of a good PAPS is not too hard.

For a sequential schedule, precedences are enforced by the schedule. For a multiprocessor schedule, the situation is not so simple. We will assume that some method enforces the integrity of the parallel schedules. That is, if a schedule on a given processor dictates that a node should be invoked, but there is no input data for that node, then the processor halts until these input data are available. The task of the scheduler is thus to construct a PAPS that minimizes the runtime for one period of the PAPS divided by J, and avoids deadlocks. The mechanism to enforce the integrity of the communication between blocks on different processors could use semaphores in shared memory or simple instruction-count synchronization, where no-ops are executed as necessary to maintain synchronicity among processors, depending on the multiprocessor architecture.
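One way to picture the semaphore-style enforcement just described: each block invocation signals a flag when it finishes, and a processor blocks before any invocation whose predecessor invocations have not yet signaled. The sketch below uses Python threads and events purely for illustration; the function name and the precedence encoding are invented, not from the paper.

```python
import threading

def run_paps(proc_lists, preds):
    """Run one period of a PAPS.  proc_lists: processor -> ordered list
    of invocation ids; preds: invocation -> set of invocations whose
    output data it consumes.  A processor halts until all inputs of its
    next invocation are available, as described in the text."""
    done = {v: threading.Event()
            for lst in proc_lists.values() for v in lst}
    order, lock = [], threading.Lock()

    def worker(lst):
        for v in lst:
            for u in preds.get(v, ()):
                done[u].wait()          # block: input data not yet produced
            with lock:
                order.append(v)         # stand-in for running the block
            done[v].set()               # signal: output data now available

    threads = [threading.Thread(target=worker, args=(lst,))
               for lst in proc_lists.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

With two processors and an invocation '2' that consumes the outputs of two invocations '1a' and '1b' on the other processor, run_paps({1: ['3'], 2: ['1a', '1b', '2']}, {'2': {'1a', '1b'}}) always places '2' after '1a' and '1b', whatever the interleaving with '3'.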

The first step is to construct an acyclic precedence graph for J periods of the PASS φ. A precise (class S) algorithm will be given for this procedure below, but we start by illustrating it with the example in Fig. 8. The SDF graph in Fig. 8 is neither acyclic nor a precedence graph. Examination of the number of inputs consumed and outputs produced for each block reveals that block 1 should be invoked twice as often as the other two blocks. Further, given the delays on two of the arcs, we note that there are several possible minimum period PASSs, e.g., φ₁ = {1, 3, 1, 2}, φ₂ = {3, 1, 1, 2}, or φ₃ = {1, 1, 3, 2}, each with period P_φ = 4. A schedule that is not a PASS is {2, 1, 3, 1}, because node 2 is not immediately runnable. Fig. 9(a) shows the precedences involved in all three schedules. Fig. 9(b) shows the precedences involved in two repetitions of these schedules (J = 2).

If we have two processors available, a PAPS for J = 1 (Fig. 9(a)) is

    ψ₁ = {3},  ψ₂ = {1, 1, 2}.

When this system starts up, blocks 3 and 1 will run concurrently. The precise timing of the run depends on the runtimes of the blocks. If we assume that the runtime of block 1 is a single time unit, the runtime of block 2 is 2 time units, and the runtime of block 3 is 3 time units, then the timing is shown in Fig. 10(a). We assume for now that the entire system is resynchronized after each execution of one period of the PAPS.
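The timing just described can be estimated with a small simulation: each processor executes its list in order, and an invocation starts only when its processor is free and all of its predecessor invocations have finished. The precedence set below is an assumption in the spirit of Fig. 9(a), which is not reproduced in this transcript: invocation '2' is taken to depend on both runs of block 1, and '3' on nothing (its inputs come from delays).

```python
def paps_runtime(proc_lists, runtime, preds):
    """Completion time of one period of a PAPS on homogeneous
    processors.  proc_lists: processor -> ordered invocation list;
    runtime: invocation -> execution time; preds: invocation -> set of
    invocations that must finish first (the acyclic precedence graph)."""
    finish = {}
    pending = {p: list(lst) for p, lst in proc_lists.items()}
    free = {p: 0 for p in proc_lists}      # time each processor next idles
    while any(pending.values()):
        progressed = False
        for p, lst in pending.items():
            if lst and all(u in finish for u in preds.get(lst[0], ())):
                v = lst.pop(0)
                start = max([free[p]] + [finish[u] for u in preds.get(v, ())])
                finish[v] = start + runtime[v]
                free[p] = finish[v]
                progressed = True
        if not progressed:
            raise RuntimeError("parallel schedule deadlocks")
    return max(finish.values())

RUNTIME = {'1a': 1, '1b': 1, '2': 2, '3': 3}
PREDS = {'2': {'1a', '1b'}}   # assumed precedences, in the spirit of Fig. 9(a)
```

Under these assumptions the J = 1 PAPS ψ₁ = {3}, ψ₂ = {1, 1, 2} finishes in 4 time units per period, while putting all four invocations on one processor takes 7, illustrating the benefit of distributing the load.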


Fig. 8. An example.

Fig. 9. Acyclic precedence graphs for (a) a minimum period (J = 1) and (b) a double period (J = 2) schedule.

Fig. 10. Two schedules generated from the acyclic precedence graphs of Fig. 9.

A PAPS constructed for J = 2, using the precedence graph of Fig. 9(b), will, however, perform better. Such a PAPS is given by

    ψ₁ = {3, 1, 3},  ψ₂ = {1, 1, 2, 1, 2},

and its timing is shown in Fig. 10(b). Since both processors are kept always busy, this schedule is better than the J = 1 schedule, and no better schedule exists.

Fig. 11. The two acyclic precedence graphs of Fig. 9 with the levels indicated.

The problem of constructing a parallel schedule given an acyclic precedence graph is a familiar one. It is identical to assembly line problems in operations research, and can be solved for the optimal schedule, but the solution is combinatorial in complexity. This may not be a problem for small SDF graphs, and for large ones we can use well-studied heuristic methods, the best being members of a family of "critical path" methods [38]. An early example, known as the Hu-level scheduling algorithm [39], closely approximates an optimal solution for most graphs [40], [38], and is simple. To implement this method, a level is determined for each node in the acyclic precedence graph, where the level of a given node is the worst case of the total of the runtimes of nodes on a path from the given node to the terminal node of the graph. The terminal node is a node with no successors. If there is no unique terminal node, one can be created with zero runtime; this node is then considered a successor to all nodes that otherwise have no successors. Fig. 11(a) shows the levels for the J = 1 precedence graph and Fig. 11(b) shows them for the J = 2 precedence graph, for the example of Fig. 8. Finally, the Hu-level scheduling algorithm simply schedules available nodes with the highest level first. When there are more available nodes with the same highest level than there are processors, a reasonable heuristic is to schedule the ones with the longest runtime first. Such an algorithm produces the schedules shown in Fig. 10, the optimal schedules for the given precedence graphs.

We now give a class S algorithm that systematically constructs an acyclic precedence graph. First we need to understand how we can determine when the execution of a particular node is necessary for the invocation of another node. Consider an SDF graph with a single arc σ connecting node β to node α. Assume this arc is part of an SDF graph with topology matrix Γ. The number of samples required to run α j times is -jΓ_σα, where Γ_σα is the entry in the topology matrix corresponding to the connection between arc σ and the node α. Of these samples, δ_σ are provided as initial conditions. If δ_σ ≥ -jΓ_σα, then there is no dependence of the jth run of α on β.
Otherwise, the number of samples required of β is -jΓ_σα - δ_σ. Each run of β produces Γ_σβ samples. Therefore, the jth

  • 8/13/2019 Synchronous Data Flow Algorithms

    10/12

    33EE ANDMESSERSCXMITT: STATIC SCHEDULINGOF SYNCHRONOUS DATA FLOW PROGRAMS

    t

    i

run of α depends on the first d runs of β, where

    d = ⌈(-jΓ_σα - δ_σ)/Γ_σβ⌉,    (7)

and where the notation ⌈·⌉ indicates the ceiling function.

Now we give a precise algorithm. We assume that we are given the smallest integer vector q in the nullspace of Γ and the multiple J, so that we wish to construct an acyclic precedence graph with the number of repetitions of each node given by Jq. We will discuss later how we get J. Each time we add a node to the graph we will increment a counter i, update the buffer state b(i), and update the vector q(i) defined in (6). This latter vector indicates how many instances of each node have been put into the precedence graph. We let L designate an arbitrarily ordered list of all nodes in the graph.

INITIALIZATION:
    i = 0;
    q(0) = 0;

THE MAIN BODY:
    while nodes are runnable {
        for each α ∈ L {
            if α is runnable then {
                create the (q_α(i) + 1)th instance of the node α;
                for each input arc σ on α {
                    let β be the predecessor node for arc σ;
                    compute d using (7);
                    if d
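The transcript of the pseudocode is cut off above, but the heart of the construction is equation (7): the jth instance of a node depends on the first d instances of each predecessor. A sketch of how d yields the edges of the acyclic precedence graph for J periods follows; it uses an arc list rather than the matrix Γ, and all names are invented for illustration.

```python
def precedence_edges(arcs, q, J):
    """Edges of the acyclic precedence graph for J periods.
    arcs: list of (beta, alpha, p, c, delta), meaning beta produces p
    samples per run, alpha consumes c per run, and the arc carries
    delta initial samples (delays).  q is the repetition vector.
    Returns edges ((beta, k), (alpha, j)) between invocation instances,
    numbered from 1 within the J periods."""
    edges = set()
    for beta, alpha, p, c, delta in arcs:
        for j in range(1, J * q[alpha] + 1):
            need = j * c - delta        # i.e. -j*gamma_sa - delta_s
            if need <= 0:
                continue                # delays alone feed the j-th run
            d = -(-need // p)           # equation (7): integer ceiling
            for k in range(1, min(d, J * q[beta]) + 1):
                edges.add(((beta, k), (alpha, j)))
    return edges
```

For a single arc on which node 0 produces one sample per run and node 1 consumes two, with no delay, the first run of node 1 depends on the first two runs of node 0; with two delay samples it depends on neither, matching the d ≤ 0 case above.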


programmer convenience without squandering computation resources. Programs are described as block diagrams where connections between blocks indicate the flow of data samples, and the function of each block can be specified using a conventional programming language. Blocks are executed whenever input data samples are available. Such a description is called large grain data flow (LGDF). The advantages of such a description are numerous. First, it is a natural way to describe signal processing systems, where the blocks are second order recursive digital filters, FFT butterfly operators, adaptive filters, and so on. Second, such a description exhibits much of the available concurrency in a signal processing algorithm, making multiple processor implementations easier to achieve. Third, program blocks are modular, and may be reused in new system designs. Program blocks are viewed as black boxes with input and output data streams, so reusing a program block simply means reconnecting it in a new system. Fourth, multiple sample rates are easily described under the programming paradigm.

We describe high-efficiency techniques for converting a large grain data flow description of a signal processing system into a set of ordinary sequential programs that run on parallel machines (or, as a special case, a single machine). This conversion is accomplished by a large grain compiler, so called because it does not translate a high-level language into a low-level language, but rather assembles pieces of code (written in any language) for sequential or parallel execution. Most DSP systems are synchronous, meaning that the sample rate of any given data path, relative to other data paths, is known at compile time. Large grain data flow graphs with such sample rate information are called synchronous data flow graphs. Given sample rate information, techniques are given (and proven valid) for constructing sequential or parallel schedules that will execute deterministically, without the runtime overhead generally associated with data flow. For the multiprocessor case, the problem of constructing a schedule that executes with maximum throughput is shown to be equivalent to a standard operations research problem with well-studied heuristic solutions that closely approximate the optimum. Given these techniques, the benefits of large grain data flow programming can be extended to those signal processing applications where performance demands are so severe that little inefficiency for the sake of programmer convenience can be tolerated.

ACKNOWLEDGMENT

The authors gratefully acknowledge helpful suggestions from the anonymous reviewers, R. Rathbone, and R. Righter.


    REFERENCES111 Signa l Processing Per i ph ero l , data sheet for the S282 11, A MI, Inc.[21 T M s f Z O f O Use r ' s Gu i de , Texas Instruments, Inc., Dallas, T X , 1983.[31 T. suda, et u l . , A high-prfonnnnce LSI digital signal processor forcommunication, in Proc. ZEEE Zni. ConJ C ommu n . . June 19,[41 Dig i t a l S igna Processo r , d t sheet for the uPD7720 signal processorinterface (SPl) , NEC Electronics U.S.A. Inc.151 S. Magar, D. Essig, E. Caudel, S. Marshall, and R . Peters, AnNMOS digital signal processor with multiprocessing capability,

    ISSCC 85 Dig . Tech . Pope rs , Feb. 13, 1985.R . C.Chapman,Ed. Digital signal processor, Bel l Syst . T ech. J . ,

    1983.

    I61

    ? - . . . , * - .. . a

    IEEE TRANSACTlONS ON COMPUTERS. OL. C-36. NO. I . JANUARY 1987vol. 60.Scpt. 1981.R.N. Kershaw, et d.,A programmable digital signal proces sor with32b floating point arithmetic. ISSCC 5 Dig . Tech. Papers, Feb. 13,1985.M. Chase, A pipelined data flow lrchiteaure for signal processing:The NEC uPD7261, in VLS s i gna / P rOce s i n g . New York: IEEEPress, 1984.P . M. Kogge. Th e A r c h i t e c t u n of P i p el i n e d Comp u t e r s . NewYork: McGraw-Hill, 1981.D. . Paul, J. A. Feldman, and V. . Sfemn o. A design study for anw i l y programmable, high-speed processor with a general-purposearchitecture, Lincoln Lab, Massachusetts Inst. Tcchnol.. Cam bridge,MA, Tech. Note 1980-50, 1980.W. . Ackennan, Data flow languages. Compu te r , vol. 15. Feh.1982.J . 9 Dennis. Data flow supercomputers. c o m p u t e r . vol. 13. Nov.1980.1. Watson and J Gurd, A practical data flow computer, Compu te r .vol. 15, Feb. 1982.A. L. Davis, The architecture and system method of DDMI: Arecursively structured data driven machine, in Proc. Fifrh A n n .S ymp . Comp u t . A r c h i t e c t . . Apr. 1978. pp. 210-215.J. Rumbaugh, A data flow multiprocessor. IEEE T r an s . Co r n p u t . .vol. C-26, p. 138. Feb. 1977.R. G. Babb, Parallel processing with large grain data flow tech-niques,'. Compu te r . vol. 17, July, 1984.D. G . Messerschmitt, A tool for structured functional simulation,IEEE J. Selec t . A reas Comm un . . vol. SAC-2, Jan. 1984.L. Snyder, Parallel programming and the poker programmingenvironment.'' Comp u t e r , vol. 17, July 1984.Kelly, Lochbaum, and Vyssotsky. A block diagram compiler. BellSyst. Tech. J . . vol. 40,May 1%1.B.Gold and C. Rader. Dig i t a l Process ing of Signa ls. New York:McGraw-Hill, 1969.B. Karafin, The new block diagram compiler for simulation ofsampleddata systems,'' in A F f P S Co n f . P r o c . . vol. 27 1%5, pp.M. ertmzous. M. aliske. and K. Polzen. On-line simulation o fblock-diagram systems. fEEE T r an s . Compu t . . vol. 
C-18, Apr.1 9.G . Korn, High-speed block-diagram languages for microprocessorsand minicomputers in instrumentation, control. and simulation.Co r npu t . Elec. Eng . . vol. 4 pp. 143-159. 1977.W . Henke, MITSYN-A n interactive dialogue language for timesignal processin g. Res. Lab . Electronics Massachusetts Inst. Tech-nol., Cambridge, MA. RLE-TM-I, Feb. 1975.T. Crystal and L.Kulsmd, Circus, Inst. Defense Anal.. Princeton.NJ, CRD working paper, Dec. 1974.TOPS IM I f f -S i mu l a t i o n Pa c k a g e o r Commun i c a t i o n Sysmns.user s manual; Rep. Elec. Ens., Politencia di Torino. Italy.G . Kopec, The representation of discrete-time signals and systems inprogram, B . D . dissertation, Massachusetts Inst. Technol.. Cam-bridge, MA. May 1980.C S. Jhon, G . E . Sobelman. and D. E. Krekelberg. Siliconcompilation bascd on a data-flow paradigm. IEEE Ci r c . Devi ces ,voi. 1. May 1985.R . M. Karp and R. E. Miller, Properties of a model for parallelcomputations: llcterminacy. termination. queueing. H A M J.. vol.R. Reiter. A study of a model for parallel computations. Ph.D .dissenetion. Univ. Michigan. Ann Arbor. 1967.J . L. Peterson. Petri nets, Co r npu t . S um . . vol. 9. Sept. 1977.e r r i Net Theo ry ond th e M o d c l i n g of Systems. EnglewoodCliffs, NJ: Prcntice-Hall. 1981.T. Agerwala, Putting Pcm nets to work.'' Compu te r . p. 85, Dec1979.R . M. Karp and R . E. Miller. Parallel program schemata. J .Com pu t . S-vst . S c i . . vol. 3. no. 2. pp. 147-195. 1969.F. Commoner and A. W. Holt, Marked directed graphs. J .Comp u t . Sysr S c i . . vol. 5, pp. 511-523. 1971.R . Reiter. Scheduling parallel computations. J. Ass . Co rnpu t .M a c h . , vol. 14 pp. 590-599. 1968.R . E. Blahut, Fmt A l g o r i t hm s for Dig i ta l S igna l P rocess ing .Reading, MA: Addison-Wesley. 1985.T.L. Adam. K. M. Chandy, and 1 R . Dickson. A comparisonof istschedules for parallel processin g systems. C ommu n . ASS.Co r npu t .M a c h . . vol. 17. pp. 685-690. Dec.. 1974.

    55-61.

    14, pp. 1390-1411, NOV.1966.

[39] T. C. Hu, "Parallel sequencing and assembly line problems," Operat. Res., vol. 9, pp. 841-848, 1961.
[40] W. H. Kohler, "A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems," IEEE Trans. Comput., vol. C-25, pp. 1235-1238, Dec. 1975.

Edward Ashford Lee (S'80-M'86) received the B.S. degree from Yale University, New Haven, CT, in 1979, the S.M. degree from the Massachusetts Institute of Technology, Cambridge, in 1981, and the Ph.D. degree from the University of California, Berkeley, in 1986.

From 1980 to 1982 he was with Bell Laboratories as a member of the Technical Staff of the Data Communications Laboratory, where he did exploratory work in voiceband data modem techniques and simultaneous voice and data transmission. Since July 1986 he has been an Assistant Professor in the Department of Electrical Engineering and Computer Science, University of California, Berkeley. His research interests include architectures and software techniques for programmable digital signal processors, parallel processing, real-time software, computer-aided engineering for signal processing and communications, digital communications, and quantization. He has taught a short course at the University of California, Santa Barbara, on telecommunications applications of programmable digital signal processors, has consulted in industry, and holds one patent.

Dr. Lee was the recipient of IBM and GE Fellowships and the Samuel Silver Memorial Scholarship Award. He is a member of Tau Beta Pi.

David G. Messerschmitt (S'65-M'68-SM'78-F'83) received the B.S. degree from the University of Colorado, Boulder, in 1967, and the M.S. and Ph.D. degrees from the University of Michigan, Ann Arbor, in 1968 and 1971, respectively.

He is a Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley. From 1968 to 1977 he was a Member of the Technical Staff and later Supervisor at Bell Laboratories, Holmdel, NJ, where he did systems engineering, development, and research on digital transmission and digital signal processing (particularly relating to speech processing). His current research interests include analog and digital signal processing, adaptive filtering, digital communications (on the subscriber loop and fiber optics), architecture and software approaches to programmable digital signal processing, communication network design and protocols, and computer-aided design of communications and signal processing systems. He has published over 70 papers and has 10 patents issued or pending in these fields. Since 1977 he has also served as a consultant to a number of companies. He has organized and participated in a number of short courses and seminars devoted to continuing engineering education.

Dr. Messerschmitt is a member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi, and has several best paper awards. He is currently a Senior Editor of IEEE Communications Magazine, and is past Editor for Transmission Systems of the IEEE TRANSACTIONS ON COMMUNICATIONS and past member of the Board of Governors of the IEEE Communications Society.

