
An Asynchronous Dataflow FPGA Architecture

John Teifel, Student Member, IEEE, and Rajit Manohar, Member, IEEE

Abstract—We discuss the design of a high-performance field programmable gate array (FPGA) architecture that efficiently prototypes asynchronous (clockless) logic. In this FPGA architecture, low-level application logic is described using asynchronous dataflow functions that obey a token-based compute model. We implement these dataflow functions using finely pipelined asynchronous circuits that achieve high computation rates. This asynchronous dataflow FPGA architecture maintains most of the performance benefits of a custom asynchronous design, while also providing postfabrication logic reconfigurability. We report results for two asynchronous dataflow FPGA designs that operate at up to 400 MHz in a typical TSMC 0.25 µm CMOS process.

Index Terms—Asynchronous/synchronous operation, dataflow architectures, gate arrays, reconfigurable hardware.

1 INTRODUCTION

We present an asynchronous dataflow FPGA architecture for implementing high-performance asynchronous logic. Asynchronous design methodologies seek to address the design complexity, energy consumption, and timing issues affecting modern VLSI design [13]. Since most experimental high-performance asynchronous designs (e.g., [2], [16]) have been designed with labor-intensive custom layout, we propose asynchronous dataflow FPGAs as an alternative method for prototyping these asynchronous systems.

Asynchronous dataflow FPGA architectures use explicit message-passing channels to communicate data values between computation logic blocks. In these FPGA architectures, logic computations are not synchronized to a global clock signal and, hence, all logic computations proceed concurrently, with the message-passing channels enforcing the synchronization necessitated by data dependencies between computations. In contrast to previously proposed asynchronous FPGA architectures (e.g., [5], [17]), which ported clocked FPGA architectures to asynchronous circuit implementations, the asynchronous FPGA architecture described in this paper was specifically designed to efficiently prototype asynchronous dataflow computations.

While recent work in designing high-performance pipelined FPGAs [18], [24], [25] has focused exclusively on clocked FPGAs, our work investigates pipelined FPGAs built from pipelined asynchronous circuits. These asynchronous circuits were inspired by high-performance, full-custom asynchronous designs [2], [16] that use very fine-grain pipelines. Each pipeline stage contains only a small amount of logic (e.g., a 1-bit full adder) and combines computation with data latching such that explicit output latches are absent from the pipeline. We chose to implement our asynchronous dataflow FPGA architecture using fine-grain asynchronous pipelines because they achieve high computation rates and naturally support the dataflow computation model.

An asynchronous dataflow FPGA architecture distinguishes itself from a clocked FPGA architecture on the following criteria:

- Ease of pipelining: Asynchronous pipelines enable the design of high-throughput logic cores that are easily composable and reusable, where asynchronous pipeline handshakes enforce correctness instead of circuit delays or pipeline depths as in clocked pipelines.

- Event-driven energy consumption: Asynchronous logic implements perfect "clock gating" by automatically turning off unused circuits, since the parts of an asynchronous circuit that do not contribute to the computation being performed have no switching activity.

- Robustness: Asynchronous circuits automatically adapt to delay variations resulting from temperature fluctuations, supply voltage changes, and the imperfect physical manufacturing of a chip, all of which are increasingly difficult to control in deep submicron technologies.

The remaining sections of this paper are organized as follows: We review related asynchronous FPGA work in Section 2 and summarize the salient properties of asynchronous logic and dataflow computations in Section 3. In Section 4, we present an overview of our asynchronous dataflow FPGA architecture and, in Section 5, we discuss the implementation details for this FPGA architecture. We analyze the circuit and pipeline performance of dataflow FPGA architectures in Section 6 and discuss logic synthesis and FPGA benchmark results in Section 7.

2 RELATED WORK

Existing asynchronous FPGA architectures [5], [9], [12], [17] have been based largely on programmable clocked circuits. These FPGAs are limited to low-throughput logic applications because their asynchronous pipeline stages are either built up from gate-level programmable cells (e.g., [5]) or use bundled-data pipelines that rely on interconnects controlled by delay lines (e.g., [17]). For example, a fabricated asynchronous FPGA chip using bundled-data pipelines operated at a maximum of 20 MHz in a 0.35 µm CMOS process [9]. In contrast, our asynchronous dataflow FPGA architecture is configured at the pipeline level, uses fully delay-insensitive interconnects, and is based on high-speed asynchronous pipelined circuits. To our knowledge, our pipelined asynchronous FPGA architecture is the first asynchronous FPGA programmable at the pipeline level and represents an order-of-magnitude improvement over previous gate-level asynchronous FPGA designs.

1376 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 11, NOVEMBER 2004

The authors are with the Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853. E-mail: {teifel, rajit}@csl.cornell.edu.

Manuscript received 2 Dec. 2003; revised 8 Apr. 2004; accepted 16 Apr. 2004. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-0257-1203.

0018-9340/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.

The first asynchronous FPGA architectures (e.g., [5]) included programmable arbiters, asynchronous circuits that nondeterministically select between two competing nonsynchronized signals. However, arbiters occur very rarely in slack-elastic asynchronous systems, which include most high-performance asynchronous designs, because they can be used only when they do not break the properties of slack elasticity [14]. For instance, an asynchronous MIPS R3000 microprocessor used only two arbiters in its entire design [16]. For this reason, we restrict our asynchronous FPGA architecture to asynchronous applications that do not use arbiters, although it could be trivially extended to include arbiter circuits.

While it is possible to perform limited prototyping of asynchronous logic on commercial clocked FPGAs, the performance and logic density costs are large. For example, an unpipelined 1-bit asynchronous (quasi-delay-insensitive) full adder requires ten 4-input LUT cells [7] and six additional 4-input LUT cells to implement asynchronous pipelining in a clocked FPGA. Since this adder is inefficiently pipelined and cannot take advantage of built-in FPGA carry chains, it will operate slower and occupy at least 16 times more logic blocks than an equivalent clocked adder.

An alternative method for prototyping clocked logic is to map a clocked netlist onto asynchronous blocks, such that the resulting asynchronous system implements the same logical behavior as if it were a clocked system. While we believe this method is less than ideal, because clocked logic does not behave like asynchronous logic and need not map efficiently to asynchronous circuits, previous work has designed pipelined asynchronous FPGA cells that support this feature [8], [23]. However, these FPGA designs used unpipelined interconnects and did not demonstrate significant performance advantages over clocked FPGAs.

3 ASYNCHRONOUS LOGIC AND DATAFLOW COMPUTATION

3.1 Asynchronous Pipelines

We design asynchronous systems as a collection of concurrent hardware processes that communicate with each other through message-passing channels. These messages consist of atomic data items called tokens. Each process can send and receive tokens to and from its environment through communication ports. Asynchronous pipelines are constructed by connecting these ports to each other using channels, where each channel is allowed only one sender and one receiver.

Since there is no clock in an asynchronous design, processes use handshake protocols to send and receive tokens on channels. Most of the channels in our FPGA use three wires, two data wires and one enable¹ wire, to implement a four-phase handshake protocol (Fig. 1). The data wires encode bits using a dual-rail code² such that setting "wire-0" transmits a "logic-0" and setting "wire-1" transmits a "logic-1". A dual-rail encoding is a specific example of a 1-of-N asynchronous signaling code that uses N wires to encode N values such that setting the nth wire encodes data value n. The four-phase protocol operates as follows: The sender sets one of the data wires, the receiver latches the data and lowers the enable wire, the sender lowers all data wires, and, finally, the receiver raises the enable wire when it is ready to accept new data. Since the data wires start and end the handshake with their values in the lowered position, this handshake is an example of an asynchronous return-to-zero protocol. The cycle time of a pipeline stage is the time required to complete one four-phase handshake. The throughput, or the inverse of the cycle time, is the rate at which tokens travel through the pipeline.
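The four-phase handshake can be made concrete with a small simulation. The sketch below is ours, not the paper's circuits; the function and variable names are hypothetical. It walks one token through the return-to-zero protocol described above, recording the state of the two data wires and the enable wire at each step.

```python
# Hypothetical sketch of the four-phase, dual-rail, return-to-zero
# handshake: one token per complete handshake cycle.

def four_phase_send(bit):
    """Yield (wire0, wire1, enable) snapshots for one handshake cycle."""
    wire0, wire1, enable = 0, 0, 1          # idle: data low, enable high
    states = [(wire0, wire1, enable)]
    # 1. Sender sets one of the two data wires (1-of-2 dual-rail code).
    wire0, wire1 = (1, 0) if bit == 0 else (0, 1)
    states.append((wire0, wire1, enable))
    # 2. Receiver latches the value and lowers the enable wire.
    enable = 0
    states.append((wire0, wire1, enable))
    # 3. Sender lowers all data wires (return to zero).
    wire0, wire1 = 0, 0
    states.append((wire0, wire1, enable))
    # 4. Receiver raises enable: ready to accept the next token.
    enable = 1
    states.append((wire0, wire1, enable))
    return states

# At most one data wire is ever high, so no timing assumption is
# needed to distinguish "new data" from "no data".
for s in four_phase_send(1):
    assert s[0] + s[1] <= 1
```

Note that the cycle begins and ends in the same idle state, which is why the cycle time of a pipeline stage is exactly the time for one such handshake.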

3.2 Retiming and Slack Elasticity

A slack-elastic system [14] has the property that increasing the pipeline depth, or slack, on any channel will not change the logical correctness of the original system. This property allows a designer to locally add pipelining anywhere in a slack-elastic system without having to adjust or resynthesize the global pipeline structure of the system (although this can be done for performance reasons, it is not necessary for correctness). While many asynchronous systems are slack elastic, including an entire high-performance microprocessor [16], any nontrivial clocked design will not be slack elastic because changing local pipeline depths in a clocked circuit often requires global retiming of the entire system.³ In the rest of this paper, we will consider only asynchronous systems that are slack elastic.

To simplify logic synthesis and channel routing, we design the pipelines in our asynchronous FPGAs so that they are slack elastic. This allows logic blocks and routing interconnects to be implemented with a variable number of pipeline stages, whose pipeline depth is chosen for performance and not for correctness. More importantly, in a pipelined interconnect, channel routes can go through an arbitrary number of interconnect pipeline stages without affecting the correctness of the logic. This makes explicit banks of retiming registers, a significant overhead in pipelined clocked FPGAs [24], unnecessary.
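Slack elasticity is easy to illustrate in software. The following sketch is our illustration, not the paper's FPGA; `run_pipeline` and its token-shuffling loop are hypothetical. It shows that adding FIFO buffer stages to a channel changes only when tokens arrive, never which tokens arrive or in what order.

```python
# Hypothetical model: a channel as a chain of simple FIFO buffer stages.
# Extra stages (slack) add latency but leave the token stream unchanged.
from collections import deque

def run_pipeline(tokens, extra_slack):
    """Push tokens through 1 + extra_slack FIFO stages; return outputs."""
    stages = [deque() for _ in range(1 + extra_slack)]
    out = []
    inp = list(tokens)
    while inp or any(stages):
        if stages[-1]:
            out.append(stages[-1].popleft())   # last stage emits a token
        for i in range(len(stages) - 1, 0, -1):
            if stages[i - 1]:
                stages[i].append(stages[i - 1].popleft())  # shift forward
        if inp:
            stages[0].append(inp.pop(0))       # first stage accepts input
    return out

tokens = [3, 1, 4, 1, 5]
# Same logical behavior regardless of how much slack the channel has.
assert run_pipeline(tokens, 0) == run_pipeline(tokens, 7) == tokens
```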

3.3 Dataflow Computations

Logic computations in asynchronous pipelines behave like fine-grain static dataflow systems [3], where a token traveling through an asynchronous pipeline explicitly indicates the flow of data. Channel handshakes ensure that pipeline stages consume and produce tokens in sequential order so that new data items cannot overwrite old data items. In this dataflow model, data items have one producer and one consumer. Data items needed by more than one consumer are duplicated by copy processes that produce a new token for every concurrent consumer. In contrast, clocked logic uses a global clock to separate data items in a pipeline, which allows data items to fan out to multiple receivers because they are all synchronized to the clock. Furthermore, the default behavior for a clocked pipeline is to overwrite data items on the next clock cycle, regardless of whether they were actually used for a computation in the previous cycle.

¹ We use the enable signal, the logical inverse of the acknowledge signal used in traditional asynchronous handshake protocols, because it yields smaller circuits.

² An alternative to dual-rail encodings is to use one normal data wire and one "data valid" wire, which indicates that there is a valid data bit on the data wire. However, this alternative encoding introduces a delay assumption that the data-valid wire will switch after the bit on the data wire has stabilized. There are no such timing assumptions for dual-rail encodings since only one of the data wires will switch to indicate a new data value.

³ This limitation in a clocked system can be lifted by adding valid bits to all data, which emulates an asynchronous handshake at the granularity of a clock cycle.

Consider the dataflow system whose graph is shown in Fig. 2a, which implements the function y_n = y_(n-1) + c(a + b). Fig. 2b shows the dataflow graph after it has been transformed into an asynchronous token-based pipeline, where dataflow values are communicated asynchronously as tokens along the edges of the dataflow graph. Observe that forks in the pure dataflow graph are replaced by copy nodes in the asynchronous pipeline and that an initial y_(n-1) token is needed at reset to initialize the state feedback loop. Fig. 2c shows one possible clocked implementation of the pure dataflow graph, where dataflow values are communicated synchronously along the edges of the dataflow graph. While dataflow nodes in the pure dataflow graph turn into fine-grain pipeline stages in the asynchronous dataflow pipeline, additional clocked registers are introduced in the clocked dataflow pipeline to determine its pipelining explicitly. Note that a retiming register in the clocked dataflow graph is added along the input edge containing c so that all the inputs are sampled on the same clock cycle, whereas, in the asynchronous dataflow graph, no retiming registers are necessary because the asynchronous dataflow system is slack elastic.

To synthesize logic for asynchronous dataflow architectures, a designer only needs to understand how to program for this token-based dataflow computation model and is not required to know the underlying asynchronous pipelining details. This type of asynchronous design, unlike clocked design, separates logical pipelining from physical pipelining. Dataflow nodes in an asynchronous dataflow graph represent logical pipeline stages, whereas each dataflow node can be built from an arbitrary number of physical pipeline stages. An application that is verified to functionally operate at the dataflow level is, under the properties of slack-elastic systems, guaranteed to run on any physical implementation of the dataflow graph. For example, an asynchronous application that operates correctly on an asynchronous FPGA containing one pipeline stage per logic block will also work correctly on an asynchronous FPGA containing two pipeline stages per logic block, without requiring the retiming conversions needed by clocked applications. This property simplifies the development and porting of applications between different asynchronous dataflow FPGA designs.

Fig. 1. Asynchronous data communication. (a) Channel representation. (b) Handshake representation. (c) Dual-rail representation. (d) Four-phase return-to-zero protocol.

Fig. 2. Computation of y_n = y_(n-1) + c(a + b): (a) pure dataflow graph, (b) token-based asynchronous dataflow pipeline (filled circles indicate tokens, empty circles indicate an absence of tokens), and (c) clocked dataflow pipeline.
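As an illustration of the dataflow computation of Fig. 2, the recurrence can be evaluated token by token. The sketch below is ours; the decomposition into add, multiply, and feedback steps follows Fig. 2a, but the function and variable names are hypothetical.

```python
# Hypothetical token-by-token evaluation of y_n = y_(n-1) + c*(a + b),
# mirroring the asynchronous pipeline of Fig. 2b: an initial y token
# primes the state feedback loop at reset.

def dataflow_y(a_tokens, b_tokens, c_tokens, y_init=0):
    """Consume one token from each input stream per iteration and
    produce one output token, exactly like the dataflow graph."""
    y_prev = y_init                  # initial token on the feedback edge
    out = []
    for a, b, c in zip(a_tokens, b_tokens, c_tokens):
        s = a + b                    # function node: add
        p = c * s                    # function node: multiply
        y = y_prev + p               # function node: add with feedback
        y_prev = y                   # copy node feeds the result back
        out.append(y)
    return out

assert dataflow_y([1, 2], [1, 0], [2, 3], y_init=0) == [4, 10]
```

Because the system is slack elastic, inserting buffer stages on any of these edges would change only the latency of the computation, not the output token sequence.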

3.4 Asynchronous Dataflow Nodes

Using the dataflow nodes shown in Fig. 3, we can build any deterministic asynchronous dataflow graph [22]. These asynchronous dataflow nodes behave as follows:

1. Copy: This dataflow node duplicates tokens to n receivers. It receives a token on its input channel and copies the token to all of its output channels.

2. Function: This dataflow node computes arbitrary functions of n variables. It waits until tokens have been received on all of its input channels and then generates a token on its output channel.

3. Merge: This dataflow node performs a two-way controlled token merge and allows tokens to be conditionally read on channels. It receives a control token on channel C. If the control token has a zero value, it reads a data token from channel A; otherwise, it reads a data token from channel B. Finally, the data token is sent on channel Z. A merge node is similar to a clocked multiplexer except that a token on the unused conditional input channel will not be consumed and need not be present for the merge node to process tokens on the active input data channel. As shown in Fig. 4a, multiway merge blocks can be constructed from a combination of two-way merge nodes, two-way split nodes, and copy nodes.

4. Split: This dataflow node performs a two-way controlled token split and allows tokens to be conditionally sent on channels. It receives a control token on channel C and a data token on channel A. If the control token has a zero value, it sends the data token on channel Y; otherwise, it sends the data token on channel Z. A split node is similar to a clocked demultiplexer except that no token is generated on the unused conditional output channel in its asynchronous implementation. As shown in Fig. 4b, multiway split blocks can be built using a combination of two-way split nodes and copy nodes.

An asynchronous dataflow FPGA architecture provides the efficient hardware support necessary to implement single-bit versions of the dataflow nodes shown in Fig. 3. In addition, an asynchronous dataflow FPGA provides token-management mechanisms to create tokens at reset on selected channels, to consume (sink) extraneous tokens, and to generate (source) tokens with constant values. By building multibit dataflow nodes from these single-bit FPGA nodes, we can prototype arbitrary asynchronous dataflow graphs on an asynchronous dataflow FPGA. In the next section, we examine how to place dataflow nodes into the logic blocks of an asynchronous FPGA.
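To make the node semantics concrete, here is a behavioral sketch (ours, not the paper's circuit implementation; all function names are hypothetical) of the four dataflow nodes of Fig. 3 operating on finite token streams represented as Python lists. Note how merge never touches the unused input channel and split never produces a token on the unused output channel.

```python
# Hypothetical behavioral models of the copy, function, merge, and
# split dataflow nodes, with channels as lists of tokens.

def copy_node(tokens, n):
    """Duplicate each input token onto n output channels."""
    return [list(tokens) for _ in range(n)]

def function_node(f, *channels):
    """Wait for a token on every input, then emit f(inputs)."""
    return [f(*args) for args in zip(*channels)]

def merge_node(ctl, a, b):
    """For each control token, consume from A (ctl == 0) or B (ctl == 1);
    the unused channel's token is neither needed nor consumed."""
    ia, ib = iter(a), iter(b)
    return [next(ia) if c == 0 else next(ib) for c in ctl]

def split_node(ctl, a):
    """Route each data token to Y (ctl == 0) or Z (ctl == 1); the unused
    output channel receives no token at all."""
    y, z = [], []
    for c, tok in zip(ctl, a):
        (y if c == 0 else z).append(tok)
    return y, z

assert merge_node([0, 1, 0], [10, 30], [20]) == [10, 20, 30]
assert split_node([0, 1, 1], [7, 8, 9]) == ([7], [8, 9])
```

In the single-bit FPGA versions described above, the tokens would be individual bits; multibit nodes are composed from these one-bit primitives.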

4 ASYNCHRONOUS DATAFLOW FPGA ARCHITECTURES

To evaluate the feasibility of asynchronous dataflow FPGA architectures, we developed two different FPGA designs. Our first design [20], which we refer to as our base architecture, used nonclustered logic blocks, blocking carry chains, and nonpipelined interconnects. While the performance of our first dataflow FPGA was at least 10 times faster than the most recent fabricated asynchronous FPGA [9], we observed that its performance and logic density could be improved further with a redesign of the microarchitecture and circuits. Our second FPGA design [21], which we refer to as our optimized architecture, corrected the major inefficiencies in our first design. Notably, the optimized design included clustered logic blocks, early-out carry chains, and pipelined interconnects. A comparison of the architectural features of these designs is listed in Table 1.

Fig. 3. Dataflow control and computation nodes (filled circles indicate tokens, empty circles indicate an absence of tokens). (a) Token copy. (b) Logic function. (c) Token merge. (d) Token split.

Fig. 4. Four-way conditional dataflow blocks, where G is a two-bit control channel (G0 is the lower bit of the control channel and G1 is the upper bit). (a) Four-way merge. (b) Four-way split.

4.1 Common Architectural Features

The asynchronous FPGAs presented in this section use "island-style" architectures that consist of logic blocks surrounded by programmable interconnect tracks. We use logic blocks with four inputs and four outputs, equally distributed on their north, east, south, and west edges. A routing track is a dual-rail asynchronous channel composed of three wires, and we assume that there are four routing tracks between each logic block.

Configuration of our asynchronous dataflow FPGAs is done using clocked SRAM-based circuitry. This allows us to take advantage of the same configuration schemes used in clocked FPGAs. In our FPGA designs, we constructed the simplest configuration method using shift-register-type configuration cells that connect serially throughout the FPGA chip. During programming, the asynchronous portion of the logic is held in a passive reset state while the configuration bits are loaded. The configuration clocks are disabled after programming is complete, and the asynchronous logic is taken out of reset and allowed to operate.
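The serial configuration scheme amounts to one long shift register. The toy model below is ours (the paper gives no such code; the function name is hypothetical): it shifts a bitstream through a chain of configuration cells, one bit per configuration clock.

```python
# Hypothetical model of shift-register-type configuration: each cell
# passes its bit to its neighbor on every configuration clock edge.

def load_configuration(bitstream):
    """Shift the bitstream serially through a chain of config cells;
    return the final cell contents after all bits are clocked in."""
    cells = [0] * len(bitstream)
    for bit in bitstream:            # one configuration clock per bit
        cells = [bit] + cells[:-1]   # every cell shifts toward the end
    return cells

# After len(bitstream) clocks, the first bit shifted in has reached
# the far end of the chain (the chain holds the bitstream reversed).
assert load_configuration([1, 0, 1, 1]) == [1, 1, 0, 1]
```

While the bits are being shifted, the asynchronous logic would be held in reset, as described above; only after the configuration clocks stop is the array released to operate.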

4.2 Base Dataflow FPGA Architecture

Fig. 5 shows the logic block and interconnect architecture used in the base dataflow FPGA architecture.

4.2.1 Logic Block

The pipeline structure of a logic block in our base FPGA architecture is shown in Fig. 5. A nonpipelined input switch routes channels from the four physical input ports (Nin, Ein, Sin, Win) to three internal logical channels (A, B, C), which provide input tokens to the logic core. When an internal channel is not connected to one of the input ports, a constant token source (not shown in the figure) is connected instead to prevent the logic core from deadlocking due to lack of tokens.

While the logic core contains four distinct logical units, it is not a true clustered logic block because only one of the logical units can be enabled per logic core. The function unit is primarily used to compute arbitrary functions of up to three input variables using a 3-input LUT. To support carry chains and to avoid additional circuit layout, we added another 3-input LUT to the function unit to compute carry-out values. This carry chain design is blocking because its carry stages produce a carry-out token only after receiving a carry-in token, even when the carry-out computation does not depend on the value of the carry-in. Such a carry chain is limited, however, by the delay of rippling a carry along the full length of the chain. The merge and split units are circuit implementations of the two-way merge and split dataflow nodes described in Section 3.4. The state unit generates a single token at reset, after which it behaves like a simple FIFO pipeline and is used to initialize tokens in asynchronous dataflow computations.

TABLE 1. Comparison of Base and Optimized Asynchronous Dataflow FPGA Designs
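The difference between a blocking carry chain and the early-out carry chains of the optimized design can be sketched behaviorally. The model below is ours (the paper gives no such code); counting carry-in waits is a hypothetical proxy for ripple delay. An early-out stage whose input bits are equal can emit its carry-out token without waiting for carry-in, since two equal bits either generate or kill the carry regardless of carry-in.

```python
# Hypothetical model: a blocking carry stage always waits for its
# carry-in token; an early-out stage emits carry-out immediately when
# a == b (generate for 1,1; kill for 0,0). Bits are LSB first.

def carry_chain(a_bits, b_bits, early_out):
    """Return (final carry, number of stages that had to wait for
    their carry-in token before producing carry-out)."""
    carry, waits = 0, 0
    for a, b in zip(a_bits, b_bits):
        if early_out and a == b:
            carry = a                # carry-out independent of carry-in
        else:
            waits += 1               # must wait for the carry-in token
            carry = (a + b + carry) >> 1
    return carry, waits

# Propagate-only inputs: both variants ripple the full chain length.
assert carry_chain([1, 1, 1], [0, 0, 0], True)[1] == 3
# Mixed inputs: the early-out chain skips generate/kill stages.
assert carry_chain([1, 0, 1], [1, 0, 0], False)[1] == 3
assert carry_chain([1, 0, 1], [1, 0, 0], True)[1] == 1
```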

The logic core sends its result tokens to two output copy pipeline stages, which duplicate and route these tokens to the logic block's physical output ports (Nout, Eout, Sout, Wout) as specified by the application logic. Output copy stages can also sink result tokens before they reach the output ports when they are not needed by other logic blocks (e.g., frequently only one output token of the split unit is needed).

4.2.2 Interconnect

Our base FPGA architecture uses nonpipelined, unidirectional channel switches to statically route tokens between logic blocks. We call this architecture pseudo-island-style because arbitrary corner routing is not possible since there are no switch boxes. Due to the token-based nature of asynchronous logic design, where tokens have unique senders and receivers, all channel routes in this asynchronous interconnect architecture are point-to-point and do not fan out to multiple receivers. Since result tokens needed by more than one logic block are duplicated at the output copy stage and consequently routed on different channels, asynchronous logic may require more local routing resources than a clocked design. However, this local routing overhead may be offset by the lack of global routing resources (e.g., global signals and clocks) that are not needed in asynchronous designs.

This interconnect topology was designed primarily to evaluate the circuit-level performance of our pipelined logic blocks, and it is not practical for large logic designs due to its lack of switch boxes. For example, the unidirectional channel switch style of interconnect is incompatible with most existing island-style place-and-route FPGA tools, which assume a bidirectional connection box and switch box interconnect architecture. More significantly, since these channel switches are not pipelined, they limit the performance of the overall system, even for logic with short routes.

4.3 Optimized Dataflow FPGA Architecture

After implementing the base dataflow FPGA architecture, we felt there were sufficient architectural, circuit, and logic density inefficiencies to justify a redesign. More precisely, we implemented the following improvements in our optimized design, which is shown in Fig. 6.

Fig. 5. Base asynchronous dataflow FPGA architecture. (a) Pseudo-island-style architecture with nonpipelined channel switches. (b) Pipelined logic block (only one computation unit can be enabled).


1. Pipelined Interconnect: To take advantage of open-source FPGA place and route tools, we adopted a standard island-style interconnect topology that uses connection boxes and switch boxes. Since the throughput in our base design was limited by its nonpipelined interconnect, we improved the performance of our optimized design by using pipelined switch boxes.

2. Deeply Pipelined Logic: To maintain high throughput inside of the logic block, we added buffer pipeline stages on the input channels to decouple the connection box switches from the internal routing muxes. Similar to the state unit, these input pipeline stages are also capable of optionally generating tokens at reset. Improved circuit techniques allowed the two output copy stages in the base design to be merged into a single pipeline stage in our optimized architecture, decreasing the output copy area by 35 percent. Instead of implementing token sinks in the output copy stage, we added a dedicated sink unit that lets the output copy route tokens from more logical units. Finally, we added internal routing channels that bypass the function unit, which reduces the pipeline latency through the logic block for pure copy logic.

1382 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 11, NOVEMBER 2004

Fig. 6. Optimized asynchronous dataflow FPGA architecture. (a) Island-style architecture. (b) Nonpipelined connection boxes. (c) Pipelined switch box. (d) Pipelined switch point (two WCHBs per switch point). (e) Pipelined logic block (multiple computation units can operate concurrently).

3. Clustered Logic Blocks: Clustered logic blocks allow multiple logical units to be active in the same logic block. Although we found clustered logic blocks to be 25 percent larger than nonclustered logic blocks, a clustered FPGA requires approximately 33 percent fewer logic blocks than a nonclustered FPGA to implement the same asynchronous logic application.

4. 4-Input LUTs: To improve logic density and to simplify the remapping of clocked applications that were optimized for commercial FPGAs, we increased the LUT size in the function unit to handle arbitrary functions of four input variables.

5. Early-out Carry Logic: When a carry-out computation is not dependent on the value of its carry-in, early-out carry chain stages generate a carry-out without waiting to receive the carry-in. Early-out carry chains allow the construction of asynchronous ripple-carry adders that exhibit average-case behavior in their carry chains, in contrast to the worst-case behavior of clocked ripple-carry adders. We also added dedicated carry chain routing channels to further improve the performance of the carry chains in our optimized FPGA.

6. Internal State Feedback: In the base dataflow FPGA architecture, function units needing state feedback required an additional logic block configured with its state unit enabled. To remove this inefficiency, we added routing switches inside of the logic block that allow the function unit to internally feed back on itself through the state unit. The function unit, in our optimized design, is also able to internally route its output tokens to the conditional unit.

7. Unified Conditional Unit: For the logic blocks under consideration in this paper, namely, those with four independent input channels and two independent output channels, it is impossible to concurrently implement both a full merge unit (three inputs, one output) and a full split unit (two inputs, two outputs) in the same logic block. In our optimized architecture, we instead combine the merge and split unit into a single conditional unit that is 40 percent smaller than the separate merge and split units used in the base design [21].
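To make the average-case claim in item 5 concrete, here is a small behavioral sketch (ours, not the authors' circuit) of an early-out ripple-carry adder. A stage that generates a carry (both inputs 1) or kills it (both inputs 0) can emit its carry-out immediately; only propagating stages (inputs differ) must wait for the carry-in, so the effective chain length is the longest run of propagating bits rather than the full word width.

```python
def early_out_add(a_bits, b_bits):
    """Ripple-carry add (LSB first) with early-out carries.

    Returns (sum_bits, waits), where waits counts the stages that
    actually had to wait for their carry-in (propagate stages).
    """
    carry, waits, out = 0, 0, []
    for a, b in zip(a_bits, b_bits):
        if a and b:            # generate: carry-out known without carry-in
            nxt = 1
        elif not a and not b:  # kill: carry-out known without carry-in
            nxt = 0
        else:                  # propagate: must wait for the carry-in
            nxt = carry
            waits += 1
        out.append(a ^ b ^ carry)
        carry = nxt
    return out, waits

# 0b0101 + 0b0011 = 0b1000 (bit lists are LSB first)
s, w = early_out_add([1, 0, 1, 0], [1, 1, 0, 0])
```

Over random operands the longest propagate run grows only logarithmically with word width, which is the source of the average-case behavior the item describes.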

5 PROGRAMMABLE ASYNCHRONOUS PIPELINES

The class of asynchronous circuits that we used to implement our asynchronous dataflow FPGA designs is quasi-delay-insensitive (QDI). QDI circuits are designed to operate correctly under the assumption that gates and wires have arbitrary finite delay, except for a small number of special wires, known as isochronic forks [15], that can safely be ignored for the circuits in this paper. Although we size transistors to adjust circuit delays and, hence, their performance, this only affects the performance of a circuit and not its correctness.

5.1 Pipelined Asynchronous Circuits

Fine-grain pipelined circuits are critical to the efficient implementation of high-throughput logic in an asynchronous dataflow FPGA architecture. Fine-grain pipelines contain only a small amount of logic (e.g., a 1-bit adder) and combine computation with data latching, removing the overhead of explicit output registers [11]. While this pipeline style has been used in several high-performance asynchronous designs [2], including a microprocessor [16], we were the first to adopt these circuits for programmable asynchronous logic applications.

Fig. 7 shows the two types of asynchronous pipelines that are used in our asynchronous dataflow FPGA designs. A weak-condition half-buffer (WCHB) pipeline stage is the smaller of the two circuits and is most useful for token buffering and token copying. A precharge half-buffer (PCHB) pipeline stage has a precharge pull-down stack optimized for performing fast token computations, where the pc signal behaves analogously to a precharge clock in synchronous domino logic. The half-buffer notation indicates that a handshake on the receiving channel, L, cannot begin again until the handshake on the transmitting channel, R, is finished [11]. A full-buffer pipeline stage is able to overlap the beginning of the receiving handshake with the end of the transmitting handshake, but we do not use full buffers in our FPGAs because they require more circuit area. Since WCHB and PCHB pipeline stages have the same dual-rail channel interfaces, they can be composed together and used in the same pipeline.
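The handshake circuits in Fig. 7 are built around the Muller C-element, a state-holding gate whose output switches only when all of its inputs agree. Its behavior is easy to state in a few lines of Python (an illustrative model, not a circuit netlist):

```python
def c_element(a, b, prev):
    """Muller C-element: output goes high when all inputs are high,
    low when all inputs are low, and otherwise holds its previous value."""
    if a and b:
        return 1
    if not a and not b:
        return 0
    return prev  # inputs disagree: hold state

# the hold behavior is what lets a half-buffer stage wait for both its
# input data and its neighbor's acknowledge before switching
```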

5.2 Asynchronous FPGA Circuits

We used WCHB pipeline stages to implement the input pipeline stages, the state units, the output copies, and the pipelined switch boxes in our dataflow FPGA architecture. The split, merge, and function units are composed primarily of PCHB pipeline stages.

Fig. 7. Asynchronous pipeline stages (the Muller C-element is a state-holding circuit that goes high when all its inputs are high and goes low when all its inputs are low). (a) Weak-condition half-buffer (WCHB) circuit. (b) Precharge half-buffer (PCHB) circuit.

Designing high-throughput asynchronous lookup tables (LUTs) was particularly challenging. Fig. 8 shows the 3-input LUT design we used in our base FPGA architecture, where the LUT is implemented as a monolithic pull-down stack in a single PCHB stage. A, B, and C are the input channels and P0..P7 (and their inverses) are the configuration bits that program the output function. We compensated for charge-sharing issues in the pull-down stack by using aggressive transistor folding and internal-node precharging techniques.

However, because of its pull-down stack complexity and noise problems, we were forced to abandon this monolithic-style LUT when designing the 4-input function unit in our optimized FPGA architecture (shown in Fig. 9). In this function unit, we divided the monolithic LUT into an address decode stage and an indexed table lookup stage, making it easier to obtain a high-throughput function unit. The address decode stage reads the four input channels and generates a 1-of-16 encoded address, which indexes into a 4-input LUT circuit that uses a modified PCHB circuit. Since there are 16 memory elements in the 4-input LUT and the 1-of-16 encoded address guarantees no sneak paths will occur in the pull-down stack, this asynchronous LUT circuit can use a virtual ground generated from the "_pc" signal instead of the foot transistors used in a normal PCHB pipeline stage. The remaining pipeline stages4 in the function unit implement pipelined carry logic that is functionally similar to the nonpipelined carry logic used in the Xilinx Virtex2 FPGA [27]. The CMUX stage implements early-out carry chains that, when the LUT output is zero, generate a carry-out token without waiting to receive the carry-in token.
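The decode-then-lookup split can be sketched functionally in Python (our illustration of the idea, not the circuit): the decoder turns four input bits into a one-hot address, and the lookup stage selects exactly one of 16 stored configuration bits, which is why no two pull-down paths can ever conflict.

```python
def decode_1of16(a, b, c, d):
    """Address decode stage: four 1-bit inputs -> one-hot 16-bit address."""
    idx = a | (b << 1) | (c << 2) | (d << 3)
    return [1 if i == idx else 0 for i in range(16)]

def lut4(config, onehot):
    """Indexed lookup stage: the single hot rail selects one stored bit,
    so exactly one 'pull-down path' is active per evaluation."""
    return sum(p & h for p, h in zip(config, onehot))

# configure a 4-input XOR: table bit i is the parity of i's bits
xor_table = [bin(i).count("1") & 1 for i in range(16)]
out = lut4(xor_table, decode_1of16(1, 0, 1, 1))  # XOR(1, 0, 1, 1)
```

Any 4-input Boolean function is just a different 16-entry `config` list, mirroring how the configuration bits program the function unit.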

The output copy stage was nontrivial to design because it supports two concurrent token copies using only four output channels such that each output channel can only be used by one of the copies. Our base FPGA architecture used a brute-force circuit approach, shown in Fig. 10a, and implemented the output copy using two programmable four-way copy stages. Each of these programmable copies has four WCHB pipeline stages to buffer the tokens being copied and a completion tree that combines the enables from the four output channels into a single enable for the input channel. Output copies are programmed by configuring the muxes on the WCHB output channels, where each programmable copy stage can copy tokens on up to four channels. A programmable copy can also be configured as a token sink by turning off all of its output muxes, preventing any of its input tokens from reaching the output channels. Although this output copy implementation does not limit the throughput of the logic block, it occupies more area than necessary because at most four of the eight WCHB pipeline stages will be configured to output tokens. Since the data rails in the input channels (Ad, Bd) drive all four WCHB stages and the entire completion tree always switches, regardless of how many WCHB stages are configured to output tokens, there is a constant energy overhead in using these copy circuits.

Fig. 8. Three-input asynchronous LUT in base dataflow FPGA architecture (the PCHB handshake control circuits are not shown).

Fig. 9. Function unit in optimized dataflow FPGA architecture. (a) Address decoder, LUT, and carry logic. (b) 4-input asynchronous LUT (PCHB circuit).

4. Unpipelined token copy stages, which duplicate tokens inside of the function unit, are described in the Appendix (which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm).

In our optimized dataflow FPGA architecture, we redesigned the output copy circuits to address these area and energy inefficiencies. Instead of using two programmable four-way copy stages and muxing their output channels, we moved the muxes to the input channels and designed a single programmable four-way copy stage. This optimized copy stage, shown in Fig. 10b, is 35 percent smaller than the copy stage in the base design and uses two programmable completion trees. A programmable completion tree is built from programmable C-elements (described in the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm) that examine only the channel enable signals that are involved in a copy operation, whereas a normal completion tree examines all of the channel enable signals. Thus, the energy consumption of the optimized output copy circuit scales with the number of output channels that are being used by a logic block.5
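The difference can be stated behaviorally (illustrative Python with hypothetical names): a normal completion tree waits on every channel enable, while a programmable one consults only the channels configured to participate, so unused channels contribute neither switching activity nor waiting.

```python
def completion_tree(enables):
    """Normal completion tree: combine every channel enable."""
    return all(enables)

def programmable_completion_tree(enables, configured):
    """Programmable completion tree: combine only the enables of
    channels that are configured to take part in the copy."""
    return all(e for e, cfg in zip(enables, configured) if cfg)

# channel 2 is idle (enable low) but also not configured: the
# programmable tree completes anyway; the normal tree would stall
done = programmable_completion_tree([1, 1, 0, 1], [1, 1, 0, 1])
```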

5.3 Physical Design

Using conservative SCMOS design rules, we laid out both the base and optimized FPGA designs in TSMC's 0.25 µm CMOS process, which is available through MOSIS. Table 2 summarizes the implementation differences between the base and optimized dataflow FPGA designs. While a greater number of configuration bits makes the optimized design 30 percent larger than the base design, the optimized design has greater logic density because multiple dataflow nodes can share a single FPGA logic block.

After running SPICE on the extracted layout, we found that both asynchronous FPGA designs have peak inter-logic-block operating frequencies of approximately 400 MHz. The performance of the base design, however, is limited by its nonpipelined interconnect, and logic performance will quickly degrade for routes going through more than one channel switch. For example, the Booth multiplier core in Table 2 operates at 222 MHz, 45 percent slower than the FPGA's peak speed, because it contains a channel route that goes through six channel switches. To remove this performance bottleneck, we pipelined the interconnect in our optimized design to ensure that logic will operate close to the peak frequency of the FPGA; the improvement can be seen in Table 2 for the two small arithmetic circuits. The base FPGA design was simulated exclusively in SPICE, which restricted us to these very small logic circuits. For our optimized FPGA architecture, however, we first used SPICE to obtain delay values from the extracted layout and then used these values to back-annotate a detailed asynchronous switch-level simulator of the FPGA. This allowed us to accurately evaluate the FPGA's performance on large asynchronous benchmark circuits (described in Section 7.1).

Fig. 10. Output copy circuits. (a) Base dataflow FPGA architecture. (b) Optimized dataflow FPGA architecture.

5. Since the optimized design has a dedicated sink unit, its copy stage only consumes energy when performing a copy.

The maximum energy consumption for a logic block configured with its function unit enabled is 26 pJ/cycle in the base design and 18 pJ/cycle in the optimized FPGA design.6 In addition, the interconnect energy consumption per pipelined switch point is 3 pJ/cycle in the optimized design. While the optimized design is more deeply pipelined, it consumes less power than the base design because of the circuit optimizations that were discussed in the previous section. We caution that these energy estimates are conservative due to our preliminary layout, and we have yet to optimally size the transistors on noncritical circuit paths. Since asynchronous circuits do not glitch and consume power only when they contain data tokens, the power consumption of an asynchronous FPGA is event driven and is proportional to the number of tokens traveling through the system. In contrast, a synchronous logic block consumes clock and potentially glitch power on every clock cycle, regardless of whether its output is used on a particular clock cycle.

Due to their dissimilar logic block and interconnect resources, it is difficult to compare our asynchronous dataflow FPGA architecture with synchronous FPGA designs. Nonetheless, in Table 3, we compare our optimized asynchronous dataflow FPGA with recent high-performance synchronous FPGAs. The Xilinx Virtex2 [27] is a commercial clocked FPGA with a traditional unpipelined interconnect. HSRA [24] and SFRA [25] are research FPGA designs that add pipelining resources to the logic blocks and interconnect of a clocked FPGA. The FPGA sizes in Table 3 are normalized to a tile containing a 4-input LUT logic block and an amortized portion of the interconnect. To provide a fair resource comparison, we scaled the interconnect area in our optimized asynchronous dataflow FPGA design for interconnects with 15, 20, and 30 horizontal and vertical routing tracks. An open question that we do not address in this paper is whether an asynchronous dataflow FPGA requires more or fewer interconnect tracks than a clocked FPGA. After adjusting for process differences, our asynchronous dataflow FPGA designs are about twice as fast as current clocked FPGAs. However, this performance is achieved with a 2.1x to 6.6x cost in area over the commercial Virtex2 FPGA. Upward of 71 percent of the area overhead is a result of the highly pipelined logic blocks and interconnect and is comparable to the area overheads in the pipelined HSRA and SFRA architectures. The remaining area overhead in our asynchronous dataflow FPGA architecture is most likely due to the dual-rail domino circuit implementation and the three-wire asynchronous interconnect tracks.

6 PERFORMANCE ANALYSIS

Since the base FPGA design was our first experience with dataflow FPGA architectures, we primarily focused on implementing the new circuits needed for an asynchronous dataflow FPGA and did not concentrate on performance analysis and optimization. Afterward, we realized that the base architecture's nonpipelined interconnect significantly limited its performance because a route could pass through many channel switches. Using the results of this section, we adopted a circuit guideline for the logic block and interconnect circuits in our optimized FPGA architecture that allowed at most two pass-gate switches between the asynchronous pipeline stages; this limit was chosen to ensure high-throughput programmable logic and to maintain good signal integrity through the pass-gate switches. This guideline forced us to pipeline the switch boxes in the optimized FPGA interconnect, significantly improving its throughput on long channel routes.

TABLE 2: Physical Implementation of Dataflow FPGA Architectures

6. The absolute energies reported by our asynchronous SPICE simulator have not been validated and are suspected to be higher than the actual values; however, we deem that the relative energies are reasonable to compare against each other.

6.1 Circuit Performance

The pipeline dynamics of asynchronous pipelines, due to their interdependent handshaking channels, are quite different from the dynamics of clocked pipelines. To operate at full throughput, a token in an asynchronous pipeline must be physically spaced across multiple pipeline stages, whereas, in a clocked pipeline, the optimum results when there is one token per stage [26]. We define n0 as the optimal number of pipeline stages that a token should occupy to achieve maximum throughput. For circuits used in our FPGA designs, n0 ranges from five to eight pipeline stages per token (for pipelines without switches). If a pipeline has fewer stages per token than n0, it will operate at a slower than maximal frequency but consume less energy [19]. On the other hand, if the pipeline has more stages per token than n0, it will both operate slower and consume more energy than the optimal case.
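The existence of an optimal spacing n0 can be illustrated with a rough first-order ring model (ours, not the paper's SPICE-calibrated model; the latencies and the half-buffer capacity assumption below are arbitrary illustration choices). With one token circulating in a ring of n half-buffer stages, throughput is token-limited when n is large (the token must traverse all n stages) and bubble-limited when n is small (empty slots must circulate backward to make room); the peak sits at the crossover.

```python
def ring_frequency(n, t_fwd=1.0, t_bub=1.0):
    """First-order throughput of a ring of n half-buffer stages holding
    one token (illustrative model, arbitrary time units).

    Token-limited:  one revolution of the token costs n * t_fwd.
    Bubble-limited: a half-buffer ring of n stages holds roughly n/2
    tokens' worth of slots, so n/2 - 1 bubbles must each circulate.
    """
    token_limited = 1.0 / (n * t_fwd)
    bubble_limited = (n / 2.0 - 1.0) / (n * t_bub)
    return min(token_limited, bubble_limited)

# sweep ring sizes to find the peak-throughput spacing
freqs = {n: ring_frequency(n) for n in range(3, 16)}
n0 = max(freqs, key=freqs.get)
```

With these placeholder latencies the peak lands at a small handful of stages per token; the paper's measured five-to-eight-stage optimum reflects the real circuit latencies rather than this toy model.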

To observe the pipeline dynamics when there are programmable switches between pipeline stages, we modeled an asynchronous FPGA pipeline as a linear pipeline of n weak-condition pipeline stages that contain a variable number of routing switches between each pipeline stage. This model uses layout from the state unit of the base FPGA design, has n0 = 5, and measures all results from full SPICE simulations. This model gives an upper bound on the performance of asynchronous FPGA pipelines and shows the behavioral trends of inserting switches between fine-grain asynchronous pipeline stages.

Fig. 11a shows the maximum operating frequency curves for our model pipeline when there are K routing switches between every pipeline stage (K = 0 is the "custom" case when there are no switches between stages). We observe that, as K increases, n0 decreases from five stages to four stages and the frequency curves shift downward because the switches uniformly increase the cycle time of every pipeline stage. Fig. 11b shows the effect of one pipeline stage having a long route through L switches (when the other pipeline stages have no switches). In this case, the frequency curves flatten as L increases because the cycle time of the pipeline is mainly determined by the cycle time of the stage containing the long route (i.e., the long route behaves as a pipeline bottleneck).

TABLE 3: Asynchronous Dataflow FPGA Comparison with Clocked FPGAs

Fig. 11. Maximum operating frequency curves for one token in a linear pipeline of n weak-condition pipeline stages when (a) there are K routing switches between every pipeline stage and (b) one pipeline stage has a long route through L switches.

In addition to decreasing their operating frequency, the energy consumption of asynchronous pipelined circuits also increases when routing switches are added between pipeline stages. To observe the energy effect of adding switches to asynchronous pipelines, we use the Eτ² energy-time metric [16], [19]. E is the energy consumed in the pipeline per cycle and τ is the cycle time (1/f). Since E is proportional to V² and τ is proportional to 1/V, to first order this metric is independent of voltage and provides an energy-efficiency measure to compare both low-power designs (low voltage) and high-performance designs (normal voltage). Fig. 12 shows energy-efficiency curves for our model pipeline under the two switch scenarios examined earlier (lower values imply more energy efficiency). We can see that, as the number of switches increases between pipeline stages, the optimal n0 in terms of energy efficiency decreases.
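The voltage independence follows in one line from the stated proportionalities:

```latex
E \propto C V^2, \qquad \tau \propto \frac{1}{V}
\quad\Longrightarrow\quad
E\tau^2 \;\propto\; C V^2 \cdot \frac{1}{V^2} \;=\; C .
```

So, to first order, scaling the supply voltage moves a design along a constant-Eτ² curve, trading speed for energy without changing its energy efficiency.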

While the plots in this section show that (as expected) adding routing switches to full-custom, high-throughput pipelined circuits decreases both their speed and energy efficiency, they also show that enough performance remains (roughly 50 percent) to make them attractive for high-speed programmable asynchronous logic. The maximum operating frequency and energy-efficiency curves for an asynchronous FPGA pipeline will look like a mixture of the two switch scenarios we investigated since their pipeline stages may have a varying number of switches between them, depending on the particular FPGA implementation.

6.2 Pipeline Performance

In the optimized dataflow FPGA interconnect, a channel connecting two logic blocks can be routed through an arbitrary number of pipelined switch boxes without changing the correctness of the resulting logic system since our asynchronous FPGAs are slack elastic. However, the system performance can still decrease if a channel is routed through a large number of switch boxes. To determine the sensitivity of pipelined logic performance to channel route lengths, we varied the number of switch boxes along a route for typical asynchronous pipelines. The numbers in this section were obtained from the detailed switch-level simulator of our optimized dataflow FPGA architecture [21].

Fig. 13a shows the performance of a branching linear pipeline using pipelined switch boxes and the FPGA logic blocks configured as function units. Tokens are copied in logic block LB1, travel through both branches of the pipeline, and join in logic block LB2. Since the speed of the function unit is the throughput-limiting circuit in the optimized dataflow FPGA design, this pipeline topology gives an accurate measure of linear pipeline performance. The frequency curve shows that an asynchronous pipelined interconnect can tolerate a relatively large pipeline mismatch (four switch boxes) before the performance begins to gradually degrade. This indicates that, as long as branch pathways have reasonably matched pipelines, we do not need to exactly balance the length of channel routes with a bank of retiming registers. In contrast, in clocked FPGAs, it is necessary for correctness to exactly retime synchronous signals routed on pipelined interconnects using banks of retiming registers [24].

Fig. 13b shows the performance trends for token-ring pipelines using the FPGA logic blocks configured as function units such that one token is traveling around the ring. For pipelined interconnects, adding switch box stages to a token ring will decrease its performance, which indicates that the routes of channels involved in token rings should be made as short as possible. The frequency curves in Fig. 13b are worst-case because all the logic blocks were configured with their function units enabled, requiring a token to travel through five pipeline stages per logic block. If the logic blocks were instead configured to use the conditional unit or the low-latency copies, then the token-ring performance would approach the performance of a linear pipeline because a token would travel through fewer pipeline stages. In addition, token rings used to hold state variables can often be implemented using the state unit, which localizes the token ring inside the logic block and has the same throughput as a linear pipeline.

7 APPLICATIONS

Logic synthesis for an asynchronous dataflow FPGA follows formal synthesis methods similar to those used in the design of full-custom asynchronous circuits [13], whose steps are shown in Fig. 14. We begin with a high-level sequential specification of the logic and apply semantics-preserving program transformations to partition the original specification into high-level concurrent function blocks. The function blocks are further decomposed into sets of fine-grain, highly concurrent processes that are guaranteed to be functionally equivalent to the original sequential specification. To maintain tight control over performance, this decomposition step is usually done manually in full-custom designs. However, for FPGA logic synthesis, we developed a concurrent dataflow decomposition [22] method that automatically produces fine-grain processes by detecting and removing all unnecessary synchronization actions in the high-level logic specification. The resulting fine-grain processes are small enough (i.e., bit-level) to be implemented directly by the logic blocks of the asynchronous dataflow FPGA architectures. Currently, the place/route and configuration steps are automated7 and we are working to automate the logic packing step.

Fig. 12. Energy-efficiency curves for one token in a linear pipeline of n weak-condition pipeline stages when (a) there are K routing switches between every pipeline stage and (b) one pipeline stage has a long route through L switches.

Fig. 15 illustrates the logic synthesis steps for an asynchronous dataflow FPGA architecture. We begin with a sequential functional specification, where upper-case names indicate channel interfaces and lower-case names represent local variables.8 Reading a value from channel X and storing it in variable x is written as "X?x" and sending the value of variable x onto channel X is written as "X!x." We designed a dataflow-style logic compiler [22] that transforms these high-level variables into the data tokens of a concurrent asynchronous dataflow graph whose dataflow computation is functionally equivalent to the original sequential specification. Finally, the nodes in this dataflow graph are mapped onto the logic blocks of an asynchronous dataflow FPGA.
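The channel notation maps naturally onto blocking queue operations. An illustrative Python rendering (the process body and all names are our own hypothetical example, not taken from Fig. 15) of a fine-grain process that repeatedly reads tokens from channels A and B and sends their XOR on S:

```python
from queue import Queue

def xor_process(A, B, S, n):
    """Hypothetical fine-grain process: A?a; B?b; S!(a xor b), n times.
    Queue.get/put model the blocking receive (X?x) and send (X!x)."""
    for _ in range(n):
        a = A.get()   # A?a
        b = B.get()   # B?b
        S.put(a ^ b)  # S!(a xor b)

A, B, S = Queue(), Queue(), Queue()
for a, b in [(0, 1), (1, 1), (1, 0)]:
    A.put(a)
    B.put(b)
xor_process(A, B, S, 3)
```

In the real flow, a process this small maps directly onto one logic block, with the channel reads and writes implemented by the handshaking input stages and output copy.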

Alternatively, we can begin with a logic application that has been optimized for a clocked FPGA and manually map it to an asynchronous dataflow FPGA. This method works best for the optimized dataflow FPGA architecture because it has a 4-input LUT and carry logic architecture similar to the Xilinx Virtex2 FPGA [27]. For example, consider the scaling accumulator, shown in Fig. 16a, that is commonly used in FPGA-based FIR filters [4]. The scaling accumulator normally sums its input with a shifted value of its accumulator, and its output is ignored. When the Clear control signal is asserted, however, the previous accumulator value is cleared and the final sum is read on the output. To convert the clocked FPGA implementation, shown in Fig. 16b, we added a copy node for the shared control signal and added split nodes to make the output value conditional; the final asynchronous FPGA implementation is shown in Fig. 16c. For the benchmark applications used in this paper, which are not highly optimized, approximately 20-33 percent of the logic blocks are used only to copy tokens. This copy overhead, as compared to equivalent synchronous implementations, can be reduced by aggressively using the low-latency copies in the logic block to integrate copy operations with other computations.
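As a behavioral reference for the mapping, a scaling accumulator can be sketched as follows (our Python model of the generic datapath; the shift amount and the exact timing of the clear are illustrative assumptions, not read off Fig. 16):

```python
def scaling_accumulator(inputs, clears, shift=1):
    """Behavioral model: each cycle, acc = input + (acc >> shift).
    When clear is asserted, the final sum is emitted and acc resets;
    on all other cycles the output is ignored."""
    acc, outputs = 0, []
    for x, clear in zip(inputs, clears):
        acc = x + (acc >> shift)
        if clear:
            outputs.append(acc)  # final sum is read on the output
            acc = 0              # previous accumulator value is cleared
    return outputs

# three accumulate cycles, then a clear cycle with a zero input
outs = scaling_accumulator([4, 4, 4, 0], [0, 0, 0, 1])
```

In the dataflow version, `clear` becomes the copied control token and the conditional output becomes a split node, which is exactly the overhead the conversion in Fig. 16c adds.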

7.1 Benchmarks

Using our optimized asynchronous dataflow FPGA design, we synthesized a variety of benchmark circuits that were used in previous clocked and asynchronous designs. The benchmarks in Table 4 are classified into three categories: arithmetic, signal processing, and random circuits. Approximately half of these benchmarks were optimized for FPGA implementations (e.g., scaling accumulator, FIR filter core,9 and cross-correlator) and the other half were not developed specifically for FPGAs (e.g., Booth multiplier, systolic convolution, and write-back unit).

We placed and routed these benchmarks using VPR [1] and used its default settings, except to disable timing-driven optimizations because they assume a clocked architecture. Since VPR does not support macro placement, we hand placed several benchmarks so that they could use the fast south-to-north carry chains. We measured the performance of these benchmarks with our asynchronous switch-level FPGA simulator. While we made no effort to equalize branch mismatches or to minimize routes on token-ring pipelines and other latency-critical channels, most of the benchmarks performed within 75 percent of the FPGA's maximum throughput. In contrast, pipelined clocked FPGAs require substantial CAD support beyond the capabilities of generic place and route tools to achieve such performance [18], [24]. For example, our FPGA's performance is 36 percent slower than hand-placed benchmarks running on a "wave-steered" clocked FPGA [18], but almost twice as fast for automatically placed benchmarks.

Fig. 13. Pipeline performance of the optimized dataflow FPGA architecture. (a) Linear pipelines. (b) Token-ring pipelines.

7. Since the base FPGA design does not use a true island-style architecture, it has less place and route support than the optimized FPGA design.

8. We assume, without loss of generality, that the variables in this example are single-bit.

9. Contains only the serial adders, LUT-based multipliers, and scaling accumulator of the FIR filter [4].

Our asynchronous dataflow FPGA inherently supports bit-pipelined datapaths that allow datapath bits to be computed concurrently, whereas clocked FPGAs implement aligned datapaths that compute all datapath bits together. However, due to bit-level data dependencies and aligned datapath environments (e.g., memories or off-chip I/O), a bit-pipelined datapath in an asynchronous FPGA will behave in between that of a fully concurrent bit-pipelined datapath and a fully aligned datapath. To evaluate such a datapath fairly, we measured the performance of a 16-bit adder with a fully bit-pipelined environment and a fully aligned environment. Since the fully aligned adder datapath will exhibit data-dependent carry chain behavior, we reported the best-case and worst-case throughputs in Table 4. In the worst case, a fully aligned adder is 7 percent slower than a fully bit-pipelined adder using fast carry chains and 47 percent slower using slow carry chains.

8 CONCLUSION AND FUTURE DIRECTIONS

We introduced a new high-performance asynchronous dataflow FPGA architecture, the first such asynchronous architecture that is programmable at the pipeline level rather than at the gate level. We demonstrated the feasibility of designing highly pipelined asynchronous logic blocks, pipelined asynchronous carry logic, and pipelined asynchronous interconnects. Through detailed circuit and logic simulations, we have shown that this asynchronous dataflow FPGA architecture is a promising new way of prototyping asynchronous systems and is competitive with high-speed clocked FPGAs.

In this paper, we explored only a small realm of possible asynchronous FPGA architectures, that of an island-style FPGA with a nonsegmented pipelined interconnect. A segmented interconnect, where routing channels span more than one logic block, or a hierarchical interconnect (e.g., a tree structure) could be used to improve scalability by reducing pipeline latency on long channel routes. In addition, advanced hardware structures found in commercial FPGAs (e.g., cascade chains, LUT-based RAM, etc.) could be added to our asynchronous FPGA to improve performance and logic density.

QDI asynchronous circuits are very conservative circuits in terms of delay assumptions and, in that regard, the results presented in this paper are the "worst" performance we can achieve with asynchronous FPGA circuits. If we use more aggressive circuit techniques that rely on delay assumptions, then it is feasible to design faster and smaller asynchronous FPGAs at the cost of decreased circuit robustness.

1390 IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 11, NOVEMBER 2004

Fig. 14. Asynchronous synthesis flow.

Fig. 15. Asynchronous logic synthesis example: (a) sequential specification, (b) an equivalent dataflow graph of fine-grain asynchronous logic processes, and (c) one possible asynchronous logic block mapping onto the optimized dataflow FPGA design.


While we concentrated on designing bit-level FPGAs, asynchronous circuits may be more area and energy efficient with multibit programmable datapaths. These datapaths consist of small N-bit ALUs, which are interconnected by N-bit wide asynchronous channels that use more efficient 1ofN data encodings. For example, a 1of4 channel will use one less wire than two dual-rail channels and consume half as much interconnect switching energy.
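The wire-count and switching-energy arithmetic can be checked with a short sketch (illustrative; the model, not the code, is from the paper): a four-phase 1-of-N channel needs N data rails plus one enable wire, and exactly one data rail fires per symbol transferred.

```python
import math

def channel_cost(radix: int, bits: int):
    """Wires and data-rail transitions needed to carry `bits` bits
    using 1-of-`radix` channels (radix data rails + 1 enable each)."""
    bits_per_channel = int(math.log2(radix))
    channels = bits // bits_per_channel
    wires = channels * (radix + 1)
    transitions = channels  # one data rail fires per channel per symbol
    return wires, transitions

# Two dual-rail (1of2) channels vs. one 1of4 channel, both carrying 2 bits:
assert channel_cost(2, 2) == (6, 2)  # 6 wires, 2 data-rail transitions
assert channel_cost(4, 2) == (5, 1)  # 5 wires, 1 transition: half the energy
```

This reproduces the claim above: the 1of4 channel saves one wire (5 versus 6) and halves the number of data-rail transitions per two-bit symbol.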


TABLE 4. Benchmark Statistics for Optimized Dataflow FPGA Architecture

Fig. 16. Implementing scaling accumulators in FPGAs. (a) Scaling accumulator. (b) Clocked implementation. (c) Asynchronous implementation.


ACKNOWLEDGMENTS

This research was supported in part by the Multidisciplinary University Research Initiative (MURI) under the US Office of Naval Research Contract N00014-00-1-0564 and in part by a US National Science Foundation (NSF) CAREER award under contract CCR-9984299. John Teifel was supported in part by an NSF Graduate Research Fellowship. A preliminary version of this paper discussing the base asynchronous dataflow FPGA architecture appeared in the Proceedings of the 13th International Conference on Field Programmable Logic and Applications as "Programmable Asynchronous Pipeline Arrays" in September 2003, and a preliminary version discussing the optimized asynchronous dataflow FPGA architecture appeared in the Proceedings of the 12th ACM International Symposium on Field-Programmable Gate Arrays as "Highly Pipelined Asynchronous FPGAs" in February 2004.

REFERENCES

[1] V. Betz and J. Rose, "VPR: A New Packing, Placement, and Routing Tool for FPGA Research," Proc. Int'l Workshop Field Programmable Logic and Applications, 1997.

[2] U.V. Cummings, A.M. Lines, and A.J. Martin, "An Asynchronous Pipeline Lattice-Structure Filter," Proc. Int'l Symp. Asynchronous Circuits and Systems, 1994.

[3] J.B. Dennis, "The Evolution of 'Static' Data-Flow Architecture," Advanced Topics in Data-Flow Computing, J.-L. Gaudiot and L. Bic, eds., Prentice Hall, 1991.

[4] G. Goslin, "A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance," Xilinx Application Notes, 1995.

[5] S. Hauck, S. Burns, G. Borriello, and C. Ebeling, "An FPGA for Implementing Asynchronous Circuits," IEEE Design and Test of Computers, vol. 11, no. 3, pp. 60-69, 1994.

[6] B. Von Herzen, "Signal Processing at 250 MHz Using High-Performance FPGAs," Proc. Int'l Symp. Field Programmable Gate Arrays, 1997.

[7] Q.T. Ho, J.-B. Rigaud, L. Fesquet, M. Renaudin, and R. Rolland, "Implementing Asynchronous Circuits on LUT Based FPGAs," Proc. Int'l Conf. Field Programmable Logic and Applications, 2002.

[8] D.L. How, "A Self Clocked FPGA for General Purpose Logic Emulation," Proc. IEEE Custom Integrated Circuits Conf., 1996.

[9] R. Konishi, H. Ito, H. Nakada, A. Nagoya, K. Oguri, N. Imlig, T. Shiozawa, M. Inamori, and K. Nagami, "PCA-1: A Fully Asynchronous Self-Reconfigurable LSI," Proc. Int'l Symp. Asynchronous Circuits and Systems, 2001.

[10] S.Y. Kung, VLSI Array Processors. Prentice Hall, 1988.

[11] A. Lines, "Pipelined Asynchronous Circuits," master's thesis, California Inst. of Technology, 1995.

[12] K. Maheswaran, "Implementing Self-Timed Circuits in Field Programmable Gate Arrays," master's thesis, Univ. of California, Davis, 1995.

[13] R. Manohar, "A Case for Asynchronous Computer Architecture," Proc. ISCA Workshop Complexity-Effective Design, 2000.

[14] R. Manohar and A.J. Martin, "Slack Elasticity in Concurrent Computing," Proc. Int'l Conf. Math. of Program Construction, 1998.

[15] A.J. Martin, "The Limitations to Delay-Insensitivity in Asynchronous Circuits," Proc. Conf. Advanced Research in VLSI, 1990.

[16] A.J. Martin, A. Lines, R. Manohar, M. Nystrom, P. Penzes, R. Southworth, U.V. Cummings, and T.-K. Lee, "The Design of an Asynchronous MIPS R3000," Proc. Conf. Advanced Research in VLSI, pp. 164-181, Sept. 1997.

[17] R. Payne, "Asynchronous FPGA Architectures," IEE Computers and Digital Techniques, vol. 143, no. 5, 1996.

[18] A. Singh, A. Mukherjee, and M. Marek-Sadowska, "Interconnect Pipelining in a Throughput Intensive FPGA Architecture," Proc. Int'l Symp. Field Programmable Gate Arrays, 2001.

[19] J. Teifel, D. Fang, D. Biermann, C. Kelly, and R. Manohar, "Energy-Efficient Pipelines," Proc. Int'l Symp. Asynchronous Circuits and Systems, Apr. 2002.

[20] J. Teifel and R. Manohar, "Programmable Asynchronous Pipeline Arrays," Proc. Int'l Conf. Field Programmable Logic and Applications, Sept. 2003.

[21] J. Teifel and R. Manohar, "Highly Pipelined Asynchronous FPGAs," Proc. Int'l Symp. Field Programmable Gate Arrays, Feb. 2004.

[22] J. Teifel and R. Manohar, "Static Tokens: Using Dataflow to Automate Concurrent Pipeline Synthesis," Proc. Int'l Symp. Asynchronous Circuits and Systems, Apr. 2004.

[23] C. Traver, R.B. Reese, and M.A. Thornton, "Cell Designs for Self-Timed FPGAs," Proc. ASIC/SOC Conf., 2001.

[24] W. Tsu, K. Macy, A. Joshi, R. Huang, N. Walker, T. Tung, O. Rowhani, V. George, J. Wawrzynek, and A. DeHon, "HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array," Proc. Int'l Symp. Field Programmable Gate Arrays, 1999.

[25] N. Weaver, J. Hauser, and J. Wawrzynek, "The SFRA: A Corner-Turn FPGA Architecture," Proc. Int'l Symp. Field Programmable Gate Arrays, Feb. 2004.

[26] T.E. Williams, "Self-Timed Rings and Their Application to Division," PhD thesis, Stanford Univ., 1991.

[27] Xilinx, Virtex2 2.5V Field Programmable Gate Arrays, Xilinx Data Sheet, 2002.

John Teifel is a PhD candidate in the Computer Systems Laboratory at Cornell University. He received the MS degree in electrical and computer engineering from Cornell University in 2002 and the BS degree in electrical engineering from the California Institute of Technology in 2000. His research interests include asynchronous VLSI design, computer architecture, reconfigurable hardware, and logic synthesis. He is a recipient of a US National Science Foundation Graduate Research Fellowship. He is a student member of the IEEE.

Rajit Manohar received the BS (1994), MS (1995), and PhD (1998) degrees in computer science from the California Institute of Technology. He has been a member of the Cornell faculty since 1998, where he cofounded its Computer Systems Laboratory. He is currently an associate professor of electrical and computer engineering and a member of the graduate fields of computer science and applied mathematics. His group conducts research on all aspects of asynchronous design: circuits, VLSI, system architecture, fault-tolerance, energy efficiency, and design automation. He is a member of the IEEE and IEEE Computer Society.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.


