
Des Autom Embed Syst
DOI 10.1007/s10617-013-9118-1

Implementation and validation of architectural space exploration techniques for domain-specific reconfigurable computing

Gayatri Mehta · Alex K. Jones

Received: 16 April 2012 / Accepted: 16 July 2013
© Springer Science+Business Media New York 2013

G. Mehta (B)
University of North Texas, Denton, TX, USA
e-mail: [email protected]

A.K. Jones
University of Pittsburgh, Pittsburgh, PA, USA
e-mail: [email protected]

Abstract Domain specific coarse-grained reconfigurable architectures (CGRAs) have great promise for energy-efficient flexible designs for a suite of applications. Designing such a reconfigurable device for an application domain is very challenging because the needs of different applications must be carefully balanced to achieve the targeted design goals. It requires the evaluation of many potential architectural options to select an optimal solution. Exploring the design space manually would be very time consuming and may not even be feasible for very large designs. Even mapping one algorithm onto a customized architecture can require time ranging from minutes to hours. Running a full power simulation on a complete suite of benchmarks for various architectural options requires several days. Finding the optimal point in a design space could require a very long time. We have designed a framework/tool that makes such design space exploration (DSE) feasible. The resulting framework allows testing a family of algorithms and architectural options in minutes rather than days and allows rapid selection of architectural choices. In this paper, we describe our DSE framework for domain specific reconfigurable computing, where the needs of the application domain drive the construction of the device architecture. The framework has been developed to automate design space case studies, allowing application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some edge-detection benchmarks from the image processing domain for our case studies. We describe two search algorithms: a stepped search algorithm motivated by our manual design studies and a more traditional gradient based optimization. Approximate energy models are developed in each case to guide the search toward a minimal energy solution. We validate our search results by comparing the architectural solutions selected by our tool to an architecture optimized manually and by performing sensitivity tests to evaluate the ability of our algorithms to find good quality minima in the design space. All selected fabric architectures were synthesized on a 130 nm cell-based ASIC fabrication process from IBM. These architectures consume almost the same amount of energy on average, but the gradient based approach is more general and promises to extend well to new problem domains. We expect these or similar heuristics and the overall design flow of the system to be useful for a wide range of architectures, including mesh based and other commonly used architectures for CGRAs.

Keywords Domain specific reconfigurable computing · Coarse-grained reconfigurable architectures · Design space exploration

1 Introduction

Reconfigurable devices have great promise to address many of the problems encountered with application-specific devices by offering flexibility, faster time-to-market, and amortized non-recurring engineering (NRE) costs. During the past several years, reconfigurable computing has gained significant and increasing importance for applications such as signal and image processing, networking, cryptography, multimedia, and bioinformatics [1–4].

When reconfigurable computing is considered as a design option, the application domain (a suite of applications to be run on the device) is often well known in advance. A general purpose design may not guarantee optimal power/performance tradeoffs for that application domain, while a design fine-tuned for a single application may be inflexible. However, a reconfigurable design for an application domain can provide some of the flexibility of general purpose reconfigurable computing along with some of the optimality that can result from application specific design. This task is more challenging than application specific design, however, because the needs of different applications must be carefully balanced to achieve the targeted design goals. Designing a sophisticated reconfigurable computing platform that meets the competing needs of an application domain may require the evaluation of many potential architectural options. Exploring the design space manually would be very time consuming and may not even be feasible for large system-on-chip designs. Even mapping one application or benchmark onto a customized architecture can require time ranging from minutes to hours, and traditional algorithms such as Simulated Annealing can fail [5]. Running a full power simulation on a complete suite of benchmarks for various architectural options may require several days, making rapid exploration of design options impossible.

We introduce a framework/tool that makes DSE feasible for these scenarios. The key decisions that we take for our framework are as follows: (i) we choose fast greedy mapping algorithms with demonstrated good performance, (ii) we develop heuristics to approximate power consumption and evaluate the quality of these heuristics, and (iii) we compare two fast techniques to identify an optimum point in the design space. The resulting framework allows testing a family of algorithms and architectural options in minutes to facilitate rapid selection of architectural choices. The heuristics described here and the overall design flow of the system are not limited to stripe-based CGRAs. They can be useful for a wide range of architectures, including mesh based architectures and architectures with highly customized interconnect.

In this paper, we describe our DSE framework for domain specific CGRAs. As a testbed architecture, we select a generic, parameterized stripe-based CGRA, due to its flexibility and promise for low energy solutions [6–9]. This paper describes a tool to automate the tailoring of the chosen fabric to an application domain, including choice of interconnect, proportion of dedicated vertical routes, and number of operations per ALU, all factors that have been shown to contribute greatly to final energy consumption of the architecture [8–12]. As an application domain for design case studies, we selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some of the edge-detection benchmarks from the image processing domain. These benchmarks are of particular interest due to their applications such as speech processing and digital communications, and platforms such as cell phones and cameras.

The remainder of this paper is organized as follows: Literature review and related work are presented in Sect. 2. Section 3 provides some background material that includes description of the data flow graphs (DFGs) and reconfigurable fabric target used in this research. Section 4 describes the architectural space exploration studies. The DSE flow algorithms and the results obtained are described in Sect. 5. Section 6 discusses conclusions and considers future work.

2 Related work

Several methods have been proposed in the past few years for DSE of reconfigurable architectures [13–22]. However, the methods described in [13–18] are either technology-dependent or architecture-dependent and do not consider stripe-based fabrics in particular. They work at a low level of abstraction and provide only limited DSE around their target architecture. Bossuet et al. [19] proposed a DSE method that can be used to cover a wide domain of reconfigurable fabrics, from fine-grained to coarse-grained fabrics, as well as heterogeneous fabrics. They used the architectural processing use rate and the communication hierarchical distribution as metrics to investigate a power-efficient architecture. Kim et al. [20] describe a design flow that arranges processing elements and interconnect efficiently based on sharing and pipelining of critical resources to provide area and power savings. Sotiropoulou and colleagues [21] describe a DSE flow to determine the optimum memory architecture for FPGA-based multiprocessor systems. Irturk et al. [22] optimize a general purpose architecture to a specific application (matrix inversion) by removing unused resources.

Many researchers present tools that facilitate exploration of design space options. Karuri et al. [23] develop an architecture description language and software support to facilitate comparing different architectures. Chattopadhyay et al. [24] develop a description language and software support for specification of CGRAs. Bauer and his colleagues [25] focus on providing the designer with accurate simulations for comparison. A number of researchers [26, 27] present tools to streamline the evaluation of architectural variations in the ADRES CGRA. Sun and colleagues [28] discuss cost estimates that trade off power, area, and delay. Miramond et al. [29] focus on optimization of hardware/software partitioning and scheduling, and Clark et al. [30] focus on optimizing instruction set customization.

Our DSE framework takes into account the impact of varying different design parameters, such as the interconnect cardinality (number of inputs/fan-in of multiplexers), dedicated vertical routes, and number of operations supported by arithmetic and logic units (ALUs), on the power and performance of the device. The impact of interconnect cardinality on power was studied in [9]. The benefit of adding dedicated vertical routes, called dedicated pass gates, into the architecture to prevent functional units from being used as routing was also examined in [9]. The effect of the number of operations supported by ALUs on power was described in [12]. The DSE tool uses multiplexer cardinality, pass gates, and number of operations supported by ALUs as parameters. It proceeds to optimize an architecture by monotonically reducing multiplexer cardinality, reducing ALU operations, and increasing dedicated pass gates, either in a stepped optimization or as guided by a simple energy model, until a solution estimated to have minimum energy is found. Our approach is applicable in particular to stripe-based architectures such as PipeRench, Kilocore, etc. [6, 7]. However, it may also be extended to other systems where high-level graph structures may be retained during CAD.

We are also motivated by research on Application Specific Instruction Set Processors (ASIPs). ASIPs attempt to extend normal processor cores with custom instructions, often in reconfigurable hardware, designed to tailor the processor to a particular domain of applications or even a single application for improved performance. One of the earliest works on this topic is the dynamic instruction set computer (DISC) developed at Brigham Young University [31]. DISC uses partial reconfiguration available in FPGAs to swap custom instructions in and out based on the demand of the application. Cong and his colleagues [32] described a technique to generate the instructions included in an ASIP using pattern matching of input applications. Mbaye and colleagues [33, 34] presented an ASIP approach to accelerate video processing. They applied the concepts of data grouping and data reuse to avoid expensive data communications. A family of ASIPs that can be reconfigured dynamically for a suite of channel coding applications for wireless communications is presented in [35]. Guan et al. [36] described a hierarchical design of an ASIP customized for Fast Fourier Transform to achieve high throughput and flexibility. They reduced the memory access time by incorporating custom register files. Shen et al. [37] describe a video specific instruction set architecture including both single instruction multiple data (SIMD) and custom video specific instructions. Fanucci et al. [38] describe a processor architecture for non-linear image processing algorithms. In terms of the design flows, Brisk et al. [39] describe an optimal polynomial time solution to determine how many registers to include in an ASIP. Finally, Dinh et al. [40] describe a method to use resource sharing in the custom instructions of ASIPs to reduce area and improve performance.

Our work differs in that our DSE framework generates a reconfigurable architecture, which can be customized for the suite of applications it executes in a manner similar to ASIPs. An ASIP flow reads one or more programs into the tool and attempts to identify custom instructions which, when added to the processor instruction set architecture, could improve an execution metric such as power or performance. Similarly, the DSE flow for our fabric takes information from the applications and produces statistics from different instances of the fabric. These statistics are used to make decisions and evaluate tradeoffs to converge on a fabric instance tailored to the applications it executes. The tool generates a tailored architectural instance to reduce power for a given suite of applications. In our research, we are customizing the reconfigurable fabric itself, rather than utilizing a generic reconfigurable fabric to customize a processor.

Portions of this research have been published previously in [41]. This manuscript introduces a gradient based optimization with an improved energy model and provides many new figures and comparisons to evaluate the proposed techniques. In fact, development of the gradient based optimization and the new energy model, and analysis of its results, allowed us to set a better threshold and thus obtain better results for the previously published stepped optimization approach.

3 Background

This section provides background material about SDFGs and the domain-specific CGRA target used in this research. It also describes how DFGs can be mapped onto the CGRA. An example DFG is mapped onto a stripe-based CGRA with and without dedicated pass gates in the architecture. The Fabric Instance Model (FIM), which is used to describe the interconnect and the layout and make-up of the ALUs in the fabric, is also explained in this section.

Fig. 1 Software code and DFG showing control flow in ADPCM encoder

3.1 Super data flow graphs

As in many synthesis flows, the portion of code to be converted into hardware is first converted into a Control Data Flow Graph (CDFG) representation. A CDFG representation consists of a set of basic blocks interconnected by control flow edges. The details of the creation of a CDFG representation from a high level language can be found in [42]. These control flow edges are created by control statements within the code, such as loop boundaries or conditional statements. Using hardware predication, these control dependencies can be converted into data dependencies [43]. This post-predication CDFG is referred to as a Super Data Flow Graph (SDFG). For example, a conditional statement, such as an if-then-else C code segment, is implemented as a multiplexer acting as a binary switch to predicated output datapaths. In software, an if-then-else statement is implemented as a stream of six instructions composed of comparisons and branch statements. Software code and a DFG for a 2:1 multiplexer equivalent are shown in Fig. 1.
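The effect of predication on an if-then-else can be illustrated with a small sketch. The Python fragment below is only an illustration of the idea, not the SuperCISC transformation itself; the function names mux2, branchy, and predicated, and the absolute-difference example, are hypothetical.

```python
def mux2(sel, a, b):
    """2:1 multiplexer: returns a when sel is true, otherwise b."""
    return a if sel else b


def branchy(x, y):
    # Control-flow form: only one arm executes, chosen by a branch.
    if x > y:
        return x - y
    else:
        return y - x


def predicated(x, y):
    # Data-flow form: both arms are computed unconditionally and the
    # comparison result becomes the select input of a multiplexer.
    sel = x > y
    then_val = x - y
    else_val = y - x
    return mux2(sel, then_val, else_val)
```

Both forms return the same value for every input; the predicated form has no control edge, so it can be drawn as a pure data flow graph with a multiplexer node.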

The performance of the hardware function using the SDFG approach is superior to traditional techniques that would place sequential steps at control boundaries, due to cycle compression. Rather than enforcing sequential stopping points in the computation with registered boundaries, the execution flows through the SDFG limited only by the latencies of the functional units in the critical path. Thus, SDFG based hardware functions operate asynchronously from the processor core. Removing the sequential logic also makes the hardware fabric for implementing the SDFG much simpler. The control flow can be implemented on a processor, but in this paper we focus only on implementing SDFGs of the benchmarks on a domain-specific coarse-grained reconfigurable fabric. Our work is focused on customizing the reconfigurable fabric for a suite of signal and image processing applications to generate an energy-efficient fabric architecture.

3.2 Domain specific fabric

Stripe-based hardware fabrics are designed to easily map DFGs from the application onto the device. The domain-specific fabric works in a similar way, retaining a data flow structure allowing computational results to be computed in one ALU and flow onto others in the system. As shown in Fig. 2, ALUs are organized into rows or computational stripes within which each functional unit operates independently. The results of these ALU operations are then fed into interconnection stripes constructed using multiplexers.

Fig. 2 The fabric model is comprised of Arithmetic and Logic units and a reconfigurable interconnect

Fig. 3 The multiplexer-based interconnection stripe structure

Fig. 4 An example of interconnect: 8:1 multiplexer interconnect

A detailed diagram of the interconnection stripes is shown in Fig. 3. In addition to determining how many wires are available for routing, the cardinality of the multiplexer determines the maximum fan-in and fan-out possible in the DFG to map onto the structure. An 8:1 multiplexer-based interconnect is shown in Fig. 4. In this configuration, there are eight possible locations to read the input operands from the previous row.


Fig. 5 Problem graph: partial butterfly graph

Fig. 6 Problem graph from Fig. 5 mapped on 8:1 interconnect with 25 % dedicated pass gates

3.3 Mapping of applications onto domain-specific reconfigurable fabric

A mapping of a DFG onto a reconfigurable fabric consists of an assignment of operators in the DFG to ALUs in the reconfigurable fabric such that the logical structure of the DFG is preserved and the architectural constraints of the fabric are followed. This mapping problem is critical to the use of the fabric because a mapping solution must be available each time the fabric is reprogrammed for a specific DFG. Because of the layered nature of the fabric, the mapping is also allowed to use ALUs as pass-gates, which take a single input and pass the input value to one or more outputs. In general, not all of the available ALUs and edges will be used. An example DFG is shown in Fig. 5. It is the partial butterfly graph that appears frequently in signal and image processing applications. This graph is extracted from the inverse discrete cosine transform of MPEG II video compression, but it appears in many applications of video and image compression. It also appears within fast Fourier transforms (FFTs), which are commonly used in signal processing. Between the first and second rows, this graph consists of three pairs of nodes (A, B), (C, D), and (E, F) that communicate with three other pairs of nodes (G, H), (I, J), and (K, L) in the subsequent stage. However, between the second and third stages, nodes G-L are grouped into three different pairs (G, I), (H, K), and (J, L), which communicate with three new pairs in the third stage (M, N), (O, P), and (Q, R).
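The two requirements on a mapping (preserve the DFG structure, respect the fabric constraints) can be written as a simple validity check. The sketch below is a simplified illustration, not the heuristic mapper of [45]. It assumes that every edge connects adjacent rows (longer routes would appear as explicit pass nodes), that each operand may read from column offsets between left and right inclusive, and that within a butterfly pair each node feeds both nodes of the next pair. The function name and data layout are hypothetical.

```python
def mapping_is_valid(edges, place, left=-3, right=4):
    """Check a placement of DFG nodes against row and interconnect limits.

    edges: list of (source, destination) node names
    place: dict mapping node name -> (row, column)
    left, right: inclusive column offsets an operand can read from,
                 measured as source column minus destination column
    """
    # No two DFG nodes may share the same functional unit.
    if len(set(place.values())) != len(place):
        return False
    for src, dst in edges:
        src_row, src_col = place[src]
        dst_row, dst_col = place[dst]
        # Data flows from one stripe to the next stripe below it.
        if dst_row != src_row + 1:
            return False
        # The source must lie within the multiplexer's reach.
        if not left <= src_col - dst_col <= right:
            return False
    return True


# First stage of the partial butterfly for one pair: (A, B) feeds (G, H).
edges = [("A", "G"), ("A", "H"), ("B", "G"), ("B", "H")]
place = {"A": (0, 0), "B": (0, 1), "G": (1, 0), "H": (1, 1)}
```

With the default range of -3 to 4 (the values used for the 8:1 interconnect in the FIM example of Fig. 9), this placement is accepted; moving H far to the right would violate the reach of the multiplexer.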

Figure 6 shows the DFG from Fig. 5 mapped onto the architecture with 8:1 multiplexer-based interconnect and 25 % dedicated pass gates. The ALUs are shown as circles and the dedicated pass gates are shown as squares. An ALU can also be used for a pass operation, shown in red. The dedicated pass gate that is being used in this example graph is shown in green. The empty circles and squares shown in white are idle. The impact on power consumption of using dedicated pass gates (which simply route data vertically from one row to the next) in a reconfigurable architecture has been explored in [9]. A dedicated pass gate consumes much less power than an ALU used for a pass operation. The dedicated pass gate can also be set to an idle state when not in use. For vertical routing, we varied the percentage of dedicated pass-gates at levels of 25 % (1 out of 4), 33 % (1 out of 3), and 50 % (1 out of 2).¹

<rowpattern repeat="forever">
  <row>
    <ftupattern repeat="forever">
      <FTU type="alu0">
        <operand number="0">
          <range left="-2" right="1"/>
        </operand>
        <operand number="1">
          <range left="-1" right="2"/>
        </operand>
        <operand number="2">
          <range left="-1" right="2"/>
        </operand>
      </FTU>
    </ftupattern>
  </row>
</rowpattern>

Fig. 7 FIM file example for 5:1 style interconnect

3.4 Fabric instance model

The Fabric Instance Model (FIM) is a textual representation that describes the interconnect and the layout and make-up of the ALUs in the device. The FIM file is written in the Extensible Markup Language (XML) [44]. XML was selected as it allowed the FIM specification to evolve easily as new features and descriptions were required. For example, while the FIM was initially envisioned to describe the interconnect only, it has evolved to describe dedicated pass-gates and other heterogeneous ALU structures.

Figure 7 shows an example partial FIM file that describes a 5:1 multiplexer-based interconnect. The 5:1 interconnect is shown in Fig. 8. The pattern repeats the interconnect structure for alu0, whose operand 0 can read from two units to the left and one unit to the right, and operand 1 is the mirror. Operand 2 is the selection bit if the ALU is configured as a multiplexer and follows operand 1. The ranges in the FIM can be discontinuous by supplying additional range flags. The file can contain a heterogeneous interconnect by defining additional functional units (FUs) with different interconnect ranges. The pattern can either repeat or be arbitrarily customized without a repeating pattern for a fixed size fabric. In the FIM file, the types of operations supported by each ALU can also be specified. Figure 9 shows an example partial FIM file that describes a fabric having an 8-operation ALU, 8:1 interconnect, and 50 % dedicated pass gates.
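To make the meaning of the range elements concrete, the following Python sketch reads the fragment of Fig. 7 and lists the column offsets each operand can read. It is an illustration only and is not part of the authors' tool flow; the function name operand_offsets is hypothetical, and the sketch assumes that left and right are inclusive offsets relative to the ALU's own column, which matches the description of operand 0 reading from two units to the left and one unit to the right.

```python
import xml.etree.ElementTree as ET

# The partial FIM file of Fig. 7, reproduced as a string.
FIM = """
<rowpattern repeat="forever"><row><ftupattern repeat="forever">
  <FTU type="alu0">
    <operand number="0"><range left="-2" right="1"/></operand>
    <operand number="1"><range left="-1" right="2"/></operand>
    <operand number="2"><range left="-1" right="2"/></operand>
  </FTU>
</ftupattern></row></rowpattern>
"""


def operand_offsets(fim_text):
    """Map (FTU type, operand number) -> sorted readable column offsets."""
    root = ET.fromstring(fim_text)
    reach = {}
    for ftu in root.iter("FTU"):
        for operand in ftu.iter("operand"):
            offsets = set()
            # Several <range> children would describe a discontinuous range.
            for rng in operand.iter("range"):
                lo, hi = int(rng.get("left")), int(rng.get("right"))
                offsets.update(range(lo, hi + 1))  # assumed inclusive
            key = (ftu.get("type"), int(operand.get("number")))
            reach[key] = sorted(offsets)
    return reach


reach = operand_offsets(FIM)
```

Under the inclusive-range assumption, each operand reads from four positions, and operands 0 and 1 together span five distinct columns (offsets -2 through 2).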

¹ We note here that alternatives to this arrangement of dedicated pass gates are possible. In particular, we could provide dedicated routes in conjunction with each ALU to allow that ALU to be bypassed. However, we found such an arrangement to be expensive within the context of our design space, due to the need for additional multiplexers, and we do not consider it here. Instead, we search for the most efficient proportion of dedicated routes to provide, hence keeping the number of additional multiplexers to the minimum that provides us with energy gains vs. energy expense.


Fig. 8 Schematic for a 5:1 multiplexer built using 4:1 multiplexers

<ftudefine name="alu[0]" useic="false" noop="000" type="alu">
  <op code="001" commutative="true"> + </op>
  <op code="010" commutative="false"> - </op>
  <op code="011" commutative="true"> * </op>
  <op code="100" commutative="false"> >> </op>
  <op code="101" commutative="false"> mux </op>
  <op code="110" commutative="false"> pass </op>
  <op code="111" order="reverse" commutative="false"> pass </op>
  <op code="000" commutative="true"> noop </op>
</ftudefine>

<ftudefine name="pass" useic="false" noop="0" type="pass">
  <op code="0" commutative="true"> noop </op>
  <op code="1" commutative="false"> pass </op>
</ftudefine>

<rowpattern repeat="forever">
  <row>
    <ftupattern repeat="forever">
      <FTU type="alu[0]">
        <operand number="0"><range left="-3" right="4"/></operand>
        <operand number="1"><range left="-3" right="4"/></operand>
        <operand number="2"><range left="-3" right="4"/></operand>
      </FTU>
      <FTU type="pass">
        <operand number="0"><range left="-3" right="4"/></operand>
      </FTU>
    </ftupattern>
  </row>
</rowpattern>

Fig. 9 FIM file example for an 8 ops ALU, 8:1 interconnect with 50 % dedicated pass gates


The FIM file is used to automatically generate the VHDL for the fabric instance described by the FIM. The FIM file shown in Fig. 9 directs the fabric generator to generate a heterogeneous ALU stripe with ALUs and dedicated pass gates. The fabric generator also generates VHDL code for an ALU that supports only the operations defined in the FIM file. Each dedicated pass gate can act as a pass gate or a noop. The ALU and pass gate pattern repeats in the VHDL code as defined in the FIM file. The generated VHDL code has two ALUs (A) and two pass gates (P) in the APAP pattern. The fabric instance VHDL is then synthesized using commercial tools such as Synopsys Design Compiler to generate a netlist tied to ASIC standard cells.
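The relationship between the pass-gate proportion and the repeating stripe pattern can be sketched in a few lines of Python. This is an illustration, not the fabric generator; the function name stripe_pattern is hypothetical, and placing the pass gate last in each group is an assumption that agrees with the APAP pattern for 50 % but is not stated in the text for the 25 % and 33 % cases.

```python
def stripe_pattern(width, period):
    """Return a stripe layout string of 'A' (ALU) and 'P' (pass gate).

    One unit out of every `period` units is a dedicated pass gate,
    so period=2 gives 50 %, period=3 gives 33 %, period=4 gives 25 %.
    """
    if period < 2:
        raise ValueError("period must be at least 2")
    return "".join(
        "P" if (i + 1) % period == 0 else "A" for i in range(width)
    )


def pass_gate_fraction(pattern):
    """Fraction of units in the stripe that are dedicated pass gates."""
    return pattern.count("P") / len(pattern)
```

For example, stripe_pattern(4, 2) yields the APAP arrangement described above, and a period of 4 gives one dedicated pass gate per four units.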

4 Architectural space exploration studies

In order to conduct architectural exploration case studies, we selected a set of core signal processing benchmarks from the MediaBench benchmark suite, including the ADPCM encoder (enc), ADPCM decoder (dec), GSM channel encoder (gsm), and the MPEG II decoder (row, col). We added the Sobel (sob) and Laplace (lap) edge detection algorithms to the benchmark suite. Using the SuperCISC compilation flow [43], computationally intensive kernels were extracted from the above mentioned signal and image processing applications and converted into SDFGs (Sect. 3.1). We analyzed the SDFGs of the signal and image processing applications and identified the operations that need to be supported in the CGRA. Figure 10 shows the number of additions, multiplications, etc. that appear in the benchmark applications. Each ALU also supports NOOP and pass gate operations. Table 1 shows the number of operations contained in the benchmarks. Operations include only regular arithmetic, logic, and shift operations such as addition, multiplication, AND, OR, right-shift, etc.

Fig. 10 ALU operations for the benchmark suite


Table 1 Number of operations in DFGs of the benchmarks

             enc  dec  row  col  gsm  sob  lap
Operations    36   29   52   61   29   24   29

5 Design space exploration flow

Exploring the design space manually would be very time consuming and may not even be feasible for large designs. Mapping an application or a benchmark onto a customized architecture might require time ranging from minutes to hours, and traditional algorithms such as Simulated Annealing might even fail [5]. Running a full power simulation on a complete suite of benchmarks for various architectural options requires several days. Finding an optimal point in a design space could therefore require a very long time. In this section, we describe a framework/tool that we have designed to make such DSE feasible for these scenarios. We make the following key decisions for our framework:

– We choose two fast greedy mapping algorithms (stepped-search and gradient-based optimization) with demonstrated good performance.

– We develop heuristics to approximate power consumption and evaluate the quality of these heuristics.

– We also compare the ability of the above mentioned fast greedy mapping techniques to identify an optimum point in the design space.

The resulting framework allows us to test a wide variety of algorithms and architectural options in minutes rather than several days. It also gives us the ability to rapidly explore alternative architecture choices. The DSE flow for the domain specific fabric is shown in Fig. 11. The computationally intensive kernels of the signal and image processing benchmarks were extracted and converted into SDFGs (Sect. 3.1) using the SuperCISC compilation flow [43]. The tool makes a selection of interconnect, layout of vertical routes, and number of ALU operations using one of the algorithms described later in this section (the stepped search optimization algorithm or the gradient-based search algorithm) and generates a FIM file describing the resulting fabric, as described in Sect. 3.4. The SDFGs are then mapped to the fabric architecture described by the FIM using the heuristic mapper. The details of the heuristic mapper can be found in [45]. The heuristic mapper does not produce a globally optimal mapping, but it was chosen because of its fast execution time. Detailed power simulations are then run for the architecture and mappings produced.

Fig. 11 DSE flow for the domain specific fabric

5.1 Stepped search optimization algorithm

We tested two algorithms for exploring the design space automatically. The first algorithm was motivated by our discoveries while manually exploring architectural design choices. We found that multiplexer cardinality affected results the most, followed by percentage of dedicated pass gates, followed by number of ALU operations. Informally, an interconnect that is more complex than required is costly in terms of both energy and area; using ALUs to route data is wasteful, but only moderately expensive compared to interconnect; and reducing the number of ALU operations can give savings at a slightly smaller scale. In our manual explorations, interconnect adjustments accounted for 50 % of our final savings, optimizing dedicated pass gates accounted for 25 %, and reducing ALU operations accounted for 25 %.

Following the results of our manual studies, we chose as our first algorithm a stepped algorithm that optimizes interconnect, followed by percentage of dedicated pass gates, followed by number of ALU operations. The algorithm is given as Algorithm 1. The three stages are as follows:

– reducing multiplexer cardinality (C) to next power of 2^n + 1, 2^n where n = 5, 4, . . . , 1,

– increasing the number of dedicated pass gates (D) where D = 0 %, 25 %, 33 %, 50 %, . . . , 75 %, and

– decreasing the number of operations supported by each ALU (O). When we reduce the number of operations supported per ALU, we take into account the frequency of their occurrence in the benchmarks as shown in Fig. 10. The operations that appear least frequently are taken out first. We make sure that we support the ALU operations that are needed by the applications more frequently.

Algorithm 1 Stepped search algorithm for DSE
  while average path length increase (pli) < threshold do
    Reduce multiplexer cardinality (C) to next power of 2^n + 1, 2^n where n = 5, 4, . . . , 1.
    Map applications to the fabric using heuristic mapper.
    Determine average path length increase.
  end while
  Revert to last C where average pli < threshold.
  while average pli < threshold do
    Increase the number of dedicated pass gates (D) where D = 0 %, 25 %, 33 %, 50 %, . . . , 75 %.
    Map applications to the fabric using heuristic mapper.
    Determine average path length increase.
  end while
  Revert to last percentage of D where average pli < threshold.
  while average pli < threshold do
    Reduce the number of operations supported by each ALU (O) by one.
    Map applications to the fabric using heuristic mapper.
    Determine average path length increase.
  end while
  Revert to last number of ALU operations O where average pli < threshold.
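The three stages of the stepped search can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the tool itself: avg_pli is a hypothetical callback standing in for "map the applications with the heuristic mapper and determine the average path length increase", the starting configuration is assumed to be acceptable, and the parameter names are invented for the example.

```python
# Cardinality schedule 2^n + 1, 2^n for n = 5, 4, ..., 1.
CARDINALITIES = [2 ** n + k for n in range(5, 0, -1) for k in (1, 0)]


def last_below_threshold(candidates, evaluate, threshold):
    """Walk the candidates in order and return the last acceptable one.

    Stops at the first candidate whose score reaches the threshold and
    reverts to the previous candidate. The first candidate is assumed
    to be acceptable.
    """
    best = candidates[0]
    for cand in candidates[1:]:
        if evaluate(cand) >= threshold:
            break
        best = cand
    return best


def stepped_search(avg_pli, threshold, cardinalities, pass_levels,
                   max_ops, min_ops=1):
    """Return (C, D, O) chosen stage by stage.

    avg_pli(c, d, o) is a hypothetical callback returning the average
    path length increase for cardinality c, pass-gate percentage d and
    o operations per ALU.
    """
    # Stage 1: reduce multiplexer cardinality.
    c = last_below_threshold(
        cardinalities, lambda x: avg_pli(x, pass_levels[0], max_ops),
        threshold)
    # Stage 2: increase the percentage of dedicated pass gates.
    d = last_below_threshold(
        pass_levels, lambda x: avg_pli(c, x, max_ops), threshold)
    # Stage 3: remove ALU operations one at a time.
    ops = list(range(max_ops, min_ops - 1, -1))
    o = last_below_threshold(
        ops, lambda x: avg_pli(c, d, x), threshold)
    return c, d, o
```

Each stage fixes one parameter before the next stage begins, so the number of mapping runs grows with the sum of the candidate list lengths rather than their product.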

To determine when to halt at each stage of the stepped algorithm, we require an estimate of energy consumption for the resulting design. The trend that is observed is that initial adjustments to each parameter will reduce energy until a point of inherent difficulty is reached, at which point the architecture becomes energy inefficient. It is too expensive to run detailed power simulations in the inner loop of our tool to find this point of inefficiency, and so we require a proxy for true energy consumption.

5.1.1 A proxy for estimating energy

From our manual design space case studies, we determined that the two factors that affect the energy consumption of the device are: (1) the increase in the total path length of the mapped application onto the device, and (2) the number of ALUs used as pass-gates. The details of our manual design space case studies can be found in [8, 9, 12]. The total path length in the mapped design is the sum of the number of rows traversed from each input to each output. Intuitively, when an architecture becomes overconstrained, such that it is difficult to map benchmark algorithms onto an architecture, the mapper will compensate, making the mapping problem easier by adding rows to the fabric, increasing the total path length, introducing additional delays and using additional resources to pass data, all of which increase energy requirements.
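One possible reading of the total path length definition is sketched below in Python. It is an interpretation, not the authors' implementation: it assumes that only connected input/output pairs are counted and that, because data moves one row at a time, the rows traversed between an input and an output equal the difference of their row indices. All names are hypothetical.

```python
def total_path_length(edges, row, inputs, outputs):
    """Sum of rows traversed from each input to each output it reaches.

    edges:   list of (source, destination) nodes in the mapped design,
             including any pass nodes inserted by the mapper
    row:     dict mapping node -> row index in the fabric
    inputs:  nodes where data enters; outputs: nodes where it leaves
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    def reachable(start):
        # Depth-first search over the mapped graph.
        seen, stack = {start}, [start]
        while stack:
            for nxt in successors.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    total = 0
    for node in inputs:
        reach = reachable(node)
        total += sum(row[out] - row[node] for out in outputs if out in reach)
    return total


# A two-row mapping, and the same graph after the mapper adds a pass row.
short = total_path_length(
    [("a", "m"), ("m", "z")], {"a": 0, "m": 1, "z": 2}, ["a"], ["z"])
longer = total_path_length(
    [("a", "m"), ("m", "p"), ("p", "z")],
    {"a": 0, "m": 1, "p": 2, "z": 3}, ["a"], ["z"])
```

The second mapping has one extra row of pass-through, so its total path length is larger by one for this single input/output pair; comparing such totals against an unconstrained mapping gives the kind of increase the proxy relies on.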

The number of ALUs used as pass-gates is also useful in judging success, especially in cases where the fabric contains dedicated pass-gates. Dedicated pass-gates are more energy efficient than complex functional units at passing a value (by more than an order of magnitude, as described in [9]). When dedicated pass-gates are used, fewer ALUs must be used as pass-gates, allowing for substantial power savings. However, if the architecture is too constrained for the given benchmarks, then the mapper will again make use of additional rows to help alleviate constraints, frequently using some ALUs as pass-gates to route data.

To demonstrate that these factors influence the energy consumption of the device, we ran a two-way analysis of variance (ANOVA) on the energy, with the number of ALUs used as pass-gates and the path length as factors, to test their relationship with energy. Using an alpha value of 0.05, both factors significantly influenced the energy (p < 0.01 and p = 0.031, respectively), as described in [9]. We used the average path length increase (average pli) as the metric in our stepped search algorithm for simplicity.

Figure 12 shows the results obtained from the DSE tool for various fabric architectures. As the multiplexer cardinality was reduced from 33:1 through 32:1, 17:1, 16:1, and 9:1 to 8:1, the average path length increase stayed at zero. When it reached a 5:1 interconnect, the average path length increase went up to 7.7 because we were restricting the connectivity of each ALU and making the mapping problem more difficult. Since this exceeded the threshold for a 5:1 interconnect, the tool selected an 8:1 interconnect and then started changing the percentage of dedicated pass gates from 0 % to 25 % to 33 % for that interconnect. It then selected an 8:1 interconnect with 25 % dedicated pass gates based on the average path length increase and started reducing the number of operations supported per ALU for that architecture. As the number of ALU operations was reduced, the mapping problem became even more challenging because each ALU could support only a certain number and certain types of operations, and the resulting mappings began to require greater path length and more time to execute. The tool selected an 8:1 interconnect with 25 % dedicated pass gates and 9 operations per ALU as the best candidate that could meet the target design goal. Figure 13 shows the number of ALUs used as pass gates for the various architectures explored with the average pli threshold set to 2. As the number of dedicated pass gates in the fabric architecture increases, fewer ALUs are needed for pass operations.

Fig. 12 Results from the DSE tool for various architectures for threshold value of average pli = 2

Fig. 13 ALUs used as pass gates for various fabric architectures explored for average pli = 2
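The three stages of the stepped search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool itself: the heuristic mapper is replaced by a caller-supplied `avg_pli` function, the starting value of 12 operations per ALU is a placeholder, and `toy_pli` is a synthetic proxy whose values do not come from the experiments reported here.

```python
from typing import Callable, Sequence, Tuple

# (multiplexer cardinality, % dedicated pass gates, operations per ALU)
Arch = Tuple[int, int, int]


def stepped_search(
    cards: Sequence[int],
    pass_pcts: Sequence[int],
    start_ops: int,
    min_ops: int,
    avg_pli: Callable[[Arch], float],
    threshold: float,
) -> Arch:
    """Tighten one parameter at a time, keeping the last value whose
    average pli stays below the threshold (the 'revert' step)."""
    card, pct, ops = cards[0], pass_pcts[0], start_ops

    # Stage 1: reduce multiplexer cardinality.
    for c in cards[1:]:
        if avg_pli((c, pct, ops)) >= threshold:
            break
        card = c

    # Stage 2: increase the percentage of dedicated pass gates.
    for p in pass_pcts[1:]:
        if avg_pli((card, p, ops)) >= threshold:
            break
        pct = p

    # Stage 3: reduce the number of operations per ALU by one at a time.
    for o in range(start_ops - 1, min_ops - 1, -1):
        if avg_pli((card, pct, o)) >= threshold:
            break
        ops = o

    return card, pct, ops


def toy_pli(arch: Arch) -> float:
    """Synthetic stand-in for 'map all benchmarks and average the pli'."""
    card, pct, ops = arch
    return (6.0 if card < 8 else 0.0) + pct / 20.0 + 0.5 * (12 - ops)


CARDS = [33, 32, 17, 16, 9, 8, 5, 4, 3, 2]   # 2^n + 1 and 2^n, n = 5..1
PCTS = [0, 25, 33, 50, 67, 75]

print(stepped_search(CARDS, PCTS, 12, 1, toy_pli, threshold=2.0))
print(stepped_search(CARDS, PCTS, 12, 1, toy_pli, threshold=5.0))
```

Because each stage starts from the values fixed by the previous one, a larger threshold lets the earlier stages go further and leaves less headroom for the later ones, which is why the two calls end at different architectures.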

5.2 Gradient-based search using our proposed energy model

While the stepped optimization shown in the previous section is appealing due to its simplicity, we desired a more general gradient-based search algorithm for our application. In addition, we desired an energy model that was still fast to compute but more clearly connected to the architectural decisions we were making.

Our gradient-based optimization works as shown in Algorithm 2. At each stage, we attempt each of three variations on the current architecture: (i) reduce the multiplexer cardinality (C) to the next value of the form 2^n + 1 or 2^n, where n = 5, 4, . . . , 1, (ii) increase the number of dedicated pass gates (D), where D = 0 %, 25 %, 33 %, 50 %, . . . , 75 %, and (iii) decrease the number of operations supported by each ALU (O). We reduce the number of ALU operations exactly as in Algorithm 1. The new architecture selected is the minimum energy choice from these variations. The algorithm proceeds until no solution better than the existing one is found.

Algorithm 2 Gradient search algorithm for DSE
1: Let E be the current energy estimate
2: Let A be the current architecture
3: Let newArchitecture be TRUE
4: while newArchitecture is TRUE do
5:   newArchitecture = FALSE
6:   Starting from architecture A, reduce multiplexer cardinality (C) to the next power of 2^n where n = 5, 4, . . . , 1 and call the resulting architecture A_C
7:   Estimate energy E_C for architecture A_C
8:   Starting from architecture A, increase the number of dedicated pass gates (D) where D = 0 %, 25 %, 33 %, 50 %, . . . , 75 % and call the resulting architecture A_D
9:   Estimate energy E_D for architecture A_D
10:  Starting from architecture A, reduce the number of operations supported by each ALU (O) by one and call the resulting architecture A_O
11:  Estimate energy E_O for architecture A_O
12:  If (E_C < E), then A = A_C; E = E_C; newArchitecture = TRUE;
13:  If (E_D < E), then A = A_D; E = E_D; newArchitecture = TRUE;
14:  If (E_O < E), then A = A_O; E = E_O; newArchitecture = TRUE;
15: end while
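A compact Python rendering of Algorithm 2 is sketched below. The tuple encoding of an architecture, the helper names, and the floor of one operation per ALU are assumptions of the sketch. The `toy_energy` function is a synthetic bowl-shaped estimate, not the fitted energy model; its minimum was placed at (8, 50, 10) purely so that the result of the example run is easy to check.

```python
from typing import Callable, List, Tuple

# (multiplexer cardinality, % dedicated pass gates, operations per ALU)
Arch = Tuple[int, int, int]

CARDS = [33, 32, 17, 16, 9, 8, 5, 4, 3, 2]   # 2^n + 1 and 2^n, n = 5..1
PCTS = [0, 25, 33, 50, 67, 75]
MIN_OPS = 1                                   # assumed lower bound


def variations(arch: Arch) -> List[Arch]:
    """The (up to) three single-step variations A_C, A_D and A_O."""
    card, pct, ops = arch
    out = []
    i = CARDS.index(card)
    if i + 1 < len(CARDS):
        out.append((CARDS[i + 1], pct, ops))   # A_C: smaller multiplexers
    j = PCTS.index(pct)
    if j + 1 < len(PCTS):
        out.append((card, PCTS[j + 1], ops))   # A_D: more pass gates
    if ops > MIN_OPS:
        out.append((card, pct, ops - 1))       # A_O: one fewer operation
    return out


def gradient_search(start: Arch, estimate: Callable[[Arch], float]):
    """Greedy descent: repeat while some variation lowers the estimate."""
    arch, energy = start, estimate(start)
    new_architecture = True
    while new_architecture:
        new_architecture = False
        # All variations are generated from the same architecture A
        # before any comparison, as in steps 6 to 11.
        candidates = [(estimate(a), a) for a in variations(arch)]
        # Sequential comparisons as in steps 12 to 14; since E is updated
        # along the way, the lowest-energy variation is the one kept.
        for e, a in candidates:
            if e < energy:
                arch, energy, new_architecture = a, e, True
    return arch, energy


def toy_energy(arch: Arch) -> float:
    """Synthetic estimate for illustration only."""
    card, pct, ops = arch
    return 1.0 * abs(card - 8) + 0.1 * abs(pct - 50) + 2.0 * abs(ops - 10)


print(gradient_search((33, 0, 12), toy_energy))
```

Passing the estimate as a callable keeps the search independent of the objective, so a different metric can be substituted without touching the loop.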

An energy model for this search algorithm (Algorithm 2) was created by fitting a simple linear model to results from a small number of experimental runs. Energy was calculated as the product of the power and the delay of the design. To obtain the power and delay, we ran a number of complete power simulations and used the results to fit a simple linear model for energy consumption based on easily measurable features of our mapping for each proposed architecture. For our detailed power simulations, the fabric VHDL was synthesized into a cell-based ASIC design with a feature size of 130 nm using Synopsys Design Compiler. The post-synthesis design was simulated in Mentor Graphics ModelSim to calculate the delay of each design, and these simulations were used as stimulus to the Synopsys PrimeTime-PX tool to estimate the power consumption of the device. The features of interest for our linear model were the number of ALUs performing operations, the number of operations supported by each ALU, the number of ALUs used to route data, and the number of multiplexers of each size. The linear fit of our model to our experimental data is:

Energy Model:

$$E_T = \bigl(C_{AC} - P_{AC}(N_{OB} - N_{OA})\bigr) + (N_{AP} \cdot C_{AP}) + (N_{M32} \cdot C_{M32}) + (N_{M8} \cdot C_{M8}) + (N_{M4} \cdot C_{M4}) + (N_{M2} \cdot C_{M2})$$

where parameters are defined as:

$C_{AC}$: Baseline energy cost of ALUs with all real benchmark operations/computations (e.g., multiply, add, shift, etc.) = 75 pJ
$P_{AC}$: Savings in baseline energy for each fewer operation = 6 pJ
$C_{AP}$: Cost of each ALU used as a pass gate = 0.225 pJ
$C_{M32}$: Cost of a 32:1 multiplexer = 0.416 pJ
$C_{M8}$: Cost of an 8:1 multiplexer = 0.244 pJ
$C_{M4}$: Cost of a 4:1 multiplexer = 0.143 pJ
$C_{M2}$: Cost of a 2:1 multiplexer = 0.083 pJ
$N_{AC}$: Number of ALUs doing computations
$N_{AP}$: Number of ALUs used for pass operation
$N_{M32}$: Number of 32:1 multiplexers
$N_{M8}$: Number of 8:1 multiplexers
$N_{M4}$: Number of 4:1 multiplexers
$N_{M2}$: Number of 2:1 multiplexers
$N_{OB}$: Number of operations required by the benchmark suite
$N_{OA}$: Number of operations supported by each ALU
$E_T$: Total energy consumption of a benchmark

Because dedicated pass gates consume a negligible amount of energy for pass operations, we did not take their energy cost into account in our basic energy model.
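For illustration, the energy model can be evaluated with a few lines of Python. The sketch follows the formula term by term as printed, using the coefficient values listed above; the formula as printed does not use N_AC, so the sketch omits it as well. The function name, argument names, and the counts in the example call are hypothetical.

```python
# Fitted coefficients from the parameter list, in picojoules.
C_AC = 75.0    # baseline ALU cost with all benchmark operations
P_AC = 6.0     # saving per operation removed from the ALU
C_AP = 0.225   # ALU used as a pass gate
C_M32 = 0.416  # 32:1 multiplexer
C_M8 = 0.244   # 8:1 multiplexer
C_M4 = 0.143   # 4:1 multiplexer
C_M2 = 0.083   # 2:1 multiplexer


def energy_pj(n_ob, n_oa, n_ap, n_m32=0, n_m8=0, n_m4=0, n_m2=0):
    """Estimated energy E_T in pJ for one mapped benchmark.

    n_ob: operations required by the benchmark suite (N_OB)
    n_oa: operations supported by each ALU (N_OA)
    n_ap: ALUs used for pass operations (N_AP)
    n_m*: number of multiplexers of each size
    Dedicated pass gates are treated as free, as in the text.
    """
    alu_term = C_AC - P_AC * (n_ob - n_oa)
    pass_term = n_ap * C_AP
    mux_term = n_m32 * C_M32 + n_m8 * C_M8 + n_m4 * C_M4 + n_m2 * C_M2
    return alu_term + pass_term + mux_term


# Hypothetical counts, chosen only to exercise every kind of term.
full = energy_pj(n_ob=12, n_oa=12, n_ap=4, n_m8=10)
reduced = energy_pj(n_ob=12, n_oa=10, n_ap=4, n_m8=10)
print(full, reduced)
```

Removing two operations from the ALU lowers the estimate by 2 x 6 pJ = 12 pJ in this example, which dominates the multiplexer and pass-gate terms at these counts.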

We note that many different objective functions could be considered for our algorithms. For example, one user may wish to emphasize power and another performance. In some circumstances the Energy Delay Product (EDP) may be considered more appropriate to optimize. More accurate energy models may be available than our linear fit to experimental data. Any of these changes is trivial in our framework. All that is required is that the metric that we wish to optimize can be estimated directly from the mapping of benchmarks onto a proposed architecture.

The energy results obtained using our gradient-search algorithm and our simple energy model are shown in Fig. 14. For each design point, we computed the energy of the fabric architecture using our energy model. For each architecture shown along the X-axis, we obtain three data points corresponding to reducing the cardinality of the interconnect (red squares), reducing the number of ALU operations (black crosses), and increasing the number of dedicated pass gates (green triangles). The data points that correspond to the minimum energy architectural option for each architecture considered along the X-axis are shown as blue diamonds and connected by a blue line. Using this criterion, the tool selected the 8:1 architecture with 10 operations per ALU and 50 % dedicated pass gates as the best architecture in terms of energy consumption.

Recall our initial discussion that the number of ALUs used as pass gates and the path length increase are correlated with energy usage and were considered as proxies for energy in our first set of experiments using Algorithm 1. We can evaluate these proxies by plotting their values for each architecture tested in our gradient-based search. We expect the lowest energy solution in general to have the best values for these proxies as well.

Fig. 14 Energy-guided search results

Figure 15 shows the number of ALUs used as pass gates for the various architectures explored using the energy-guided search. For each architecture shown along the X-axis, we obtain three data points corresponding to reducing the cardinality of the interconnect (red squares), reducing the number of ALU operations (black crosses), and increasing the number of dedicated pass gates (green triangles). The data points that correspond to the minimum energy architectural option for each architecture considered along the X-axis are shown as blue diamonds and connected by a blue line. The results in Fig. 15 indicate that the number of ALUs used as pass gates is not ideal for use as an energy proxy. In the majority of cases, this proxy would have chosen to increase the number of dedicated pass gates rather than the alternatives that have lower energy cost based on our model. This tendency is not surprising in retrospect, as increasing dedicated pass gates almost always reduces the number of ALUs used as pass gates.

Using the same color scheme, we show the average path length increase for the various fabric architecture design options examined here in Fig. 16. The results in Fig. 16 suggest that path length increase is a better proxy for energy consumption than the number of ALUs used as pass gates. The path length increase metric almost always selects the same next architecture as our energy model. The one exception is at 8:1, 25 %, homogeneous ALUs, where the path length increase metric would choose to reduce the number of ALU operations instead of increasing the number of dedicated pass gates. We can see from Fig. 14, however, that the energy estimates for these two cases are nearly identical, and so the choice is in fact a toss-up.


Fig. 15 ALUs used as pass gates for various fabric architectures explored in Energy-guided search

Fig. 16 Average Path Length Increase for various fabric architectures explored in Energy-guided search. Note that all of the variations in architecture that increased energy generated a PLI of 5 or greater (shown in circled data points), while those that decreased energy resulted in a PLI less than 5


Fig. 17 Results from the DSE tool for various architectures for threshold value of average pli = 5

However, interestingly, the results shown in Fig. 16 also suggest that our choice for PLI threshold in the stepped optimization could have been better. In fact, these results strongly suggest choosing a PLI threshold of 5. Note that all of the variations in architecture that increased energy generated a PLI of 5 or greater (shown in circled data points), while those that decreased energy resulted in a PLI less than 5.

Figure 17 shows the results obtained from the DSE tool for various fabric architectures when using Algorithm 1 with the threshold for the average path length increase set to 5. In the beginning, the search results match the ones we obtained using average pli = 2. After the tool selected an 8:1 interconnect, it started changing the percentage of dedicated pass gates from 0 % to 25 %, 33 %, 50 %, and 67 % for that interconnect. It then selected an 8:1 interconnect with 50 % dedicated pass gates based on the average path length increase and started reducing the number of operations per ALU for that architecture. Reducing the number of operations supported by each ALU made the mapping problem even more difficult because each ALU can support only a certain number and certain types of operations. The tool selected an 8:1 interconnect with 50 % dedicated pass gates and 10 operations per ALU as the best candidate that could meet the target design goals, which is identical to the architecture selected by the gradient-based search. Figure 18 shows the number of ALUs used as pass gates for the various architectures explored with the average pli threshold set to 5.

5.3 Comparison of architectures selected by manual and automated design case studies

Prior to development of the DSE tool described in this manuscript, a number of manual design studies were performed. These manual studies were performed over considerable time (many months). Testing each new architecture design required weeks of setup and simulation. At the end of the process, we were confident that we had selected the best architecture design that was feasible in this (lengthy) time frame. Our design choice was the fabric architecture with 10 operations per ALU, 8:1 interconnect and 33 % dedicated pass gates [9].

Fig. 18 ALUs used as pass gates for various fabric architectures explored for average pli = 5

Our motivation with the DSE tool described in this paper was to cut this lengthy process from days (or months) to minutes or hours and obtain nearly equivalent results. To evaluate how well we achieve this goal, we compare results from our manual study and our automated searches.

Figure 19 compares the energy consumption of the architecture selected from the manual design space case studies with the architectures selected by the DSE tool for average path length increase thresholds of 2 and 5. The architecture selected by the framework for pli = 5 (identical to that selected by the gradient-based optimization) consumes almost the same energy as the architecture picked by the manual design case studies.

Figure 20 compares the delay of the architectures selected from manual studies and the DSE tool. On average, the delays are nearly identical.

Finally, Fig. 21 compares the power consumption of the architectures selected from manual studies and the tool. As expected based on the energy and delay results, the architecture selected by the framework for pli = 5 (identical to that selected by the gradient-based optimization) requires almost the same power as the architecture picked by the manual design case studies.

Our stepped search and gradient-based optimization algorithms presented here are greedy. As such, they are not guaranteed to find a global optimum. However, to further evaluate the architecture selections of the two algorithms, we perform a sensitivity test around the final result, testing estimated energy values for a broad range of architectures. In total, 134 architectural instances were tested by varying the cardinality of the interconnect, the percentage of dedicated pass gates, and the number of operations supported per ALU. For each instance, energy was computed and plotted. The outcome of this sensitivity test is shown in Fig. 22. As the figure shows, for our benchmark suite, we can expect that the architecture selected by the DSE tool is very close to the global optimum.

Fig. 19 Energy comparison between the manual architectural solution and the architectural solutions generated by the tool

Fig. 20 Delay comparison between the manual architectural solution and the architectural solutions generated by the tool

Fig. 21 Power comparison between the manual architectural solution and the architectural solutions generated by the tool

Fig. 22 Sensitivity test showing a broad range of architectures explored
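A sensitivity test of this kind amounts to an exhaustive sweep over a grid of architectures, followed by a comparison of the greedy result against the best point found. The Python sketch below shows the idea under stated assumptions: the grid is a full Cartesian product (not the 134 instances tested here), the range of operations per ALU is a placeholder, and `toy_energy` is a synthetic estimate rather than the fitted model.

```python
from itertools import product

CARDS = [33, 32, 17, 16, 9, 8, 5, 4, 3, 2]
PCTS = [0, 25, 33, 50, 67, 75]
OPS = range(6, 13)   # placeholder range of operations per ALU


def toy_energy(arch):
    """Synthetic estimate for illustration only."""
    card, pct, ops = arch
    return 1.0 * abs(card - 8) + 0.1 * abs(pct - 50) + 2.0 * abs(ops - 10)


def sweep(estimate):
    """Evaluate every grid point; return (energy, arch) pairs, best first."""
    return sorted((estimate(a), a) for a in product(CARDS, PCTS, OPS))


def compare(selected, results):
    """Rank of the selected architecture and its energy gap to the best."""
    archs = [a for _, a in results]
    rank = archs.index(selected)
    gap = results[rank][0] - results[0][0]
    return rank, gap


results = sweep(toy_energy)
rank, gap = compare((8, 50, 10), results)
print(len(results), rank, gap)
```

A rank of 0 and a gap of 0.0 mean that the greedy choice coincides with the best grid point for this synthetic estimate; with a real estimate, a small nonzero gap would indicate a near-optimal selection.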

6 Conclusion and future work

Designing a sophisticated reconfigurable computing platform that meets the competing needs of an application domain is extremely challenging, as it requires the evaluation of many potential architectural options to select an optimum solution. Exploring the design space manually would be very time consuming and may not even be feasible for large system-on-chip designs. We have designed a framework that makes such DSE feasible. The resulting framework allows testing a family of algorithms and architectural options in minutes rather than days and allows rapid selection of architectural choices. We expect the overall design flow of the system to be useful for a wide range of architectures, including mesh-based and other commonly used architectures for CGRAs. We used a stripe-based architecture for our case studies. The tool generates an architectural instance tailored to the needs of a given suite of applications in order to reduce power. It does so by monotonically reducing multiplexer cardinality and ALU operations and increasing dedicated pass gates, either in a stepped optimization or as guided by a simple energy model, until a solution estimated to have minimum energy is found. We compare the energy consumption of the architecture selected from manual design space case studies with the architectures selected by the DSE tool using stepped search and gradient search with the energy model. The architecture having 10 operations per ALU, 8:1 interconnect, and 33 % dedicated pass gates consumes the least energy. The architecture selected by the gradient search using our energy model is identical to the architecture selected by the stepped search with average pli = 5, and this architecture consumes almost the same energy as the manually selected architecture. Our plan for future research is to develop an improved energy model and broaden the set of architectural choices considered by our tool.

References

1. Monaghan S, Cowen C, Noakes PD (1993) Using fpgas to implement reconfigurable dsp architectures. In: IEE colloquium on field programmable gate arrays - technology and applications

2. Fawcett BK (1995) Fpgas in reconfigurable computing applications. In: WESCON

3. Kramberger I (1999) Dsp acceleration using a reconfigurable fpga. In: Proc of IEEE international symposium on industrial electronics

4. Katona M, Krajacevic Z, Teslic N, Kovacevic V (2005) Signal processing algorithms implementation with fpgas. In: 7th international conference on telecommunications in modern satellite, cable and broadcasting services 2005, vol 1, pp 127–130. doi:10.1109/TELSKS.2005.1572078

5. Baz M (2008) Optimization of mapping onto a flexible low-power electronic fabric architecture. PhD Dissertation, University of Pittsburgh

6. Levine B, Schmit H (2002) Piperench: power and performance evaluation of a programmable pipelined datapath. In: Presented at hot chips, vol 14

7. Levine B (2005) Kilocore: scalable, high-performance, and power efficient coarse-grained reconfigurable fabrics. In: International symposium on advanced reconfigurable systems

8. Mehta G, Stander J, Lucas J, Hoare RR, Hunsaker B, Jones AK (2006) A low-energy reconfigurable fabric for the supercisc architecture. J Low Power Electron 2(2):148–164

9. Mehta G, Stander J, Baz M, Hunsaker B, Jones AK (2009) Interconnect customization for a hardware fabric. ACM Trans Design Autom Electron Syst 14(1):11, 32 pages, doi:10.1145/1455229.1455240

10. Mehta G, Hoare RR, Stander J, Jones AK (2006) Design space exploration for low-power reconfigurable fabrics. In: Proc of the reconfigurable architectures workshop (RAW)

11. Mehta G, Stander J, Baz M, Hunsaker B, Jones AK (2007) Interconnect customization for a coarse-grained reconfigurable fabric. In: Proc of the IPDPS reconfigurable architecture workshop (RAW), pp 165.1–165.8

12. Mehta G, Ihrig CJ, Jones AK (2008) Reducing energy by exploring heterogeneity in a coarse-grain fabric. In: Proc of the IPDPS reconfigurable architecture workshop (RAW)

13. Benoit P, Sassatelli G, Torres L, Demigny D, Robert M, Cambon G (2003) Metrics for reconfigurable architectures characterization: remanence and scalability. In: IEEE IPDPS reconfigurable architecture workshop

14. Enzler R, Jeger T, Cottet D, Troster G (2000) High-level area and performance estimation of hardware building blocks on FPGAs. In: Field-programmable logic and applications forum on design language

15. Bilavarn S, Gogniat G, Philippe JL, Bossuet L (2003) Fast prototyping of reconfigurable architectures from a C program. In: IEEE symposium on circuits and systems

16. Zabel M, Kohler S, Zimmerling M, Preuber T, Spallek R (2005) Design space exploration of coarse-grain reconfigurable dsps. In: International conference on reconfigurable computing and FPGAs. ReConFig 2005, pp 8–15. doi:10.1109/RECONFIG.2005.15



17. Mehdipour F, Noori H, Zamani M, Inoue K, Murakami K (2008) Design space exploration for a coarse grain accelerator. In: Design automation conference, 2008. ASPDAC 2008. Asia and South pacific, pp 685–690. doi:10.1109/ASPDAC.2008.4484039

18. Shehan B, Jahr R, Uhrig S, Ungerer T (2010) Reconfigurable grid alu processor: optimization and design space exploration. In: 13th Euromicro conference on digital system design: architectures, methods and tools (DSD), 2010, pp 71–79. doi:10.1109/DSD.2010.28

19. Bossuet L, Gogniat G, Philippe JL (2005) Generic design space exploration for reconfigurable architectures. In: IEEE IPDPS reconfigurable architectures workshop (RAW)

20. Kim Y, Mahapatra R, Choi K (2010) Design space exploration for efficient resource utilization in coarse-grained reconfigurable architecture. IEEE Trans Very Large Scale Integr (VLSI) Syst 18(10):1471–1482. doi:10.1109/TVLSI.2009.2025280

21. Sotiropoulou CL, Nikolaidis S (2010) Design space exploration for fpga-based multiprocessing systems. In: 17th IEEE international conference on electronics, circuits, and systems (ICECS), pp 1164–1167. 2010. doi:10.1109/ICECS.2010.5724724

22. Irturk A, Benson B, Mirzaei S, Kastner R (2008) An fpga design space exploration tool for matrix inversion architectures. In: Symposium on application specific processors, 2008. SASP 2008, pp 42–47. doi:10.1109/SASP.2008.4570784

23. Karuri K, Chattopadhyay A, Chen X, Kammler D, Hao L, Leupers R, Meyr H, Ascheid G (2008) A design flow for architecture exploration and implementation of partially reconfigurable processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 16(10):1281–1294. doi:10.1109/TVLSI.2008.2002685

24. Chattopadhyay A, Chen X, Ishebabi H, Leupers R, Ascheid G, Meyr H (2008) High-level modelling and exploration of coarse-grained re-configurable architectures. In: Design, automation and test in Europe, 2008. DATE ’08, pp 1334–1339. doi:10.1109/DATE.2008.4484864

25. Bauer L, Shafique M, Henkel J (2009) Cross-architectural design space exploration tool for reconfigurable processors. In: Design, automation test in Europe conference exhibition, 2009. DATE ’09, pp 958–963

26. Mei B, Lambrechts A, Verkest D, Mignolet JY, Lauwereins R (2005) Architecture exploration for a reconfigurable architecture template. IEEE Des Test 22:90–101. doi:10.1109/MDT.2005.27

27. Bouwens F, Berekovic M, Kanstein A, Gaydadjiev G (2007) Architectural exploration of the adres coarse-grained reconfigurable array. In: Proceedings of the 3rd international conference on reconfigurable computing: architectures, tools and applications, ARC’07. Springer, Berlin, pp 1–13. http://dl.acm.org/citation.cfm?id=1764631.1764633

28. Sun K, Pan X, Wang J, Ping L (2007) Pad: a design space exploration model for reconfigurable systems. In: Fourth international conference on information technology, 2007, ITNG ’07, pp 964–965. doi:10.1109/ITNG.2007.146

29. Miramond B, Delosme JM (2005) Design space exploration for dynamically reconfigurable architectures. In: Proceedings design, automation and test in Europe, 2005, vol 1, pp 366–371. doi:10.1109/DATE.2005.118

30. Clark N, Blome J, Chu M, Mahlke S, Biles S, Flautner K (2005) An architecture framework for transparent instruction set customization in embedded processors. SIGARCH Comput Archit News 33(2):272–283. doi:10.1145/1080695.1069993. http://doi.acm.org/10.1145/1080695.1069993

31. Wirthlin MJ, Hutchings BL (1995) A dynamic instruction set computer. In: Proc of FCCM

32. Cong J, Fan Y, Han G, Zhang Z (2004) Application-specific instruction generation for configurable processor architectures. In: Proc of ISFPGA

33. Mbaye M, Belanger N, Savaria Y, Pierre S (2005) Application specific instruction-set processor generation for video processing based on loop optimization. In: International symposium on circuits and systems (ISCAS 2005). IEEE Press, New York, pp 515–3518

34. Mbaye M, Belanger N, Savaria Y, Pierre S (2007) A novel application-specific instruction-set processor design approach for video processing acceleration. J VLSI Signal Process Syst 47(3):297–315

35. Vogt T, Wehn N (2008) A reconfigurable application specific instruction set processor for convolutional and turbo decoding in a sdr environment. In: Design, automation and test in Europe, DATE 2008. IEEE Press, New York, pp 38–43

36. Guan X, Fei Y, Lin H (2011) Hierarchical design of an application-specific instruction set processor for high-throughput and scalable fft processing. IEEE Trans Very Large Scale Integr (VLSI) Syst PP(99):1–13. doi:10.1109/TVLSI.2011.2105512

37. Shen Z, He H, Zhang Y, Sun Y (2007) A video specific instruction set architecture for asip design. VLSI Des 2007(2):1–7. doi:10.1155/2007/58431

38. Fanucci L, Cassiano M, Saponara S, Kammler D, Witte EM, Schliebusch O, Ascheid G, Leupers R, Meyr H (2006) Asip design and synthesis for non linear filtering in image processing. In: Proceedings of the conference on design, automation and test in Europe (DATE), Leuven, Belgium. European Design and Automation Association, Grenoble, pp 233–238



39. Brisk P, Verma AK, Ienne P (2007) Optimal polynomial-time interprocedural register allocation for high-level synthesis and asip design. In: Proc of the international conference on computer-aided design (CCAD). IEEE Press, Piscataway, pp 172–179

40. Dinh Q, Chen D, Wong MDF (2008) Efficient asip design for configurable processors with fine-grained resource sharing. In: Proceedings of the international symposium on field programmable gate arrays (ISFPGA). ACM, New York, pp 99–106. http://doi.acm.org/10.1145/1344671.1344687

41. Mehta G, Jones A (2010) An architectural space exploration tool for domain specific reconfigurable computing. In: IEEE international symposium on parallel distributed processing, workshops and phd forum (IPDPSW), 2010, pp 1–8. doi:10.1109/IPDPSW.2010.5470735

42. Micheli GD (1994) Synthesis and optimization of digital circuits. McGraw-Hill, New York

43. Hoare R, Jones AK, Kusic D, Fazekas J, Foster J, Tung S, McCloud M (2006) Rapid VLIW processor customization for signal processing applications using combinational hardware functions. EURASIP J Appl Signal Process 46:472 (23 pages)

44. Bray T, Paoli J, Sperberg-McQueen CM, Maler E, Yergeau F (2006) Extensible markup language (xml) 1.0 (fourth edition) - origin and goals. Tech Rep 20060816, World Wide Web Consortium

45. Ihrig CJ, Baz M, Stander J, Hoare RR, Norman BA, Prokopyev O, Hunsaker B, Jones AK (2008) Greedy algorithms for mapping onto a coarse-grained reconfigurable fabric. I-Tech Education and Publishing, Vienna
