
730 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006

System-Level Power-Performance Tradeoffs for Reconfigurable Computing

Juanjo Noguera and Rosa M. Badia

Abstract—In this paper, we propose a configuration-aware data-partitioning approach for reconfigurable computing. We show how the reconfiguration overhead impacts the data-partitioning process. Moreover, we explore the system-level power-performance tradeoffs available when implementing streaming embedded applications on fine-grained reconfigurable architectures. For a certain group of streaming applications, we show that an efficient hardware/software partitioning algorithm is required when targeting low power. However, if the application objective is performance, then we propose the use of dynamically reconfigurable architectures. We propose a design methodology that adapts the architecture and algorithms to the application requirements. The methodology has been proven to work on a real research platform based on Xilinx devices. Finally, we have applied our methodology and algorithms to the case study of image sharpening, which is required nowadays in digital cameras and mobile phones.

Index Terms—Hardware/software (HW/SW) codesign, power-performance tradeoffs, reconfigurable computing (RC).

I. INTRODUCTION AND MOTIVATION

RECONFIGURABLE computing (RC) [5] is an interesting alternative to application-specific integrated circuits (ASICs) and general-purpose processors in order to implement embedded systems, since it provides the flexibility of software processors and the efficiency and throughput of hardware coprocessors. Programmable systems-on-chip have become a reality, combining a wide range of complex functions on a single die. An example is the Virtex-II Pro from Xilinx, which integrates a core processor (PowerPC405), embedded memory, and configurable logic.1 Additionally, the importance of having on-chip programmable logic regions in system-on-chip (SoC) platforms is becoming increasingly evident. Partitioning an application among software and programmable logic hardware can substantially improve performance, but such partitioning can also improve power consumption by performing computations more effectively and by allowing for longer microprocessor shutdown periods.

Dynamic reconfiguration [25] has emerged as a particularly attractive technique to increase the effective use of programmable logic blocks, since it allows the change of the device configuration on the fly during application execution. However, this attractive idea of time-multiplexing the needed device configuration does not come for free. The reconfiguration overhead has to be minimized in order to improve application performance. Temporal partitioning [16] and context scheduling [9] can be used to minimize this penalty.

Manuscript received July 2, 2005; revised January 9, 2006. This work was supported by the CICYT under Project TIN2004-07739-CO2-01 and by DURSI under Project 2001SGR00226.

J. Noguera was with the Computer Architecture Department, Technical University of Catalonia, 08034 Barcelona, Spain. He is now with Xilinx Research Laboratories, Saggart, Co. Dublin, Ireland (e-mail: [email protected]; [email protected]).

R. M. Badia is with the Computer Architecture Department, Technical University of Catalonia, 08034 Barcelona, Spain (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2006.878343

1 [Online]. Available: http://www.xilinx.com/virtex2pro

We could summarize that the system-level approaches to reconfigurable computing can be divided into two broad categories: 1) hardware/software (HW/SW) partitioning for statically reconfigurable architectures and 2) temporal partitioning and context scheduling2 for dynamically reconfigurable architectures.

On the other hand, energy-efficient computation is a major challenge in embedded systems design, especially if portable, battery-powered systems (e.g., mobile phones or digital cameras) are considered [1]. It is well known that the memory hierarchy is one of the major contributors to the system-level power budget [1], [22]. Thus, the way we partition the data between on-chip and off-chip memory will impact the overall system-level power consumption.

In this paper, we investigate the power-performance tradeoffs for these two system-level approaches to RC. We show that, when targeting streaming applications, the use of a given approach (i.e., HW/SW partitioning or context scheduling) depends on the application requirements (i.e., power or performance). Moreover, we propose that, in the reconfiguration context scheduling approach, the reconfigurable architecture should process large blocks of data, which should be stored in external memory resources. The execution of large blocks of data minimizes the reconfiguration overhead, but it also increases the power consumption due to the use of external memory. On the other hand, the HW/SW partitioning-based approach should process small blocks of data that can be stored in on-chip memory, which means that we reduce the overall system power consumption.

The paper is organized as follows. Section II explains the related work. In Section III, we introduce our target architecture. The proposed design methodology for embedded systems is presented in Section IV. Section V introduces the concept of configuration-aware data partitioning. In Section VI, we explain the benchmarks, the experimental setup, and the obtained results. Finally, the conclusions of this paper are presented in Section VII.

2In this paper, context scheduling refers to the scheduling of: 1) tasks' executions and 2) in partially (not multicontext) reconfigurable devices, the reconfiguration processes of the configurable blocks.

1063-8210/$20.00 © 2006 IEEE


II. PREVIOUS WORK

HW/SW partitioning for reconfigurable computing has been addressed in several research efforts [2], [6], [8]. An integrated algorithm for HW/SW partitioning and scheduling, temporal partitioning, and context scheduling is presented in [2]. On the other hand, context scheduling has also been widely addressed in many publications [9], [16], [21], [23]. However, none of these papers introduces any power-performance tradeoffs.

A review of design techniques for system-level dynamic power management can be found in [1]. In addition, a survey on power-aware design techniques for real-time systems is given in [22]. However, none of these papers considers the use of reconfigurable architectures.

Power consumption of field-programmable gate-array (FPGA) devices has been addressed in several research efforts [3], [19], [24]. In addition, several power estimation models have been proposed [7], [15]. However, all of these approaches study the power requirements at the device level and not at the system level.

To date, few research efforts have addressed low-power task scheduling for dynamically reconfigurable devices. The technique proposed in [10] tries to minimize power consumption during reconfiguration by minimizing the number of bit changes between reconfiguration contexts. However, neither power-performance tradeoffs nor power measurements are presented. More recently, in [12], it was shown that configuration prefetching and frequency scaling could reduce the energy consumption without affecting performance. However, that paper does not cover the benefits of HW/SW partitioning.

Additional techniques are given in [18] and [20]. A technique for application partitioning between configurable logic and an embedded processor is given in [20]. This paper shows that such partitioning helps to improve both performance and energy. However, the paper only considers statically configurable logic and does not consider dynamically reconfigurable architectures. A different approach for coarse-grained RC is presented in [18]. In that paper, a data-scheduler algorithm is proposed to reduce the overall system energy. However, the paper does not consider the benefits of HW/SW partitioning.

A. Contributions of This Work

This paper explores the system-level power-performance tradeoffs for fine-grained reconfigurable computing. More specifically, the paper compares, in terms of energy savings and performance improvements, the two key approaches existing in reconfigurable computing: 1) partitioning an application between software and configurable hardware and 2) context scheduling for dynamically reconfigurable architectures. To the best of our knowledge, this open issue has not been addressed in previous research efforts.

In addition, the study presented in this paper focuses on a data-size-based partitioning approach for streaming applications. This is different from the majority of the traditional HW/SW partitioning and context scheduling approaches in the literature, which focus on task-graph dependency analysis.

Fig. 1. Dynamically reconfigurable CMP architecture.

Fig. 2. (a) Dynamically reconfigurable processor. (b) Architecture of the L2 on-chip memory subsystem.

III. TARGET ARCHITECTURE

The target architecture is a heterogeneous architecture, which includes an embedded processor, a given number of dynamically reconfigurable processors (DRPs), an on-chip L2 multibank memory subsystem, and external DRAM memory resources. An example of this architecture is shown in Fig. 1, where we can see a four-DRP-based architecture. This architecture follows the chip multiprocessor (CMP) paradigm. The data that must be transferred between tasks executed in the DRP processors are stored in the on-chip L2 memory subsystem. Each DRP processor can be independently reconfigured. The proposed target architecture supports multiple reconfigurations running concurrently, which is not the case for most of the architectures proposed in the literature.

Each DRP processor has a local L1 memory buffer. A hardware-based data prefetching mechanism is proposed to hide the memory latency. Each DRP has a point-to-point link to the L2 buffers (in order to simplify Fig. 1, this is not shown in the picture). However, this is shown in Fig. 2(a), which shows the internal architecture of a DRP processor. There are three main components in this architecture: 1) the load unit; 2) the store unit; and 3) the dynamically reconfigurable logic. The DRPs are single-context devices. It can be observed in Fig. 2(a) that the load and store units have internal L1 data buffers. As shown in the figure, each unit (i.e., load and store) has two internal buffers. This approach enables the possibility of having three processes running concurrently: 1) the load unit receiving data for the next computation; 2) the reconfigurable logic processing data from a buffer in the load unit and storing the processed data in a buffer of the store unit; and 3) the store unit sending the previously processed data to the L2 memory subsystem.

Fig. 3. Design methodology for embedded systems.

The on-chip L2 memory subsystem is based on a multibank approach [see Fig. 2(b)]. Each one of these banks is logically divided into two independent subbanks (i.e., this enables reading from one subbank while concurrently writing to the other subbank of the same physical bank). These buffers interact on one side with the data prefetch units [see the left-hand side of Fig. 2(b)] using a crossbar and on the other side with an on-chip bus that connects to the external DRAM memory controller. In this L2 memory subsystem, there must be as many data prefetch units as DRP processors.

The proposed architecture includes, for each DRP, a dedicated hardware-based configuration prefetch unit. This is not shown in the pictures in order to simplify the figures. Thus, the architecture supports the transfer of data in one DRP overlapped with the reconfiguration of a different DRP.

Each DRP processor has its own clock signal, which means that this is a kind of globally-asynchronous locally-synchronous (GALS) architecture. The architecture supports the use of clock-gating and frequency-scaling techniques for power consumption minimization independently for each DRP.

IV. DESIGN METHODOLOGY FOR EMBEDDED SYSTEMS

The proposed design methodology is depicted in Fig. 3. We can observe that it is divided into three steps: 1) application phase; 2) static phase; and 3) dynamic phase.

A. Application Phase

The proposed methodology assumes that the input application is specified as a task graph, where nodes represent tasks (i.e., coarse-grained computations) and edges represent data dependencies. Each edge has a weight to represent the amount of data that must be transferred between tasks. Finally, each task has an associated task type (i.e., in the task-graph specification, we could have several tasks implementing the same type of computation).
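As an illustration, the task-graph input just described can be sketched as a small data structure. The names (`Task`, `TaskGraph`) and fields below are our own illustrative assumptions, not the authors' specification format:

```python
# Hypothetical sketch of the task-graph input: nodes are coarse-grained
# tasks with a task type (several tasks may share one type, i.e., one
# configuration), and edges carry the amount of data transferred.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    task_type: str

@dataclass
class TaskGraph:
    tasks: dict = field(default_factory=dict)   # name -> Task
    edges: dict = field(default_factory=dict)   # (src, dst) -> data size

    def add_task(self, name, task_type):
        self.tasks[name] = Task(name, task_type)

    def add_edge(self, src, dst, data_size):
        self.edges[(src, dst)] = data_size

    def predecessors(self, name):
        return [s for (s, d) in self.edges if d == name]

g = TaskGraph()
g.add_task("T1", "filter"); g.add_task("T2", "filter"); g.add_task("T3", "dct")
g.add_edge("T1", "T2", 64 * 64)   # T2 consumes a 64x64 block from T1
g.add_edge("T1", "T3", 64 * 64)
```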

B. Static Phase

In this phase, there are four main processes: 1) task-level graph transformations; 2) HW/SW synthesis; 3) HW/SW partitioning; and 4) priority task assignment.

We can apply some task-level graph transformation techniques in order to increase the architecture performance. These transformations include task pipelining, task blocking, and task (configuration) replication. The output of this step is the modified task graph.

The HW/SW synthesis is the process of implementing the tasks found in the application. The output of this process is a set of estimators. Typical estimators are HW execution time, SW execution time, HW area, and reconfiguration time. These estimators could be obtained using accurate implementation tools (i.e., compiler, logic synthesis, and place&route tools) or using high-level estimation tools.

The HW/SW partitioning process decides which tasks are mapped to hardware or software depending on: 1) the architecture parameters (i.e., the number of DRP processors or external DRAM size); 2) the modified task graph; and 3) the tasks' estimators. Note that the application requirements do not directly affect the HW/SW partitioning process, but they do affect it indirectly through the modified task graph. The partitioning algorithm must take into account the configuration prefetch technique in its implementation.
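A minimal sketch of an estimator-driven mapping decision of this kind, under our own assumptions (a greedy per-task rule that charges hardware candidates with an expected, prefetch-aware reconfiguration overhead); the paper does not specify this exact algorithm:

```python
# Illustrative greedy rule (not the authors' algorithm): a task goes to
# hardware only if its hardware time, inflated by the expected
# reconfiguration overhead, still beats software and its area fits.
def map_to_hw(hw_time, sw_time, hw_area, free_area,
              p_reconf, t_reconf, avg_exec):
    # Expected overhead per execution: reconfiguration time not hidden
    # behind the average task execution, weighted by the probability
    # that a reconfiguration is actually needed.
    overhead = p_reconf * max(0.0, t_reconf - avg_exec)
    return hw_area <= free_area and (hw_time + overhead) < sw_time

map_to_hw(hw_time=2.0, sw_time=10.0, hw_area=100, free_area=500,
          p_reconf=0.5, t_reconf=1.0, avg_exec=2.0)  # -> True
```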

Finally, in the static phase, we also find the Priority Task Assignment process. In this process, we statically assign to each task a priority of execution. This information will be used during run-time to decide the execution order of the tasks. An example of a priority function is critical-path analysis.

C. Dynamic Phase

This phase is responsible for the scheduling of the tasks but also for the scheduling of the DRPs' reconfigurations. The Task Scheduler and DRP Context Scheduler cooperate and run in parallel during application run-time execution. Their functionality is based on the use of a look-ahead strategy into the list of tasks ready for execution (i.e., tasks whose predecessors have finished their execution). At run-time, the task scheduler assigns tasks to DRPs and decides the execution order of the tasks found in the ready-for-execution list. The DRP context (configuration) scheduler is used to minimize the reconfiguration overhead. The objective of the DRP context scheduler is to decide: 1) which DRP processor must be reconfigured and 2) which reconfiguration context, or hardware task from the list of tasks ready for reconfiguration (i.e., tasks whose predecessors have initiated their execution), must be loaded in the DRP processor. This scheduler tries to minimize the reconfiguration overhead by overlapping the execution of tasks with DRP reconfigurations. These algorithms are implemented in hardware using the dynamic scheduling unit (DSU) found in our architecture (see Fig. 1) [13]. Several research efforts in the field of SoC design propose moving into hardware functionality that has traditionally been assigned to operating systems [17].
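The two ready lists used by the cooperating schedulers can be sketched in software as follows. This is an illustrative model of the rules above (ready for reconfiguration once all predecessors have started; ready for execution once all have finished), not the hardware DSU:

```python
# Illustrative sketch of the two ready lists maintained at run-time.
def ready_sets(preds, started, finished):
    """preds: task -> set of predecessor tasks.
    Returns (ready_for_execution, ready_for_reconfiguration)."""
    ready_exec = {t for t, p in preds.items()
                  if t not in started and p <= finished}
    ready_reconf = {t for t, p in preds.items()
                    if t not in started and p <= started}
    return ready_exec, ready_reconf

preds = {"T1": set(), "T2": {"T1"}, "T3": {"T1"}}
# T1 has started but not finished: T2/T3 may be prefetched, not executed.
re, rr = ready_sets(preds, started={"T1"}, finished=set())
# re == set(), rr == {"T2", "T3"}
```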

V. CONFIGURATION-AWARE DATA PARTITIONING

Here, we explain how, depending on the application requirements (e.g., power or performance), the reconfiguration overhead impacts the data-partitioning process. Moreover, we show that the proposed data-partitioning technique highly influences the HW/SW partitioning results. Finally, we explain the power-performance design tradeoffs that are involved in the data-partitioning technique.

Fig. 4. (a) Task graph for this example. (b) Sequential scheduling. (c) Scheduling with configuration prefetching. (d) Scheduling with data partitioning and configuration prefetching.

A. Introduction and Motivation

In our approach, we want to execute an application that is modeled as a task graph on a hybrid reconfigurable architecture with a given number of DRP processors, each one of them characterized by its reconfiguration time. Moreover, the application must process an input data set of a given fixed size. In many streaming embedded applications, we could assume that the execution time of the application is proportional to the size of the data that have to be processed. In other words, this means that the execution time of the tasks found in the application is proportional to the amount of data that each task has to process. The data-partitioning process that we are proposing assumes that the execution time of the tasks is longer than the DRP reconfiguration time.

Obviously, there are several alternatives when scheduling an application on a dynamically reconfigurable architecture. In Fig. 4, we can observe three possible solutions for an application with five tasks and a three-DRP-based architecture. The task graph used in this example is shown in Fig. 4(a), where we can also observe the execution time of each task. In the following paragraphs, we explain these three possible solutions.

1) Sequential Scheduling: This is the simplest solution, where task executions and DRP reconfigurations are sequentially scheduled in the DRPs [see Fig. 4(b)]. We can observe that the execution time of the tasks is longer than the DRP reconfigurations (shown as a shadowed “R” in the figure). Finally, we should also notice the performance penalty due to the reconfiguration overhead.

2) Scheduling With Configuration Prefetching: Configuration caching [27] and configuration prefetching [4] are well-known mechanisms in reconfigurable computing to hide the reconfiguration overhead. Configuration prefetching is based on the idea of loading the required configuration on a DRP before it is actually required, thus overlapping execution in a DRP with reconfiguration in a different DRP. In our approach, the configuration prefetching of a task could start when all its predecessor tasks have started their execution. For instance, the configuration prefetching of task T2 could start after task T1 has begun its execution. On the other hand, the execution of a task might start when all of its predecessors have finished their execution (task T2 can start when task T1 has finished). As we can observe in Fig. 4(c), this technique completely hides the reconfiguration overhead in all DRP processors, thus improving the application performance.

This approach is based on the idea that the task graph is executed only one time (i.e., each task processes all the input data set). The benefit of this approach is that it requires the minimum number of DRP reconfigurations (e.g., five reconfigurations in this example). However, this approach has two main drawbacks.

• The size of the shared memory buffers used for task communication is large (they must be able to store the maximum data size required by all the tasks).

• The DRP processors wait for their incoming data during a significant amount of time (i.e., they have finished the reconfiguration but cannot start execution because the input streams are not yet in the shared memory buffers).

3) Scheduling With Configuration Prefetching and Data Partitioning: This approach tries to overcome the limitations of the previous approach. This solution also uses the concept of configuration prefetching, but the input data set is not processed all at the same time. Instead, the input data set is partitioned into several data blocks of a given size. This also means that the task graph must be iterated as many times as the number of input data blocks. In the example shown in Fig. 4(d), we can observe that the input data set has been partitioned into two data blocks (named “0” and “1”) and that the task graph is iterated twice. This technique reduces the size of the shared memory buffers required for task communication. Moreover, we can also see that the latency from DRP reconfiguration to DRP execution is reduced. However, this approach has the drawback that it increases the number of reconfigurations, because the task graph must be iterated several times. For example, in Fig. 4(d), we can see that we now have nine reconfigurations compared with the five reconfigurations required in Fig. 4(c). In addition, this technique also impacts performance, since in this approach we cannot use reconfiguration prefetching between two iterations of the task graph.


Fig. 5. Example of how increasing the amount of processed data might help to minimize the reconfiguration overhead.

B. Model for the Reconfiguration Overhead

It has been demonstrated that the parameters of the reconfigurable architecture (i.e., number of DRP processors or reconfiguration time) have a direct impact on the performance given by the HW/SW partitioning process [11]. The partitioning process must take into account the reconfiguration time and the configuration prefetching technique for reconfiguration latency minimization. This is summarized in the following expression, which shows how the execution time3 of a task mapped to hardware must be modified to consider the reconfiguration overhead (see also Fig. 5):

    t'_exe(τ) = t_exe(τ) + P_r · (t_r − t̄_exe)    (1)

where
• t_exe(τ) is the execution time of task τ without any reconfiguration overhead;
• P_r is the probability of reconfiguration, which is a function of the number of tasks mapped to hardware and the number of DRP processors;
• t_r is the reconfiguration time needed for a DRP processor to change its context (configuration);
• t̄_exe is the average execution time for all tasks.

On the other hand, in the design of embedded systems, we would like to minimize the number of accesses to external memory, in order to reduce the overall system-level power consumption. Thus, data transfers between tasks should be kept to a size that fits into the on-chip L2 memory.

In many streaming embedded applications, we could assume that the execution time of a given task implemented in hardware or software is proportional to the size of the data that must be processed. Thus, if the data are stored in on-chip memory with a smaller capacity, then we could conclude that the average execution time of the tasks will be smaller when compared with the reconfiguration time (we are assuming reconfiguration times on the order of 800 µs–1.4 ms). If this is the case, then, applying expression (1), we will have a significant reconfiguration overhead (because the reconfiguration time is much longer than the average execution time), which may prevent moving the task from software to hardware. In order to overcome this limitation and reduce the reconfiguration overhead, we could increase the amount of data to be processed by the task. Increasing the amount of data means that we will be forced to use external memory. Using this approach, we increase the performance (because more tasks could be mapped to hardware), but we also increase the overall system-level power consumption.

3In this paper, the execution time of a task includes the time required to: 1) read the data from memory; 2) process the data; and 3) write the processed data back to memory.

An example of this previous concept [see (1)] can be observed in Fig. 5, where we consider the execution of two tasks. Thus, in Fig. 5(b), we can see that, although using the reconfiguration prefetching technique, we cannot completely hide the reconfiguration overhead for task T2, since task T1 has a shorter execution time because it processes ten data units [see Fig. 5(a)]. As previously introduced, this might be improved by increasing the amount of processed data. In this example, we have increased the amount of processed data to twenty data units [see Fig. 5(c)], which in fact increases the execution time of task T1 in such a manner that it equals the reconfiguration time for task T2, hence completely hiding the reconfiguration overhead [see Fig. 5(d)].
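The Fig. 5 reasoning can be checked numerically with a small sketch of the overhead model; the per-unit processing cost and reconfiguration time below are our own illustrative assumptions, not measured values:

```python
# Numerical sketch of the reconfiguration-overhead model and the Fig. 5
# scenario: enlarging the data block stretches T1's execution until it
# covers T2's reconfiguration time.
def hw_time_with_overhead(t_exec, p_reconf, t_reconf, avg_exec):
    """Execution time of a HW task including the expected reconfiguration
    overhead not hidden by prefetching (cf. expression (1))."""
    return t_exec + p_reconf * max(0.0, t_reconf - avg_exec)

COST_PER_UNIT = 0.05          # ms per data unit (assumed)
T_RECONF = 1.0                # ms, within the 0.8 ms-1.4 ms range cited

def exposed_reconf(data_units):
    """Reconfiguration time of T2 not hidden behind T1's execution."""
    return max(0.0, T_RECONF - COST_PER_UNIT * data_units)

exposed_reconf(10)   # 0.5 ms still exposed, as in Fig. 5(b)
exposed_reconf(20)   # 0.0 ms: overhead fully hidden, as in Fig. 5(d)
```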

C. Data Partitioning for Reconfigurable Architectures

How the input data set is partitioned will mainly drive the use of a given approach: 1) HW/SW partitioning for statically reconfigurable architectures using on-chip memory or 2) context scheduling for dynamically reconfigurable architectures using off-chip memory.

Thus, the input streaming data set must be partitioned into several blocks, and the size of these blocks is mainly driven by the objectives of the application (i.e., power or performance). Consequently, if the application objective is performance, then we should process large blocks of data, because we want to minimize the reconfiguration overhead. Moreover, if we are processing large blocks of data, it is more likely that these blocks do not fit in the on-chip L2 memory subsystem, and we are forced to use off-chip memory, thus increasing the overall system-level power consumption.

On the other hand, we have the situation where the application objective is low power. In this case, we must process small blocks of data so that they can be stored in the on-chip memory, thus minimizing the system-level power consumption. The drawback of this solution is that, if we process small blocks of data, then the reconfiguration overhead becomes more significant, which might prevent mapping more tasks onto the run-time reconfigurable hardware. In summary, processing small blocks of data stored in on-chip memory reduces the power consumption, but it also reduces the application performance.
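The two cases can be summarized as a simple block-size selection rule. The sketch below is a schematic illustration with assumed parameter names and units; it is not the partitioning algorithm itself.

```python
def choose_block_side(objective, on_chip_bytes, reconfig_time_us,
                      exec_time_per_pixel_us, bytes_per_pixel=1):
    """Pick a square block side for configuration-aware data partitioning.
    'power': the largest block that still fits in on-chip memory.
    'performance': the smallest block whose execution time covers the
    reconfiguration time, even if that forces off-chip memory."""
    if objective == "power":
        return int((on_chip_bytes / bytes_per_pixel) ** 0.5)
    pixels_needed = reconfig_time_us / exec_time_per_pixel_us
    return int(pixels_needed ** 0.5) + 1

# Hypothetical numbers: 4-KB on-chip buffer, 1 B/pixel, 1-ms
# reconfiguration, 0.02 us of processing per pixel.
print(choose_block_side("power", 4096, 1000.0, 0.02))        # -> 64
print(choose_block_side("performance", 4096, 1000.0, 0.02))  # -> 224
```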

In addition, the type of on-chip/off-chip data partitioning gives us the number of iterations of the task graph. This is shown in Fig. 6, where we can observe an example of an image-processing application. In this figure, we observe three different image sizes (i.e., 256 × 256, 512 × 512, and 768 × 768).


NOGUERA AND BADIA: SYSTEM-LEVEL POWER-PERFORMANCE TRADEOFFS FOR RECONFIGURABLE COMPUTING 735

Fig. 6. (a) Initial task graph. (b) Data partitioning for dynamically reconfigurable architectures. (c) Data partitioning for HW/SW partitioning.

In Fig. 6(b), we can observe that the size of the blocks of data is large (e.g., 256 × 256). The amount of data to be processed must be such that the task execution time at least equals the reconfiguration time. In this situation, the number of task-graph iterations is reduced. For example, when we must process an image of size 512 × 512, we must iterate the task graph four times, since we are processing blocks of 256 × 256 pixels. In the opposite case, we have the situation where we process small blocks of data [e.g., 64 × 64, as shown in Fig. 6(c)], but we have a large number of iterations of the task graph. For example, if we want to process an input image of 256 × 256, then we must iterate the task graph 16 times. This example assumes that the data are partitioned into square blocks, but the input data set (e.g., an image) could also have been partitioned into blocks of several rows or columns.
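The iteration counts in the Fig. 6 examples follow directly from the ratio of image side to block side:

```python
import math

def task_graph_iterations(image_side, block_side):
    """Number of task-graph iterations when a square image is split
    into square blocks, as in the Fig. 6 examples."""
    blocks_per_dim = math.ceil(image_side / block_side)
    return blocks_per_dim ** 2

print(task_graph_iterations(512, 256))  # -> 4, as in Fig. 6(b)
print(task_graph_iterations(256, 64))   # -> 16, as in Fig. 6(c)
```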

The authors would like to clarify at this point that the techniques proposed in this paper may not apply to all kinds of streaming applications. There might be other types of applications where this idea of block-based data partitioning and processing is not possible (e.g., video coding).

VI. EXPERIMENTS AND RESULTS

A. Image Sharpening Benchmarks

The proposed dynamically reconfigurable architecture addresses streaming-data (computationally intensive) embedded applications, that is, applications with a large amount of data-level parallelism. It is not the goal of the proposed architecture to address control-dominated applications.

Image-processing applications are a good example of the type of applications that we are addressing. This kind of application is becoming more and more sensitive to power consumption, especially if we consider the increasing market share of digital cameras and mobile phones with embedded cameras, which require this type of image-processing technique. In this sense, we have selected three benchmarks that implement an image-sharpening application (see Fig. 7). The three benchmarks follow the same basic process: 1) transform the input image from RGB to YCrCb color space; 2) improve the image quality by processing

Fig. 7. Image sharpening benchmarks. (a) Unsharp masking. (b) Sobel filter. (c) Laplacian filter.

Fig. 8. Galapagos prototyping platform.

the luminance (mainly using sliding-window operations such as 3 × 3 linear convolutions); and 3) transform from YCrCb back to RGB color space.
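Step 2 of the benchmarks is a 3 × 3 sliding-window linear convolution over the luminance plane. The pure-Python sketch below shows that window operation in isolation (border pixels are copied through unchanged for brevity); the kernel coefficients are illustrative assumptions, since the exact benchmark kernels are not reproduced here.

```python
def convolve3x3(y_block, kernel):
    """3x3 sliding-window linear convolution over a luminance block,
    with results clamped to the 8-bit range. Border pixels are copied
    through unchanged to keep the sketch short."""
    h, w = len(y_block), len(y_block[0])
    out = [row[:] for row in y_block]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += kernel[di + 1][dj + 1] * y_block[i + di][j + dj]
            out[i][j] = min(255.0, max(0.0, acc))
    return out

# An unsharp-masking style kernel (coefficients assumed for illustration):
unsharp = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
```

On a constant-luminance block this kernel leaves every pixel unchanged (5x − 4x = x), which is a quick sanity check for a hardware implementation of the same window.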

Three different input data sets (image sizes) have been used in the experiments: 1) 256 × 256; 2) 512 × 512; and 3) 768 × 768.

B. Prototype Implementation

A prototype of the proposed architecture has been designed and implemented. The Galapagos system is a PCI-based system (64 b/66 MHz). It is based on leading-edge FPGAs from Xilinx and high-bandwidth DDR SDRAM memory (see the left-hand side of Fig. 8). This reconfigurable system is based on a Virtex-II Pro device. The device used is an XC2VP20, which includes two PowerPC processors. The dynamic scheduling unit (DSU, in Fig. 1) and the data prefetch units of the L2 memory subsystem [see Fig. 2(b)] have been mapped to the Virtex-II Pro device, which also includes the SDRAM memory controller. These blocks have been designed in Verilog HDL and implemented using Synplicity (synthesis) and Xilinx (place&route) tools.

The DRP processors of our architecture are implemented in the Galapagos system using three Virtex-II devices (i.e., XC2V1000). The load and store units have been implemented using Virtex-II on-chip memory. The size of the buffers in the load/store units is 2 KB per buffer (i.e., 4 KB for each unit). The width of the memory words is 64 b. Fig. 8 shows a picture of the Galapagos system in a PC environment.



Fig. 9. Final placed and routed task on three Virtex-II devices: (a) XC2V1000, (b) XC2V500, and (c) XC2V250.

Fig. 10. HW/SW task execution time.

C. Task Performance Results

Fig. 10 shows the execution time of the tasks for the unsharp masking application running on: 1) an embedded PowerPC405 processor (300 MHz), which processes blocks of data of 64 × 64 pixels; 2) a DRP processor from the Galapagos system (60 MHz) processing blocks of 64 × 64 pixels; and 3) a DRP processor processing blocks of data of 256 × 256 pixels.

It is interesting to note the order-of-magnitude improvement obtained in the implementation of the blur task (3 × 3 linear convolution). It is not the objective of this paper to explain the details of the implementation of the several tasks in hardware. These tasks have been designed in Verilog HDL, simulated using ModelSim, and implemented using Synplicity (synthesis) and Xilinx (place&route) tools.

In order to reduce the reconfiguration overhead, we have used the partial reconfiguration capability of the Virtex-II devices [26]. In this sense, the Virtex-II resources used by the hardware tasks have been fixed to the center of the device, where we time-multiplex the required task (see Fig. 9). The left and right sides of the device are used by the DRP's load and store units, which are not run-time reconfigured [see Fig. 2(a)]. We have implemented the DRP processors in three different Xilinx Virtex-II devices (i.e., XC2V250, XC2V500, and XC2V1000), which mainly differ in the amount of hardware area available to the reconfigurable unit.

Using this capability of the Virtex-II devices, we have obtained, using a reconfiguration clock of 66 MHz, the following average reconfiguration times for the three devices: a) 949 μs for an XC2V250; b) 1087 μs for an XC2V500; and c) 1337 μs for an XC2V1000.
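These measured times are consistent with a first-order model in which the partial-bitstream size is divided by the configuration-port bandwidth (the Virtex-II SelectMAP port loads one byte per configuration clock). The bitstream size below is a hypothetical example, not the actual size of our partial bitstreams.

```python
def partial_reconfig_time_us(bitstream_bytes, clock_mhz=66.0,
                             port_width_bytes=1):
    """First-order partial-reconfiguration time estimate: bitstream
    size divided by configuration-port bandwidth. Setup and framing
    overheads are ignored, so measured times are somewhat higher."""
    bytes_per_us = clock_mhz * port_width_bytes  # MB/s equals bytes/us
    return bitstream_bytes / bytes_per_us

# A hypothetical ~63-KB partial bitstream at a 66-MHz config clock:
print(round(partial_reconfig_time_us(62700)))  # -> 950 (us)
```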

Fig. 11. Hardware/software task power consumption.

D. Task Power Results

Fig. 11 shows the power consumption for a Galapagos DRP in its several states, using on-chip or off-chip memory. Moreover, we can also observe the power consumption of the embedded PowerPC405 processor, which is used to execute the tasks mapped to software. In Fig. 11, we give three values for the DRP's power consumption (a different value for each Xilinx Virtex-II device).

These power-consumption values for the several Virtex-II devices have been obtained using XPower,4 the power-estimation tool from Xilinx. Moreover, using XPower, we have estimated the power consumption of the on-chip memory. Finally, the power consumption of the off-chip memory (i.e., external DRAM) has been obtained from Micron datasheets.5 We have used two memory chips of 64 MB running at 100 MHz. In the following paragraphs, we explain the DRP processor's power consumption in its several states.

• The power consumption in the idle/wait state represents: 1) the static (i.e., leakage) power associated with a complete DRP processor (i.e., load, store, and reconfigurable units) and 2) the static power taken by the on-chip or off-chip memory resources. Clearly, the static power increases when we: 1) use external memory or 2) increase the size of the device (i.e., increase the hardware area).

4. [Online]. Available: http://www.xilinx.com/virtex
5. [Online]. Available: http://www.micron.com

• The power in the reconfiguration state includes: 1) the power of the DRP processor itself (from Xilinx, we have learned that this power consumption is mainly driven by the device leakage power); 2) the dynamic power consumption of the L2 configuration prefetch unit; and 3) the dynamic power from the on-chip or off-chip memory resources (in the latter situation, we also include the power consumption of the I/O buffers). It is interesting to note that the dynamic power taken by the Xilinx Virtex-II devices during the reconfiguration process can be ignored; that is, the static (i.e., leakage) power is so dominant that the dynamic power becomes negligible. Keep in mind that during the reconfiguration process only a minor amount of logic is actually switching, since the reconfiguration context (i.e., bitstream) is sequentially loaded into the reconfigurable hardware.

• The power consumption in the execution state accounts for: 1) the static and dynamic power of the full DRP processor (i.e., load, store, and reconfigurable units); 2) the dynamic power from the L2 data prefetch units; and 3) the power consumption of the associated (i.e., on-chip or off-chip) memory resources. As in the previous case, when dealing with external memory, we also take into account the power consumption of the I/O buffers (e.g., LVTTL 3.3 V). The DRP power consumption in execution is an average power obtained when the tasks from the unsharp masking application run at 60 MHz. This average power consumption has been obtained by performing a gate-level-accurate simulation after the place&route process for all the tasks.

Finally, let us briefly explain the power consumption of the embedded CPU. According to Xilinx, the PowerPC405 consumes 0.9 mW/MHz.6 Assuming a clock frequency of 300 MHz, we obtain a power consumption of 270 mW. We should also add here the power consumption of the data prefetch units attached to the embedded processor.

E. Energy–Performance Tradeoff Results

In this subsection, we explain the energy–performance tradeoff results obtained when applying the proposed configuration-aware data-partitioning technique. The performance results have been obtained from real executions on the Galapagos system. The execution generates a log file with the state changes of the Virtex-II devices and the embedded PowerPC. We have obtained the energy from: 1) the power consumption of the components as described in Fig. 11 and 2) the execution log file, which gives the amount of time that a device has been in a given state.
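The energy computation described here amounts to summing, over the log, the time spent in each state weighted by that state's average power. The sketch below uses an assumed log layout and hypothetical power numbers for illustration; the real per-state powers are those of Fig. 11.

```python
def energy_mj(state_log, power_mw):
    """Energy estimate from an execution log: sum of
    (time in state) x (average power of that state).
    state_log is a list of (device, state, duration_ms) entries;
    power_mw maps (device, state) -> average power in mW."""
    total_uj = 0.0
    for device, state, duration_ms in state_log:
        total_uj += power_mw[(device, state)] * duration_ms  # mW x ms = uJ
    return total_uj / 1000.0  # microjoules -> millijoules

# Hypothetical log and power table for a single DRP processor:
log = [("drp0", "reconfig", 1.0), ("drp0", "exec", 4.0), ("drp0", "idle", 2.0)]
power = {("drp0", "reconfig"): 150.0, ("drp0", "exec"): 400.0,
         ("drp0", "idle"): 50.0}
print(energy_mj(log, power))  # -> 1.85 (mJ): 150 + 1600 + 100 uJ
```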

Fig. 13(a) shows the performance results and Fig. 13(b) shows the energy-consumption results for the unsharp masking application. In all plots, we can observe the results obtained for the 768 × 768 image size when we change the target device

6. [Online]. Available: http://www.xilinx.com/virtex2pro

Fig. 12. Unsharp masking HW/SW task partitioning.

(i.e., we show the results for the three Virtex-II devices). In addition, we present the following four implementations.

• Software implementation (named seq_sw): this implementation is based on the use of the embedded PowerPC405 (the associated performance and power results are shown in Figs. 10 and 11, respectively). In this experiment, we assume that the input images have been partitioned into blocks of 64 × 64 pixels; since we have partitioned the input image into several blocks, we must iterate the task graph several times (for example, 16 times in the case of an image with 256 × 256 pixels).

• HW/SW partitioning (named seq_hw_sw): in this approach, we use on-chip memory, since we process small data blocks (i.e., 64 × 64 pixels). Moreover, we have used the HW/SW partitioning algorithm proposed in [11], assuming two or three DRP processors and the average reconfiguration times introduced in the previous subsection. The obtained partitioning can be observed in Fig. 12, where we see that the reconfiguration overhead prevents us from moving into hardware more tasks than the number of available DRP processors.

• Dynamic reconfiguration (named seq_dr): in this case, we increase the size of the data blocks to process. Specifically, we process blocks of 256 × 256 pixels, which means that we must use off-chip memory (i.e., external DRAM). This amount of data brings the tasks' execution time much closer to the DRP reconfiguration time. As a result, when we apply the HW/SW partitioning algorithm, all tasks are mapped to the reconfigurable hardware.

• Hardware implementation (named seq_hw): this approach assumes that: 1) we use five DRP processors and 2) we use on-chip memory, since we process blocks of data of 64 × 64 pixels. This should be considered the optimum solution in terms of both power and performance, since: 1) there is no reconfiguration overhead (i.e., we have as many DRPs as tasks) and 2) we use on-chip memory.

In Fig. 13(a), we show the performance obtained using the four implementations. We can observe that the software implementation (i.e., the PowerPC405-based solution) obtains the worst performance results. The use of the HW/SW partitioning approach contributes a major improvement in performance, since critical tasks are mapped to the configurable hardware. Obviously, increasing the number of DRP processors helps to improve performance, since more tasks are implemented in hardware (i.e., a 29% improvement when moving from two to three DRP processors). Moreover, it is clear that the reconfiguration time does not affect this approach, since there are no reconfigurations. The dynamic reconfiguration technique improves performance even further: it improves on the HW/SW partitioning approach by 1) 62.5% when using two DRP processors and 2) 47.3% when using three DRP processors. In addition, dynamic reconfiguration improves on the solution based on the embedded CPU by 83.14%. Finally, it is worth mentioning that, in the unsharp masking benchmark, the dynamic reconfiguration approach does not benefit from increasing the number of DRP processors (i.e., we obtain the same results in both situations). Since we are using the unmodified linear task graph, two DRP processors are enough to completely hide the reconfiguration overhead (i.e., one DRP processor is being reconfigured while the other one is executing).

Fig. 13. Unsharp masking application. (a) Performance results. (b) Energy results.

On the other hand, Fig. 13(b) shows the energy consumption for all four approaches. It is clear that the solution based on the embedded CPU is the approach that consumes the largest amount of energy. Despite using on-chip memory and requiring the minimum amount of power (see Fig. 11), the long execution time of the tasks implemented in the PowerPC405 contributes to this large energy consumption. Obviously, the hardware-based approach is the optimum solution in terms of energy consumption, thanks to the use of on-chip memory and the short execution times, which do not have any reconfiguration overhead.

Then, as intermediate solutions, we have the results for the mixed HW/SW and dynamic reconfiguration approaches. We must first observe, in both approaches, that the energy increases when: 1) having fixed the number of DRP processors, we increase the size of the reconfigurable unit (e.g., we move from two XC2V250 to two XC2V500 devices) or 2) having fixed a given Virtex-II device, we increase the number of DRP processors (i.e., we move from two to three DRP processors). In both situations, this increase in energy is due to the increase in static (i.e., idle) leakage power that comes with the increase in hardware area.

From Fig. 13(b), we can observe that, independently of the number of DRP processors, the mixed HW/SW solution requires less energy than the dynamic reconfiguration approach does.7 That is, the dynamic reconfiguration approach, despite its performance advantages, requires more energy due to its high power requirements, which come from the use of off-chip memory. It is interesting to note here that the dynamic reconfiguration approach has the same energy requirements for execution and reconfiguration when we use two DRP processors.

In summary, from Fig. 13(b) we obtain that both solutions based on configurable logic give an average 43% energy reduction when compared with the energy required by the embedded CPU implementation. This energy improvement can reach up to 60%. Moreover, HW/SW partitioning improves on the dynamic reconfiguration approach, in terms of energy consumption, by 16.4% when using two DRP processors and 35% when using three DRP processors.

VII. CONCLUSION

In this paper, we have explored the system-level power-performance tradeoffs for fine-grained reconfigurable computing. We have proposed a configuration-aware data-partitioning technique for reconfigurable architectures, and we have shown how the reconfiguration overhead directly impacts this data-partitioning process.

When targeting many streaming applications (like the image-processing applications), we have shown that the choice of approach (i.e., HW/SW partitioning for statically reconfigurable architectures or context scheduling for dynamically reconfigurable architectures) depends on the application requirements (i.e., power or performance). Thus, for this type of application, if the objective is energy efficiency, then HW/SW partitioning for statically reconfigurable logic is the most favorable solution. On the other

7. In the calculation of the energy taken by the dynamic reconfiguration approach, we assume that we can completely power off the embedded CPU (i.e., we do not consider the leakage power due to the PowerPC).



hand, if the application objective is performance, then context scheduling for dynamically reconfigurable architectures is the optimum solution.

Finally, future work includes the study of the same tradeoffs in a mixed environment, where HW/SW partitioning could be combined with context scheduling for dynamically reconfigurable architectures. Other topics of future research include applying the techniques proposed in this paper to other types of embedded applications and proposing a detailed implementation for the L2 memory subsystem.

REFERENCES

[1] L. Benini, A. Bogliolo, and G. De Micheli, "A survey of design techniques for system-level dynamic power management," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 3, pp. 299–316, Jun. 2000.

[2] K. Chatha and R. Vemuri, "Hardware-software co-design for dynamically reconfigurable architectures," in Proc. FPL, 1999, pp. 175–185.

[3] V. George, H. Zhang, and J. Rabaey, "The design of a low energy FPGA," in Proc. Int. Symp. ISLPED, 1999, pp. 188–193.

[4] S. Hauck, "Configuration prefetch for single context reconfigurable coprocessors," in Proc. ACM Int. Symp. FPGA, 1998, pp. 65–74.

[5] R. Hartenstein, "A decade of reconfigurable computing: A visionary retrospective," in Proc. DATE, 2001, pp. 642–649.

[6] B. Jeong, "Hardware-software co-synthesis for run-time incrementally reconfigurable FPGAs," in Proc. ASP-DAC, 2000, pp. 169–174.

[7] F. Li, D. Chen, L. He, and J. Cong, "Architecture evaluation for power-efficient FPGAs," in Proc. ACM Int. Symp. FPGA, 2003, pp. 175–184.

[8] Y. Li, "Hardware-software co-design of embedded reconfigurable architectures," in Proc. DAC, 2000, pp. 175–184.

[9] R. Maestre, "A framework for reconfigurable computing: Task scheduling and context management," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 507–512, Dec. 2001.

[10] R. Maestre, "Configuration management in multi-context reconfigurable systems for simultaneous performance and power optimizations," in Proc. ISSS, 2000, pp. 858–873.

[11] J. Noguera and R. M. Badia, "HW/SW co-design techniques for dynamically reconfigurable architectures," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 4, pp. 399–415, Aug. 2002.

[12] ——, "System-level power-performance trade-offs in task scheduling for dynamically reconfigurable architectures," in Proc. CASES, 2003, pp. 73–83.

[13] ——, "Multitasking on reconfigurable architectures: Micro-architecture support and dynamic scheduling," ACM TECS, 2004, pp. 385–406.

[14] ——, "Power-performance trade-offs for reconfigurable computing," in Proc. CODES+ISSS, 2004, pp. 116–121.

[15] K. W. Poon, A. Yan, and S. J. E. Wilton, "A flexible power model for FPGAs," in Proc. 12th Int. Conf. Field-Programmable Logic Appl. (FPL), 2002, pp. 312–321.

[16] K. Purna and D. Bhatia, "Temporal partitioning and scheduling data flow graphs for reconfigurable computers," IEEE Trans. Computers, vol. 48, no. 6, pp. 579–590, Jun. 1999.

[17] B. E. Saglam (Akgul) and V. Mooney, "System-on-a-chip processor synchronization support in hardware," in Proc. DATE, 2001, pp. 633–639.

[18] M. Sánchez-Élez, "A complete data scheduler for multi-context reconfigurable architectures," in Proc. DATE, 2002, pp. 547–552.

[19] L. Shang, A. S. Kaviani, and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," in Proc. Int. Symp. FPGA, 2002, pp. 157–164.

[20] G. Stitt, F. Vahid, and S. Nemetebaksh, "Energy savings and speedups from partitioning critical software loops to hardware in embedded systems," ACM TECS, 2004, pp. 218–232.

[21] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, "A time-multiplexed FPGA," in Proc. 5th IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 1997, pp. 22–28.

[22] O. S. Unsal and I. Koren, "System-level power-aware design techniques in real-time systems," Proc. IEEE, vol. 91, pp. 1055–1069, Jul. 2003.

[23] M. Vasilko and D. Ait-Boudaoud, "Scheduling for dynamically reconfigurable FPGAs," in Proc. Int. Workshop Logic Arch. Synthesis (IFIP TC10 WG10.5), 1995, pp. 328–336.

[24] K. Weiß, C. Oetker, I. Katchan, T. Steckstor, and W. Rosenstiel, "Power estimation approach for SRAM-based FPGAs," in Proc. 8th ACM Int. Symp. Field-Programmable Gate Arrays (FPGA), 2000, pp. 195–202.

[25] M. J. Wirthlin and B. L. Hutchings, "Improving functional density through run-time circuit reconfiguration," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 2, pp. 247–256, Jun. 1998.

[26] Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, Xilinx Corp., San Jose, CA, 2005, Xilinx Application Note XAPP290.

[27] Z. Li, K. Compton, and S. Hauck, "Configuration caching management techniques for reconfigurable computing," in Proc. 8th IEEE Symp. Field-Programmable Custom Computing Machines, 2000, pp. 22–36.

Juanjo Noguera received the B.Sc. degree in computer science from the Autonomous University of Barcelona, Barcelona, Spain, in 1997, and the Ph.D. degree in computer science from the Technical University of Catalonia, Barcelona, Spain, in 2005.

He has worked for the Spanish National Center for Microelectronics, the Technical University of Catalonia, and the Hewlett-Packard Inkjet Commercial Division. In January 2006, he joined Xilinx Research Labs, Dublin, Ireland. His interests include system-level design, reconfigurable architectures, and low-power design techniques. He has published papers in international journals and conference proceedings.

Rosa M. Badia received the B.Sc. and Ph.D. degrees in computer science from the Technical University of Catalonia, Barcelona, Spain, in 1989 and 1994, respectively.

She is currently an Associate Professor in the Computer Architecture Department of the Technical University of Catalonia and Project Manager at the Barcelona Supercomputing Center, Barcelona, Spain. Her interests include CAD tools for VLSI, reconfigurable architectures, performance prediction and analysis of message-passing applications, and GRID computing. She has published papers in international journals and conference proceedings.

