
PERFORMANCE AND ENERGY ANALYSIS OF TASK-LEVEL GRAPH TRANSFORMATION TECHNIQUES FOR DYNAMICALLY RECONFIGURABLE ARCHITECTURES

Juanjo Noguera*
Inkjet Commercial Division, Hewlett-Packard
email: [email protected]

Rosa M. Badia†
Computer Architecture Dept., Technical University of Catalonia (UPC)
email: [email protected]

ABSTRACT

In this paper, we present an analysis of the impact on both performance and energy of several task-level graph transformation techniques that exploit the parallel processing capabilities of run-time partially reconfigurable architectures. The proposed techniques have been applied to an image processing application (i.e., image sharpening), which has been implemented on a real research platform.

1. INTRODUCTION

Dynamic reconfiguration is a particularly attractive technique to increase the effective use of programmable logic blocks. However, the reconfiguration overhead has to be minimized in order to improve application performance. Several techniques have been proposed to improve performance in run-time reconfigurable devices [10], [4].

On the other hand, energy efficiency has become a critical issue in the design of embedded systems, especially if we consider the increasing market share of portable (i.e., battery-constrained) embedded systems. In addition, device run-time reconfiguration is a power-hungry process. However, research efforts improving both performance and energy consumption for dynamically reconfigurable architectures have not been widely addressed in the literature. In this paper, we present a study, in terms of both performance and energy consumption, of several task-level (i.e., coarse-grained, not instruction-level) graph transformation techniques for fine-grained partially reconfigurable architectures. This is an open issue that has not been addressed in previous research efforts.

This paper is organized in five more sections. Section 2 introduces the previous work. Afterwards, we introduce our target architecture in Section 3. The task-level graph transformations are explained in Section 4. The experiments and results obtained from an image processing application are shown in Section 5. Finally, Section 6 presents the conclusions of this work.

2. PREVIOUS WORK

Software pipelining for reconfigurable computing was introduced in [11]. However, this work does not cover dynamically reconfigurable devices. In [12] it is explained how loop unrolling increases the instruction-level parallelism of the hardware co-processors. They cover both statically and dynamically reconfigurable devices, but their approach targets only fine-grained parallel techniques. They do not consider coarse-grained (i.e., task-level) parallel techniques, which are the main contribution of this paper. Moreover, these research efforts do not report any results on energy consumption. Recently, several research efforts have addressed the energy consumption issue for reconfigurable computing [3], [5], [9], [13], [15]. In [7] it is shown that configuration pre-fetching and frequency scaling can reduce the energy consumption without affecting performance. However, none of the previous research efforts studied the impact of task-level graph transformation techniques on energy consumption. In this paper, we address this open issue and study several design trade-offs in terms of performance and energy.

3. TARGET ARCHITECTURE

The dynamically reconfigurable architecture [9] includes an embedded CPU, a set of dynamically reconfigurable processors (DRPs), an on-chip L2 multi-bank memory sub-system and external DRAM memory resources. An example of this architecture is shown in Fig. 1.

* Juanjo Noguera acknowledges the support of the HP-IPG Resident Fellowship program.

† CICYT project TIN2004-07739-CO2-01 and DURSI project 2001SGR00226.

[Fig. 1. Target Dynamically Reconfigurable Architecture: an embedded CPU with instruction/data (I/D) memories, a Dynamic Scheduling Unit (DSU), dynamically reconfigurable processors DRP0-DRP3 each with a local L1 memory, a multi-bank on-chip L2 memory (Bank 0 ... Bank N) with per-DRP pre-fetch units (PU), and external DRAM.]


The data that must be transferred between tasks is stored in the on-chip L2 memory. Each DRP processor can be independently reconfigured (i.e., the target architecture supports multiple reconfigurations running concurrently). The on-chip L2 memory sub-system includes, for each DRP, an independent hardware-based data and configuration pre-fetch unit (PU in Fig. 1). Thus, the architecture can overlap the transfer of data for one DRP with the reconfiguration of a different DRP. A hardware-based data pre-fetching mechanism is proposed to hide the memory latency. Finally, the target architecture includes a Dynamic Scheduling Unit (DSU) for scheduling task executions and DRP reconfigurations. A detailed description of this hardware-based dynamic scheduling approach can be found in [8].

4. TASK-LEVEL GRAPH TRANSFORMATIONS

These techniques are integrated in a design methodology for embedded systems that was introduced in [6], [9]. In this approach, the input application is specified as a task graph, where nodes represent tasks (i.e., coarse-grained computations such as convolutions, DCT, etc.) and edges represent data dependencies. Finally, each task also has an associated task type, which represents a given type of computation. A task type can be implemented in the embedded CPU or in a DRP processor as a reconfiguration context.
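As a concrete illustration of this specification model, the sketch below builds a small task graph in Python. The class names, fields and the chain-shaped example are illustrative assumptions, not part of the methodology of [6], [9].

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    task_type: str      # reconfiguration context, or "CPU" for a software task
    exec_time: int      # execution time per data block, in abstract time units

@dataclass
class TaskGraph:
    tasks: dict = field(default_factory=dict)   # name -> Task
    preds: dict = field(default_factory=dict)   # name -> set of predecessor names

    def add_task(self, task: Task) -> None:
        self.tasks[task.name] = task
        self.preds.setdefault(task.name, set())

    def add_edge(self, src: str, dst: str) -> None:
        # An edge src -> dst models a data dependency: dst consumes data produced by src.
        self.preds[dst].add(src)

# Hypothetical example: five tasks with the execution times used in Fig. 2
# (3, 1, 2, 3, 2), connected as a simple chain for illustration only.
g = TaskGraph()
for name, ttype, t in [("T1", "ctx_A", 3), ("T2", "ctx_B", 1),
                       ("T3", "ctx_C", 2), ("T4", "ctx_D", 3), ("T5", "ctx_E", 2)]:
    g.add_task(Task(name, ttype, t))
for src, dst in [("T1", "T2"), ("T2", "T3"), ("T3", "T4"), ("T4", "T5")]:
    g.add_edge(src, dst)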

4.1. Introduction and Motivation

Configuration pre-fetching is a well-known technique in reconfigurable computing to hide the reconfiguration overhead. On the other hand, we assume that the input data set has been partitioned into several data blocks, which means that the task graph must be iterated as many times as the number of data blocks into which we have divided the input data [9]. The task graph used in this example is shown in Fig. 2, where each task's execution time is shown at its right. We also assume that this task graph is iterated twice, since it must process two data blocks. Configuration pre-fetching is based on the idea of loading the required context on a given DRP before it is actually required. In our approach, the configuration pre-fetching of a task can start when all its predecessor tasks have started their execution [8]. For instance, the configuration pre-fetching of task T2 could start after task T1 has begun its execution. On the other hand, the execution of a task may start only when all its predecessors have finished their execution [8] (task T2 can start when task T1 has finished). Although this hides the reconfiguration overhead for all DRP processors, we can observe in Fig. 2 that the DRP processors are highly under-utilized (e.g., see DRP2).
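The two scheduling conditions above can be stated compactly as predicates over the sets of started and finished predecessors. The following is a minimal sketch under assumed names and data layout (it is not the DSU implementation of [8]):

def can_prefetch(task, preds, started):
    # The configuration of `task` may be pre-fetched once all of its
    # predecessors have started executing [8].
    return all(p in started for p in preds[task])

def can_execute(task, preds, finished):
    # `task` itself may start executing once all of its predecessors
    # have finished executing [8].
    return all(p in finished for p in preds[task])

# Example for the dependency T1 -> T2: once T1 has started, T2's context can
# be loaded on a free DRP, but T2 only runs after T1 has finished.
preds = {"T1": set(), "T2": {"T1"}}
started, finished = {"T1"}, set()
assert can_prefetch("T2", preds, started)
assert not can_execute("T2", preds, finished)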

4.2. Task Pipelining and Blocking

Task pipelining can be used to increase the performance and utilization of the DRP processors. In Fig. 3, we can observe that tasks T1 and T5 run concurrently but in different task-graph iterations (i.e., we iterate the task graph because we have partitioned the input streaming data into several blocks). The number of DRP processors limits the number of task-graph iterations that can be running in parallel.

In addition, we can use a task blocking technique to improve performance. This technique is based on the idea of partitioning the tasks that have a long execution time. These longer tasks are partitioned into a given number of sub-tasks, each one processing a smaller amount of data and therefore having a shorter execution time. This technique can be observed in Fig. 3, where we have partitioned tasks T3 and T4 into two shorter sub-tasks each. This helps to improve performance (i.e., compare Fig. 2 and Fig. 3 at time-stamp 7, where task T4 can start its execution one time unit earlier).

If we want to hide the reconfiguration latency, task blocking only makes sense for those tasks whose execution time is longer than the reconfiguration time; the reconfiguration time is therefore a limit on how far task blocking can be applied. For instance, assuming that task T2 has an execution time equal to the reconfiguration time (one time unit in this example), it makes no sense to apply task blocking to task T2, because task T3 could not start execution until its reconfiguration has finished (that is, the reconfiguration of task T3, and not the execution of task T2, is on the critical path). In the general case, we partition the initial task into a number of sub-tasks such that the execution time of each new sub-task at least equals the reconfiguration time.
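In other words, the reconfiguration time bounds how finely a task can usefully be blocked. The sketch below captures that bound, assuming for illustration that a task is split into equal-sized sub-tasks (the function name is ours):

def max_subtasks(exec_time, reconfig_time):
    # Largest number of equal sub-tasks such that each sub-task still runs
    # at least as long as one reconfiguration; otherwise the reconfiguration,
    # rather than the computation, becomes the critical path.
    return max(1, int(exec_time // reconfig_time))

# With the reconfiguration time of one unit used in the example of Fig. 2:
assert max_subtasks(3, 1) == 3   # a 3-unit task can be split into up to 3 sub-tasks
assert max_subtasks(1, 1) == 1   # a 1-unit task such as T2 is not worth blocking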

[Fig. 2. Example of configuration pre-fetching: schedule of tasks T1-T5 (execution times 3, 1, 2, 3, 2 time units) over two task-graph iterations on DRP0-DRP2, with reconfigurations (R) overlapped with task execution; time axis of 24 time units.]

[Fig. 3. Example of task pipelining and blocking: tasks T3 and T4 are each partitioned into two sub-tasks (T3/T3*, T4/T4*), and consecutive task-graph iterations are pipelined across DRP0-DRP2; time axis of 16 time units.]


Moreover, we use the concept of configuration re-use to minimize the number of DRP reconfigurations. For instance, tasks T3 and T3* use the same configuration context (i.e., the same hardware, but applied to different data blocks), so they are executed sequentially in DRP2. Reducing the number of DRP reconfigurations also helps to reduce the total energy consumption.

4.3. Task Type (Configuration) Replication

Finally, we can use a more aggressive configuration technique to improve performance. The configuration replication technique reconfigures several DRPs with the same configuration context. This improves performance, but it also increases energy consumption. An example of this technique can be observed in Fig. 4, where tasks T1 and T4 are partitioned using this replication technique. For example, in the case of tasks T1 and T1*, the same configuration context is loaded concurrently onto DRP0 and DRP1. Task type replication improves performance (i.e., there is a reduction of two time units in each task-graph iteration). However, this technique also increases the number of DRP reconfigurations per iteration (i.e., with task type replication we need 7 reconfigurations per iteration; see Fig. 4). This increase in the number of reconfigurations implies an increase in energy consumption.
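The energy cost of the different techniques can be seen directly in the number of context loads per task-graph iteration. The following sketch is an illustrative way of counting them (the schedule format and function name are assumptions): configuration re-use skips a load when a DRP already holds the required context, whereas replication loads the same context onto several DRPs.

def count_context_loads(schedule):
    # `schedule` maps each DRP to the ordered list of configuration contexts
    # it executes in one task-graph iteration. With configuration re-use,
    # a load is skipped when the DRP already holds the required context.
    loads = 0
    for contexts in schedule.values():
        resident = None
        for ctx in contexts:
            if ctx != resident:
                loads += 1
                resident = ctx
    return loads

# Configuration re-use: T3 and T3* share one context on DRP2 -> a single load.
print(count_context_loads({"DRP2": ["ctx_T3", "ctx_T3"]}))               # 1
# Configuration replication: the same context is loaded on two DRPs -> two loads.
print(count_context_loads({"DRP0": ["ctx_T1"], "DRP1": ["ctx_T1"]}))     # 2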

5. EXPERIMENTS AND RESULTS

We have prototyped the proposed architecture on the Galapagos system (see Fig. 5), which is a PCI-based system (64 bit / 66 MHz). It is based on Xilinx FPGAs and high-bandwidth DDR SDRAM memory. The dynamic scheduling unit and the data pre-fetch units of the L2 memory subsystem (see Fig. 1) have been mapped to a Virtex-II Pro device [1]. The DRP processors of our architecture are implemented in the Galapagos system using three Virtex-II devices (i.e., XC2V1000) [2].

The proposed dynamically reconfigurable architecture targets streaming-data (computation-intensive) embedded applications, that is, applications with a large amount of data-level parallelism. Image processing applications are a good example of the type of applications that we are addressing. These applications are becoming increasingly sensitive to power consumption, especially if we consider the increasing market share of digital cameras and mobile phones, which require this kind of image processing. In this sense, we have selected an image sharpening application (i.e., unsharp masking; see Fig. 6.a). Three different input image sizes have been used in the experiments: (1) 256x256; (2) 512x512; and (3) 768x768.

5.1. Task Implementation Results

The input images have been partitioned into smaller data blocks of: (a) 64x64 pixels (to be used with on-chip memory); and (b) 256x256 pixels (to be used with off-chip memory). Depending on the size of these data blocks, we must iterate the task graph a different number of times [9]. For example, the 512x512-pixel input image can be partitioned into four sub-blocks of 256x256 pixels (i.e., the unsharp masking task graph must be iterated four times).

In order to minimize the reconfiguration overhead, we have used the partial reconfiguration capability of Virtex-II devices (see Fig. 7). We have reduced the reconfiguration time of a Virtex-II XC2V1000 device from 8 ms (full device reconfiguration) to 1.3 ms (average partial device reconfiguration time), using a reconfiguration clock of 66 MHz. Moreover, in Fig. 6.b, we can observe the execution time for the tasks of the unsharp masking application.
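The number of task-graph iterations follows directly from the image and data-block dimensions. A minimal sketch of that arithmetic (the function name is ours, and it assumes the image dimensions are multiples of the block dimensions, as in these experiments):

def graph_iterations(image_w, image_h, block_w, block_h):
    # One task-graph iteration per data block the image is partitioned into.
    return (image_w // block_w) * (image_h // block_h)

print(graph_iterations(512, 512, 256, 256))  # 4 iterations, as in the text
print(graph_iterations(512, 512, 64, 64))    # 64 iterations with 64x64 on-chip blocks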

[Fig. 4. Example of task type replication: tasks T1 and T4 are replicated as T1/T1* and T4/T4*, so the same configuration context is loaded concurrently onto two DRPs; seven reconfigurations per task-graph iteration on DRP0-DRP2; time axis of 14 time units.]

Fig. 5. The Galapagos reconfigurable system

[Fig. 6. (a) Task graph for the unsharp masking application (tasks: RGB2YCrCb, Blur, Sub, Add, YCrCb2RGB); (b) task execution times for 64x64 and 256x256 data blocks on a Pentium III @ 1 GHz, a PowerPC @ 300 MHz and a Galapagos DRP @ 60 MHz; (c) power consumption in the reconfiguration, idle/wait and execution states for a DRP with on-chip memory, a DRP with off-chip memory and the embedded PowerPC405.]


We can see the execution time for the two possible sizes of the data blocks. This execution time has been obtained on three different platforms (i.e., a Galapagos DRP @ 60 MHz, a Pentium III @ 1 GHz and a PowerPC @ 300 MHz).

In addition, in Fig. 6.c, we show the DRP (i.e., XC2V1000) and PowerPC power consumption in their corresponding states. For the DRP-based solution, we show two possible implementations: (a) one using on-chip memory, which requires less power; and (b) one using off-chip memory, which requires more power due to the external DRAM resources. This average power consumption has been obtained from gate-level accurate simulation after the place-and-route process for all the tasks. The idle-state power represents the device's leakage power; thus, there is a waste of energy when the DRPs are not used. Since the proposed transformation techniques increase the utilization of the DRPs, they help to reduce the energy consumption.

The performance results have been obtained from executions on the Galapagos system running in a PC environment. The execution generates a trace file with the state changes of the three Virtex-II devices. We have obtained the energy from: (1) the power consumption of the device in its several states (reconfiguration, execution and idle); and (2) the execution trace file, from which we can obtain the amount of time that a device has been in a given state.
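The energy model therefore reduces to a weighted sum per device: the time spent in each state, taken from the trace, multiplied by that state's average power. A minimal sketch of this calculation (the state names and numeric values below are illustrative placeholders, not measured data):

def device_energy_mj(time_in_state_ms, power_mw):
    # Energy (mJ) = sum over states of power(state) [mW] * time(state) [ms] / 1000.
    return sum(power_mw[state] * t for state, t in time_in_state_ms.items()) / 1000.0

power = {"reconfig": 450, "idle": 300, "exec": 600}   # mW, illustrative values only
trace = {"reconfig": 1.3, "idle": 5.0, "exec": 10.0}  # ms per state, from a trace
print(device_energy_mj(trace, power))                 # -> 8.085 mJ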

5.2. Energy-Performance Results

Fig. 8 shows the performance results obtained when applying the techniques to the unsharp masking application. The results demonstrate that the technique of task pipelining and blocking ("pipe_block" in Fig. 8) obtains better results than using only configuration pre-fetching ("seq") or only task pipelining ("pipe"). Remember that this technique also uses the concept of configuration re-use, in order to minimize the reconfiguration penalty and the energy consumption. Adding configuration replication ("pipe_rep_smart") obtains better results when using three DRP processors. However, this is not the case with two DRP processors, where we observe that configuration replication gives worse results.

Fig. 9 shows the energy consumption study for the task-level transformation techniques. Fig. 9.a shows the energy consumption when we use off-chip memory, and Fig. 9.b shows the case where we use on-chip memory (see Fig. 1). The benefits in terms of energy savings are more significant when we use on-chip memory (i.e., the DRP processors require less power for execution and reconfiguration). Finally, we must note that, when increasing the number of DRP processors (independently of the memory strategy, on-chip or off-chip), there is an increase in the energy consumption. This is due to the DRPs' energy consumption in the idle state (i.e., static power), which is higher when we increase the number of DRP processors. This can be clearly seen for the unmodified task graph (i.e., "seq") on an XC2V1000 device in Fig. 9.b, when we use two or three DRP processors. The proposed task-level transformation techniques reduce this idle energy penalty.

Task pipelining and blocking is observed to be the most energy-efficient technique. For this application, task pipelining and blocking is the optimum solution in terms of minimizing the energy used for reconfiguration and maximizing the energy used for execution. Configuration replication helps to maximize device utilization, but it also increases the energy used for reconfiguration.


Fig. 7. Partial bitstreams on several Virtex-II devices; (a) XC2V1000; (b) XC2V500; (c) XC2V250

[Fig. 8. Performance results: execution time (ms) of the unsharp masking application for xc2-250, xc2-500 and xc2-1000 devices with 2 and 3 DRPs, comparing seq, pipe2, pipe2_block and pipe2_rep_smart.]


For these applications, and compared to the initial unmodified task graph, the proposed transformation techniques obtain an average performance improvement of 32% and reduce the energy consumption by 18%.

6. CONCLUSIONS

In this paper, we have analyzed, in terms of performance and energy consumption, several task-level (i.e., coarse-grained) graph transformation techniques for dynamically reconfigurable architectures. Several image processing benchmarks have been implemented on a real research prototype based on a Virtex-II Pro device. We have used these benchmarks to test the task-level graph transformation techniques. The proposed techniques use: (a) configuration pre-fetching to improve performance; and (b) configuration re-use to reduce the energy consumption. For the applications in our case study (i.e., with large amounts of data-level parallelism), the technique of task pipelining and blocking improves both performance and energy consumption when compared to the initial unmodified task graph. Increasing the number of DRP processors helps to improve performance, but it also increases the energy consumption, due to the increase in the DRPs' static power. The proposed task-level graph transformation techniques help to minimize this leakage energy penalty.

7. REFERENCES

[1] http://www.xilinx.com/virtex2pro

[2] http://www.xilinx.com/virtex

[3] J. Khan, R. Vemuri, “An Efficient Battery-Aware Task Scheduling Methodology for Portable RC Platforms and Applications”. Proc. FPL’04. Antwerp, Belgium.

[4] R. Maestre, M. Fernandez, F. Kurdahi, N. Bagherzadeh, H. Singh, “Kernel Scheduling in Reconfigurable Computing”, Proc. of DATE’99.

[5] S. Mohanty, V. Prasanna, “A Framework for Energy Efficient Design of Multi-Rate Applications using Hybrid Reconfigurable Systems”. Proc. FPL’04. Antwerp, Belgium.

[6] J. Noguera, R. M. Badia, “HW/SW Codesign Techniques for Dynamically Reconfigurable Architectures”, IEEE Trans. on VLSI Systems. Vol. 10. Issue 4. August 2002.

[7] J. Noguera, R. M. Badia, “System-Level Power-Performance Trade-Offs in Task Scheduling for Dynamically Reconfigurable Architectures”, Proc. of CASES’03.

[8] J. Noguera, R. M. Badia, "Multitasking on Reconfigurable Architectures: Micro-architecture Support and Dynamic Scheduling". ACM TECS, May 2004.

[9] J. Noguera, R. M. Badia, "Power-Performance Trade-offs for Reconfigurable Computing". Proc. of CODES+ISSS'04, September 2004.

[10] K. Purna, D. Bhatia, "Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers", IEEE Trans. on Computers, vol. 48, no. 6.

[11] T. Callahan and J. Wawrzynek, “Adapting Software Pipelining for Reconfigurable Computing”. In Proc. of CASES'00, November, 2000.

[12] M. Weinhardt and W. Luk, "Pipeline Vectorization". IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 20(2):234-248, February 2001.

[13] S. Wilton, S. Ang and W. Luk, "The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays". Proc. FPL'04. Antwerp, Belgium.

[14] Greg Snider, “Performance-constrained pipelining of software loops onto reconfigurable hardware”. Proc of FPGA’02.

[15] G. Stitt, F. Vahid, “Using on-chip configurable logic to reduce embedded system software energy”. Proc. FCCM’02.

[Fig. 9. Energy results: energy consumption (mJ) of the unsharp masking application for xc2-250, xc2-500 and xc2-1000 devices with 2 and 3 DRPs, comparing seq, pipe2, pipe2_block and pipe2_rep_smart; (a) off-chip memory; (b) on-chip memory.]
