
VLSI ARCHITECTURE OF THE RECONFIGURABLE COMPUTING ENGINE FOR DIGITAL SIGNAL PROCESSING APPLICATIONS

Lien-Fei Chen and Yeong-Kang Lai

Department of Electrical Engineering, National Chung Hsing University, Taiwan, R.O.C.

Email: [email protected], [email protected]

ABSTRACT

In this paper, a novel reconfigurable computing engine for digital signal processing applications is proposed. The kernel component of the reconfigurable computing (RC) engine is the general-purpose processing cluster (GPPC) array, which is built from GPPCs organized as an MIMD model to achieve high flexibility in mapping applications and algorithms to the RC engine. Each GPPC performs data-parallel operations efficiently using SIMD instructions. A GPPC can therefore execute not only 32-bit operations but also four-way 8-bit or two-way 16-bit operations simultaneously. For efficient connectivity, an inter-GPPC-row reconfigurable network is also proposed to meet the requirements of high flexibility, low complexity, small area, and short network delay.

1. INTRODUCTION

Owing to the increasing computational complexity of DSP algorithms and the evolution of various emerging standards, which need more flexible and powerful tools, traditional design methodology is no longer sufficient to support real-time data processing with ample flexibility. An ASIC architecture can meet the performance and power consumption requirements of its targeted application, but lack of flexibility is its chief shortcoming. Programmable microprocessors and programmable DSPs offer great flexibility because their versatile instruction sets allow the implementation of any computable task. However, they are not sufficient to handle computation-intensive and data-intensive applications such as an MPEG-4 or H.264 video encoder.

In the last decade, a new class of reconfigurable computing architectures has emerged. Reconfigurable computing architectures promise to overcome this traditional trade-off and achieve both the performance of ASICs and the flexibility of general-purpose processors. Many innovative reconfigurable architectures have been proposed by universities and corporations [1]-[7]. For different targeted applications, reconfigurable computing structures must weigh different considerations such as granularity, memory structure, instruction stream, and interconnection topology. Each of these factors seriously impacts the performance, energy efficiency, and flexibility of a reconfigurable architecture. This paper presents a novel reconfigurable computing architecture to efficiently perform digital signal processing operations.

2. THE PROPOSED RECONFIGURABLE COMPUTING ARCHITECTURE

Fig. 1. The organization of the proposed reconfigurable computing system.

Fig. 1 depicts the components and organization of the proposed reconfigurable computing system. The system consists of three major parts: the proposed reconfigurable engine, an ARM RISC processor, and other peripheral devices such as a DMA controller and a high-bandwidth memory interface, which is the bridge between the AHB bus and the external main memory.

2.1. System Components

Fig. 2. Block diagram of the proposed reconfigurable computing engine (RC engine) with the inter-GPPC-row reconfigurable network for general-purpose DSP applications.

The reconfigurable computing engine (RC engine) for digital signal processing applications is the main component of the proposed reconfigurable computing system. Fig. 2 shows the detailed architecture of the proposed RC engine. The RC engine is a hierarchical architecture that consists of one or more array slices. Each slice is a 4 × 4 GPPC array with four GPPC rows, and each row comprises four GPPCs operating as an MIMD architecture. Because DSP algorithms are inherently compute-intensive and data-parallel, each GPPC has a data memory module (D-Mem) with two banks, an instruction memory module (I-Mem), and four RC cells that operate as an SIMD architecture to perform DSP operations in parallel. The inter-GPPC-row reconfigurable network guarantees arbitrary, full connectivity between two adjacent GPPC rows to achieve complete flexibility.

The system controller of the reconfigurable computing system is a high-performance ARM RISC processor. The ARM processor handles the system control of the proposed reconfigurable computing engine and the general operations that offer little parallelism.

2.2. Features of the Proposed RC Engine

The RC engine is configured through instructions stored in the I-Mem within each GPPC. Since the GPPC follows the SIMD model with SIMD instructions and the split-ALU concept of computation, the four RC cells in a GPPC share the same instruction word. In brief, the important features of the proposed RC engine are:

• Multi-level coarse-grain architecture: The datapath of each RC cell is 8 bits wide and executes 8-bit operations. A GPPC can perform four 8-bit operations on its four RC cells simultaneously, combine two RC cells to execute a 16-bit operation, or combine four RC cells to execute a 32-bit operation. We therefore call the GPPC a multi-level coarse-grain architecture, which can operate at three granularities (8, 16, or 32 bits) using the proposed SIMD instructions.

• Interconnection network that balances performance, utilization, and flexibility: The trade-off among performance, utilization, and flexibility is always a design challenge for the interconnection network topology. Fine-grain architectures such as FPGA devices have poor interconnection network utilization. Most coarse-grain architectures, such as MorphoSys and REMARC, have simple and regular interconnection networks to support their SIMD models; however, poor connection flexibility is the major drawback of the interconnection networks in these coarse-grain architectures. Our mapping analysis of many DSP algorithms shows that a full-interconnection, high-performance network topology between adjacent GPPC rows is necessary for efficient resource transmission and utilization. Accordingly, the inter-GPPC-row reconfigurable network within the RC engine is designed to avoid poor flexibility when performing DSP algorithms.

3. RC ENGINE COMPONENTS

In this section, the kernel components of the RC engine, namely the reconfigurable computing (RC) cell, the general-purpose processing cluster (GPPC), and the RC engine interconnection network, are described.

3.1. General-Purpose Processing Cluster (GPPC)

The GPPC array is the reconfigurable core of the RC engine. The RC engine, which is a 4 × 4 array, consists of four GPPC rows, and each row is constructed of four identical GPPCs, as shown in Fig. 2.

Fig. 3. The detailed architecture of the GPPC.

As shown in Fig. 3, the GPPC, which adopts the SIMD model, is composed of four identical RC cells, a data memory module (D-Mem) with two banks, and an instruction memory module (I-Mem). Each RC cell is the basic reconfigurable unit, and its functional model is similar to the 8-bit datapath of a conventional microprocessor. Following the split-ALU concept, the four RC cells can perform four 8-bit operations or two 16-bit operations simultaneously, or a single 32-bit operation. The SIMD instruction, which programs all RC cells at the same time, is broadcast from the I-Mem in the GPPC.
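The split-ALU behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name simd_add and the choice of addition as the example operation are hypothetical, and the paper does not specify how carries are gated between RC cells in hardware. The model simply blocks carries at lane boundaries, which is the usual meaning of a split ALU.

```python
def simd_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words lane by lane.

    lane_bits = 8  models four independent 8-bit RC cells,
    lane_bits = 16 models two pairs of RC cells,
    lane_bits = 32 models all four RC cells combined.
    Carries never cross a lane boundary (illustrative assumption).
    """
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16, or 32")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Keep only the bits that fit in the lane; the carry-out is dropped.
        result |= (lane_sum & mask) << shift
    return result


if __name__ == "__main__":
    a, b = 0x00FFFFFF, 0x00000001
    for bits in (8, 16, 32):
        print(bits, hex(simd_add(a, b, bits)))
```

With the same operands, the three granularities give different results because the carry out of the lowest lane propagates only as far as the lane width allows.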

Fig. 4. RC cell architecture.

The basic functional unit of the computing core is the RC cell. It is composed of an ALU, a multiplier (MUL), a barrel shifter, and two multiplexers that select the appropriate input data, as shown in Fig. 4. Both the ALU and the MUL are designed for 8-bit inputs. In addition to standard logic and arithmetic functions, the ALU supports other functions such as computing the absolute value of the difference between two operands, comparing two operands, and multiplying by a constant.

The two input multiplexers (Fig. 4) select one of several inputs for the ALU or MUL datapath, based on control bits from the instruction in the instruction register of the RC cell. These inputs include the outputs of a non-adjacent GPPC row, the outputs of the inter-GPPC-row reconfigurable network, and the D-Mem within the GPPC.

The I-Mem broadcasts the instruction word to the RC cells in the GPPC. An instruction word is loaded every execution cycle from the I-Mem into the instruction register of each RC cell. To achieve instruction compression, the ARM RISC processor broadcasts instruction identifiers, and each identifier is looked up locally in the I-Mem within the GPPC. The major focus of the proposed reconfigurable computing engine is on regular, data-parallel DSP applications. Based on this regularity and sub-parallelism of DSP applications, each instruction word is broadcast to the RC cells of the GPPC. Thus, all four RC cells within a GPPC share the same instruction word and perform SIMD operations.
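The identifier-based instruction distribution can be sketched as follows. All names here (the GPPC class, receive, broadcast, and the instruction mnemonics) are hypothetical and chosen only for illustration. The sketch also assumes that one identifier is sent to every GPPC and that each GPPC may hold different I-Mem contents; the paper does not state either detail.

```python
class GPPC:
    """Toy model of one GPPC: a local I-Mem plus four RC-cell instruction registers."""

    RC_CELLS = 4

    def __init__(self, i_mem):
        # i_mem maps a short identifier to a full instruction word (hypothetical encoding).
        self.i_mem = dict(i_mem)
        self.instruction_registers = [None] * self.RC_CELLS

    def receive(self, identifier):
        # Local lookup: the identifier is expanded inside the GPPC.
        word = self.i_mem[identifier]
        # SIMD: all four RC cells load the same instruction word.
        self.instruction_registers = [word] * self.RC_CELLS
        return word


def broadcast(gppcs, identifier):
    """Model the controller sending one identifier to every GPPC."""
    return [gppc.receive(identifier) for gppc in gppcs]


if __name__ == "__main__":
    row = [GPPC({0: "ABSDIFF8"}), GPPC({0: "ADD16"})]
    print(broadcast(row, 0))
```

Because only the identifier travels from the controller, the full instruction word never has to cross the system bus, which is one plausible reading of the instruction compression mentioned above.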

To supply the RC cells with input data adequately, the D-Mem is organized into two banks, Bank 1 and Bank 2. The D-Mem provides data for GPPC computations and also stores processed data from the GPPC. The two-bank organization allows two input operands to be fetched from the data memory at the same time.

3.2. Interconnection Network

The RC engine is a hierarchical architecture organized as one or more slices of the GPPC array. Accordingly, the interconnection network of the proposed RC engine consists of three layers. The first two layers belong to the intra-slice interconnection network, which connects the GPPCs within the same slice. The inter-GPPC-row reconfigurable network is the first layer. The second layer connects two GPPCs across a GPPC row, as shown in Fig. 2. The third layer is an optional network used when the RC engine has multiple slices of the GPPC array; it interconnects GPPCs across slices. The detailed topology of the inter-GPPC-row reconfigurable network is described in the following paragraphs.

Fig. 5. Inter-GPPC-row reconfigurable network.

It is hard to satisfy both flexibility and complexity requirements for an interconnection network topology at the same time. When the interconnection network is a full interconnection network, which has the highest flexibility, its complexity, area overhead, and network delay become a challenge for the reconfigurable architecture. Conversely, a regular, low-complexity interconnection network restricts the mapping of many DSP algorithms and applications. Our analysis of DSP algorithm and application mappings reveals that the RC engine must have a highly flexible network because of the high utilization and complexity of the communication between adjacent GPPC rows. Hence, as shown in Fig. 5, the inter-GPPC-row reconfigurable network is a full, bi-directional interconnection network. The proposed topology is a multistage network that offers high flexibility, low complexity, small area, and short network delay, and it can also duplicate data.

Fig. 6. Connection block types 1 to 3.

The inter-GPPC-row reconfigurable network is a seven-stage network with Benes-like connecting wires, and it is symmetric. Stages 1, 2, 6, and 7 consist of type-1 connection blocks, stages 3 and 5 consist of type-2 connection blocks, and stage 4 consists of type-3 connection blocks. To make the interconnection bi-directional, a basic pass block consists of a pair of tri-state buffers, as shown in Fig. 6(a). The structures of the type-1 to type-3 blocks are shown in Fig. 6.
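As a consistency check on the stage count, a classical Benes network built from 2 × 2 switches with N ports has 2·log2(N) − 1 stages. The sketch below assumes, without confirmation from the paper, that the network serves 16 ports per GPPC row (four GPPCs with four RC cells each); under that assumption the classical formula also gives seven stages. The helper names and the dictionary encoding of block types per stage are hypothetical, and since the proposed network is only "Benes-like", this is an analogy rather than a description of the actual design.

```python
def benes_stage_count(ports: int) -> int:
    """Stage count of a classical Benes network of 2x2 switches with `ports` ports."""
    if ports < 2 or ports & (ports - 1):
        raise ValueError("ports must be a power of two and at least 2")
    log2_ports = ports.bit_length() - 1  # exact log2 for powers of two
    return 2 * log2_ports - 1


# Connection block type used at each stage, as listed in the text.
STAGE_BLOCK_TYPES = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2, 6: 1, 7: 1}


def is_symmetric(stage_types: dict) -> bool:
    """True if stage s and stage (n + 1 - s) use the same block type."""
    n = len(stage_types)
    return all(stage_types[s] == stage_types[n + 1 - s] for s in range(1, n + 1))


if __name__ == "__main__":
    print(benes_stage_count(16))            # assumed 16 ports per GPPC row
    print(is_symmetric(STAGE_BLOCK_TYPES))
```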

4. MAPPING ALGORITHMS AND PERFORMANCE ANALYSIS

In this section, we discuss the mapping of DSP algorithms. Video compression has a high degree of data parallelism and tight real-time constraints. Here, we discuss two major functions of the MPEG video encoder as mapped to the proposed RC engine: motion estimation using FSBMA and transform coding using the DCT.

4.1. Video Compression: Motion Estimation for FSBMA

Among the different block-matching methods, the full-search block-matching algorithm (FSBMA) involves the most computation. However, FSBMA provides an optimal solution with low control overhead.
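For reference, a plain software version of full-search block matching is sketched below. It uses the sum of absolute differences (SAD) as the matching criterion, which agrees with the absolute-difference operation of the RC cell ALU, but the function names and the data layout (lists of pixel rows, with the search area padded by the search range on every side) are assumptions made for this illustration and say nothing about how the hardware schedules the work.

```python
def sad(current, search_area, top, left):
    """Sum of absolute differences between `current` and the candidate at (top, left)."""
    n = len(current)
    return sum(
        abs(current[r][c] - search_area[top + r][left + c])
        for r in range(n)
        for c in range(n)
    )


def full_search(current, search_area, search_range):
    """Exhaustively test every displacement in [-search_range, +search_range]^2.

    `search_area` must be (n + 2 * search_range) pixels on each side.
    Returns (best_sad, (dy, dx)); ties keep the first candidate found.
    """
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cost = sad(current, search_area, dy + search_range, dx + search_range)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


if __name__ == "__main__":
    block = [[0] * 4 for _ in range(4)]
    block[0][0] = 100
    area = [[0] * 8 for _ in range(8)]
    area[3][1] = 100  # the block content sits at displacement (dy, dx) = (1, -1)
    print(full_search(block, area, 2))
```

Every candidate is evaluated, which is why the method is optimal for the chosen criterion and also why it is the most expensive of the block-matching methods.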

When the proposed reconfigurable architecture performs the FSBMA computation, one current block and its corresponding search area are first loaded into the D-Mem of the GPPCs. The first GPPC row accesses the pixels of the current block and the candidate block from Bank 1 and Bank 2 of the D-Mem, respectively, to implement the block-matching process. The first GPPC row can therefore calculate sixteen absolute difference values simultaneously. The other GPPCs are configured as a parallel adder tree and comparator. The overall architecture is a seven-stage pipeline that performs the block-matching motion estimation algorithm. For a current block size of N = 16 and a search range from −8 to +8, there are 289 candidate blocks in each search area, and the simulation result shows that a total of 4631 cycles are required to complete the block-matching process for the current block. The literature [4] shows that MorphoSys has the best performance compared with some ASIC architectures and a high-performance DSP processor, the TMS320C64x. For the same case, MorphoSys requires 4692 cycles to complete the block-matching process [4].
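The figures quoted above can be cross-checked with a short calculation. The candidate count follows directly from the search range. The cycle model is our own reading and is not stated in the paper: it assumes that each 16 × 16 candidate needs 256 / 16 = 16 cycles, because sixteen absolute differences are produced per cycle, and that the seven-stage pipeline adds seven cycles of fill latency once. Under those assumptions the total happens to match the reported 4631 cycles. The function names are hypothetical.

```python
def candidate_count(search_min: int, search_max: int) -> int:
    """Number of candidate blocks for a square search range [search_min, search_max]."""
    positions = search_max - search_min + 1
    return positions * positions


def fsbma_cycles(block_size, search_min, search_max, diffs_per_cycle, pipeline_stages):
    """Assumed cycle model: per-candidate cycles plus a one-time pipeline fill."""
    pixels = block_size * block_size
    cycles_per_candidate = -(-pixels // diffs_per_cycle)  # ceiling division
    return candidate_count(search_min, search_max) * cycles_per_candidate + pipeline_stages


if __name__ == "__main__":
    print(candidate_count(-8, 8))
    print(fsbma_cycles(16, -8, 8, diffs_per_cycle=16, pipeline_stages=7))
```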

4.2. Video Compression: Discrete Cosine Transform (DCT) for MPEG

The discrete cosine transform (DCT) is used for transform coding in image and video encoders. The 2-D DCT is applied to an 8 × 8 block. A popular way to implement the 2-D DCT is row-column decomposition: the 1-D DCT is computed along each row and then along each column. We adopt a popular fast eight-point 1-D DCT algorithm, Chen's algorithm, to implement the 8 × 8 2-D DCT. Chen's algorithm involves 16 multiplications and 26 additions, as shown in Fig. 7, and it is very suitable for programmable general-purpose processors and DSPs.
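Row-column decomposition can be sketched in a few lines. Note that the 1-D transform below is the direct O(N²) orthonormal DCT-II formula, not Chen's fast algorithm; it is substituted here only to keep the example short, and the function names are hypothetical. The structure of the 2-D routine (rows first, then columns) is the part that corresponds to the text.

```python
import math


def dct_1d(x):
    """Direct orthonormal DCT-II of a sequence (not Chen's fast algorithm)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)
        )
        out.append(scale * acc)
    return out


def dct_2d(block):
    """2-D DCT by row-column decomposition: 1-D DCT on rows, then on columns."""
    row_pass = [dct_1d(row) for row in block]
    # zip(*m) transposes, so each tuple below is one column of the row-pass result.
    column_pass = [dct_1d(list(col)) for col in zip(*row_pass)]
    # Transpose back so that out[u][v] is indexed as (vertical, horizontal) frequency.
    return [list(row) for row in zip(*column_pass)]


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    coeffs = dct_2d(flat)
    print(round(coeffs[0][0], 6))  # a constant block has only a DC coefficient
```

Each pass consists of eight independent eight-point transforms, which is what makes the decomposition a natural fit for a row-organized SIMD array.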

Fig. 7. Chen's algorithm for the eight-point 1-D DCT.

According to the type of output signal, Chen's algorithm can be partitioned into an even part and an odd part. The odd part is computationally more complex than the even part: the even part needs four pipeline stages to complete its 1-D DCT, whereas the odd part needs six pipeline stages to produce its results. To map the DCT onto the proposed RC engine using Chen's algorithm, the 4 × 4 GPPC array is partitioned into an upper part and a lower part. Each part has two GPPC rows and maps either the even part or the odd part of Chen's algorithm. Evidently, the lower part of the RC engine takes more time to compute Chen's algorithm than the upper part. Fig. 8 shows the operation steps for the 1-D DCT.

Fig. 8. Operation steps for the RC engine performing the 2-D DCT using Chen's algorithm.

Because of the computational complexity of the odd part of Chen's algorithm, the eight 1-D DCTs over the rows of an 8 × 8 block are calculated in 24 cycles. The same holds for the eight columns. Therefore, the proposed RC engine requires 51 cycles (3 + 24 + 24, where 3 is the latency due to the pipeline stages) to complete the 2-D DCT on an 8 × 8 block. MorphoSys [4] requires only 37 cycles to perform the 2-D DCT because of its 8 × 8 RC cell array with a 16-bit ALU and multiplier in each RC cell. REMARC [5], another 8 × 8 reconfigurable coprocessor with 16-bit nano processors, takes 54 cycles to complete the 2-D DCT. A high-performance DSP video processor, the TI TMS320C64x, needs 76 cycles [8], while a dedicated superscalar multimedia processor, the NEC V830R/AV, requires 201 cycles [4]. The Intel Pentium II takes 240 cycles to compute the 2-D DCT with MMX instructions.
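The cycle counts quoted above can be tabulated and compared with a few lines of code. The numbers are taken directly from the text; the helper names are hypothetical. The ratios are raw cycle counts relative to the proposed RC engine and take no account of clock frequency, array size, or datapath width, none of which are given here.

```python
def dct_2d_cycles(pipeline_latency: int = 3, row_pass: int = 24, column_pass: int = 24) -> int:
    """Total cycles for the 2-D DCT: pipeline latency plus the two 1-D passes."""
    return pipeline_latency + row_pass + column_pass


# Cycle counts for one 8 x 8 2-D DCT as reported in the text.
REPORTED_CYCLES = {
    "Proposed RC engine": 51,
    "MorphoSys": 37,
    "REMARC": 54,
    "TI TMS320C64x": 76,
    "NEC V830R/AV": 201,
    "Intel Pentium II (MMX)": 240,
}


def relative_to_proposed(cycles: dict) -> dict:
    """Cycle count of each design divided by that of the proposed RC engine."""
    base = cycles["Proposed RC engine"]
    return {name: round(count / base, 2) for name, count in cycles.items()}


if __name__ == "__main__":
    print(dct_2d_cycles())
    for name, ratio in relative_to_proposed(REPORTED_CYCLES).items():
        print(f"{name}: {ratio}")
```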

5. CONCLUSION

A novel reconfigurable computing engine for digital signal processing applications is proposed in this paper. Based on its kernel components, a 4 × 4 GPPC array organized as an MIMD structure with each GPPC following the SIMD model, the proposed reconfigurable architecture can perform data-parallel applications efficiently. Furthermore, for efficient connectivity, the inter-GPPC-row reconfigurable network is proposed to achieve flexible communication and short interconnection wire delay.

6. ACKNOWLEDGMENT

This work was supported by the National Science Council of the Republic of China under Contract NSC 92-2218-E-005-008.

7. REFERENCES

[1] J. Hauser and J. Wawrzynek, “Garp: A MIPS Processor with a Reconfigurable Coprocessor,” The 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, pp. 12-21, April 1997.

[2] E. Mirsky and A. DeHon, “MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources,” IEEE Symposium on FPGAs for Custom Computing Machines, pp. 157-166, April 1996.

[3] H. Schmit et al., “PipeRench: A virtualized programmable datapath in 0.18 micron technology,” IEEE Custom Integrated Circuits Conference, pp. 63-66, May 2002.

[4] H. Singh et al., “MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications,” IEEE Transactions on Computers, vol. 49, no. 5, pp. 465-481, May 2000.

[5] T. Miyamori and K. Olukotun, “REMARC: A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications,” IEEE Symposium on FPGAs for Custom Computing Machines, pp. 2-11, April 1998.

[6] B. Salefski and L. Caglar, “Re-configurable computing in wireless,” IEEE Design Automation Conference, pp. 178-183, Jan. 2002.

[7] “CS2000 Reconfigurable Communications Processor,” family product brief, Chameleon Systems, Inc., 2000.

[8] “TMS320C6000 Assembly Benchmarks at Texas Instruments: C64X DSP Benchmarks,” Texas Instruments Inc.
