[IEEE 2009 IEEE International SOC Conference (SOCC) - Belfast, Northern Ireland...

A Radix 2^2 Based Parallel Pipeline FFT Processor for MB-OFDM UWB System

Nuo Li and N.P. van der Meijs

Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)

Delft University of Technology, Delft, Netherlands

Email: [email protected]

Abstract—This paper presents a novel parallel pipeline FFT processor especially tailored for the Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM) Ultra Wideband (UWB) system defined by ECMA International. The proposed Radix 2^2 parallel pipeline processor, which employs a two-parallel-data-path Radix 2^2 algorithm and a single-path delay feedback (SDF) pipeline architecture, is a small-area and low-power solution for the MB-OFDM UWB system. Synthesis results of this architecture targeting both a Xilinx Virtex4 FPGA and a 90 nm ASIC technology with a 1 V supply voltage are presented. The results show that, owing to the revised algorithm and novel architecture, the clock frequency required to meet the ECMA requirement is 264 MHz. The design requires 39000 gates without the testing block, and the corresponding area is 181140 μm^2.

I. INTRODUCTION

Ultra-Wideband (UWB) technology brings the convenience and mobility of wireless communications to high-speed interconnects in devices throughout the digital home and office [1]. The Multiband-OFDM standard is one solution for UWB technology. A proposal for a Multiband-OFDM UWB standard was published by the IEEE 802.15 3a study group [2]. In December 2007, the second revised version of Standard ECMA-368 was released, which specifies the physical layer (PHY) and medium access control layer (MAC) of UWB technology based on Multiband-OFDM [3].

Some key issues need to be solved in designing a CMOS-based Multiband-OFDM UWB solution that meets the low-power requirement. One of these issues concerns the FFT (Fast Fourier Transform) block, which accounts for 25% of the design complexity of the total digital baseband transceiver [4]. Although many results have been published in this research area in the past few years [5], [6], [7], the area and power consumption of the FFT block still need to be improved, since this system targets wireless portable devices. Therefore, this paper focuses on improving area and power consumption under the ECMA-368 standard requirements. Section II describes the requirements for the FFT block and the algorithm on which the proposed design is based. Section III presents the proposed FFT solution at the algorithm, architecture, and implementation levels. Section IV shows the synthesis results for both FPGA and ASIC implementation, together with a comparison with other published implementations.

II. BACKGROUND

A. The Requirements of FFT for Multiband OFDM System

According to ECMA-368, the required sampling frequency is 528 MHz and the total number of subcarriers, which determines the FFT size, is 128. The time period available for the IFFT and FFT is 242.42 ns, which is the FFT size multiplied by the inverse of the sampling frequency ($T_{FFT} = 128 \cdot \frac{1}{f_s}$). There are 37 zero-padded suffix samples, which take 70.08 ns. So the total symbol interval is 312.5 ns ($T_{SYM} = T_{FFT} + T_{ZPS}$).
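The timing figures above can be checked with a few lines of arithmetic. The following Python sketch uses only the numbers quoted in this section; the helper `path_clock_mhz` is a hypothetical name, and it assumes that P parallel data paths each run at f_s / P.

```python
# Timing budget for the 128-point FFT, using the figures quoted above.
FS_HZ = 528e6   # sampling frequency
N_FFT = 128     # FFT size (number of subcarriers)
N_ZPS = 37      # zero-padded suffix samples

t_fft_ns = N_FFT / FS_HZ * 1e9      # time available for the FFT/IFFT
t_zps_ns = N_ZPS / FS_HZ * 1e9      # duration of the zero-padded suffix
t_sym_ns = t_fft_ns + t_zps_ns      # total symbol interval


def path_clock_mhz(paths):
    """Assumption for illustration: each of `paths` parallel data paths runs at fs / paths."""
    return FS_HZ / paths / 1e6


print(round(t_fft_ns, 2), round(t_zps_ns, 2), round(t_sym_ns, 2), path_clock_mhz(2))
# prints: 242.42 70.08 312.5 264.0
```

Under this assumption, two data paths bring the per-path clock down to 264 MHz, the clock frequency reported for the proposed design.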

The word length choice is a critical issue for FFT processor design. The trade-off between chip area and signal to quantization noise ratio (SQNR) directly determines the choice. Based on the analysis in [5] and [8], the word length is chosen to be 10 bits in this paper for simulation and comparison with their designs.

B. The Selection of FFT Algorithms

The traditional radix 2 FFT algorithms have a simple structure and clear data flow, which makes them easy to implement and suitable for generic FFT implementations. Nevertheless, these algorithms need large memory to store data at the inner stages, which costs considerable power and area. Currently, there are two trends in FFT implementation for OFDM systems: mixed-radix algorithms, such as [7], and pipeline-structure-based algorithms, such as [9]. Based on extensive algorithm analysis and selection, the proposed design employs the Radix 2^2 algorithm developed by He and Torkelson [10], which integrates the twiddle factor decomposition every two stages. The Radix 2^2 algorithm has the same multiplicative complexity as the radix 4 algorithm but retains the butterfly structure of the radix 2 algorithm, which makes it very suitable for ASIC implementation.

The detailed derivation of the algorithm can be found in [10]. Its application to an 8-point FFT, shown in Figure 1, is used here to briefly explain the algorithm. In this application the Radix 2^2 algorithm is used only once, for the first two stages, because an 8-point DFT can be decomposed only once by radix 4. For the last stage, the normal radix 2 DIF algorithm is used. By using the Radix 2^2 algorithm, the complex multiplication by the twiddle factor in the first stage is replaced by a multiplication by (−j). Therefore, in a pipeline structure, one complex multiplier can be saved for the 8-point FFT.

Fig. 1. Radix 2^2 based parallel FFT algorithm data flow
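As a rough software illustration of the decomposition described above, the following Python sketch implements a recursive radix-2^2 DIF FFT and compares it with a direct DFT. It is not the pipeline hardware. The function names `fft_radix22` and `dft` are hypothetical, the sketch assumes the index mapping n = (N/2)n1 + (N/4)n2 + n3 and k = k1 + 2k2 + 4k3, and it returns the output in natural order for readability.

```python
import cmath


def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT (illustrative sketch); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        # Leftover plain radix-2 stage, e.g. the last stage of an 8-point FFT.
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign1 = -1 if k1 else 1
            # Trivial factor of the second butterfly: one of 1, -1, -j, j.
            rot = (-1j if k1 else 1) * (-1 if k2 else 1)
            branch = []
            for n3 in range(q):
                a = x[n3] + sign1 * x[n3 + 2 * q]          # first butterfly
                b = x[n3 + q] + sign1 * x[n3 + 3 * q]      # first butterfly
                h = a + rot * b                            # second butterfly
                # The only general complex multiplication, once per two stages.
                w = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n)
                branch.append(h * w)
            sub = fft_radix22(branch)                      # remaining N/4-point DFT
            for k3 in range(q):
                out[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return out


def dft(x):
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


samples = [complex(i, (3 * i) % 5) for i in range(8)]
max_err = max(abs(a - b) for a, b in zip(fft_radix22(samples), dft(samples)))
print(max_err < 1e-9)
```

For N = 8 the recursion applies the radix-2^2 step once and then falls back to the radix-2 base case, which mirrors the three stages of Figure 1.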

III. THE PROPOSED PROCESSOR

The proposed processor is described at the algorithm, architecture, and implementation levels, respectively.

A. The Revision in the Algorithm Level

After analysis of the normal Radix 2^2 algorithm, it is found that the input data can also be separated into odd and even parts, and that these parts are not mixed until the last stage. This is one of the key points of the proposed processor, and it can be exploited in the architecture design to reduce the working frequency and the number of registers used.

The 8-point FFT data flow is again used to illustrate the changes, which are also shown in Figure 1. The dashed lines show the odd input data flow, while the solid lines show the even input data flow. In the first and second stages, the dashed lines and solid lines do not cross, which means that the even and odd input data can be processed separately in these stages. Only in the last stage do the dashed and solid lines cross, which means that the even and odd data must be mixed there.

The 128-point parallel algorithm data flow with the twiddle factor positions is shown in Figure 2. The horizontal lines are not shown. The input data and twiddle factors are separated into even and odd parts, which are processed separately through the first six stages and combined only in the final, seventh stage. Note that the output data are produced in bit-reversed order.
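The separation of even and odd data can be checked directly on the butterfly index pairs of a radix-2 style DIF flow graph. The following Python sketch is an illustration under that assumption: in stage s, a butterfly combines rows that are N/2^s apart. The function names are hypothetical.

```python
def butterfly_pairs(n_points, stage):
    """Row index pairs combined by the butterflies of one DIF stage (stage = 1 .. log2 N)."""
    dist = n_points >> stage                  # distance between the two butterfly inputs
    pairs = []
    for start in range(0, n_points, 2 * dist):
        for i in range(start, start + dist):
            pairs.append((i, i + dist))
    return pairs


def stage_mixes_even_and_odd(n_points, stage):
    """True if any butterfly of this stage combines an even-indexed row with an odd-indexed row."""
    return any((a % 2) != (b % 2) for a, b in butterfly_pairs(n_points, stage))


mixing = [stage_mixes_even_and_odd(128, s) for s in range(1, 8)]
print(mixing)
# prints: [False, False, False, False, False, False, True]
```

The butterfly distance stays even for the first six stages of the 128-point graph, so even and odd rows never meet there. Only the seventh stage, with distance 1, combines them.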

B. Architecture Level

From the previous analysis, two-path parallelism in the first six stages is appropriate for the structure design, because these six stages can process the even and odd input data separately and only the last (seventh) stage needs to mix them. Nevertheless, this architecture imposes some extra requirements. First, a demultiplexer is required to separate the input data into the even and odd parts. On the other hand, the controller can be shared by the even and odd paths. Special care should be taken to generate the right control signals for the last stage so that the even and odd parts are combined properly.

Fig. 2. The 128 point parallel Radix 2^2 based algorithm data flow

Figure 3 shows the proposed parallel pipeline architecture. It has seven stages and consists of demultiplexers, circular buffers, ROMs, complex multipliers, and butterfly units. BF1 denotes butterfly type 1, which consists of four 2-to-1 multiplexers and four adders. BF2 denotes butterfly type 2, which additionally swaps the real and imaginary parts and inverts the sign, because of the (−j) multiplication required by the Radix 2^2 algorithm. First, the input data are streamed in and handled by the demultiplexer. These data are then processed in the even and odd parts of the architecture, where the dashed arrows show the flow of the odd data and the solid lines show the even data. For each of the odd and even parts, a single-path delay feedback (SDF) pipeline structure [10] is used to process the data separately. There are three controllers, which produce the control signals and the addresses for reading the twiddle factors from the ROM. The even and odd parts of each stage share the same controller. There are five complex multipliers in the architecture. In the sixth stage, the outputs of the even part need neither multiplication nor twiddle factor storage, as can be seen in Figure 3. The reason is that, after twiddle factor separation in this stage, all the twiddle factors in the even part become the constant 1. Therefore, no multiplication is required.
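The reason BF2 needs no multiplier for the (−j) factor follows from (a + jb)(−j) = b − ja. The following Python sketch, with the hypothetical helper `mul_minus_j`, shows this as a swap and a sign inversion on integer real and imaginary parts.

```python
def mul_minus_j(re, im):
    """Multiply (re + j*im) by -j using only a swap and one negation."""
    return im, -re


re_out, im_out = mul_minus_j(3, 4)
print(re_out, im_out)                                 # prints: 4 -3
print(complex(3, 4) * (-1j) == complex(re_out, im_out))  # prints: True
```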

Fig. 3. The parallel Radix 2^2 based pipeline architecture

C. Implementation Level

As can be seen from Figure 3, there are seven stages. Based on the required control, it is advantageous to combine stages 1 and 2, stages 3 and 4, and stages 5 and 6 into three common controller blocks. These common controller blocks all have the structure shown in Figure 4. Therefore, the whole parallel architecture can be divided into the first three common controller blocks, the last block, and the arithmetic blocks. The arithmetic blocks are composed of five ROMs and complex multipliers.

Fig. 4. The common controller block

The basic idea of the data flow in these common controller blocks is that stage 1 repeats after calculating $N/(r\,2^{r})$ data and stage 2 repeats after calculating $N/(r\,2^{r+1})$ data, where r (r = 1, 2, 3) is the index of the common controller block and N is the FFT size. Only one counter is used to produce control signals I and II for both stage 1 and stage 2. For the first common controller block, control signal I is first set to zero so that N/4 data are read into stage 1. In the following N/4 cycles, control signal I is set to one to enable the butterfly function in stage 1. At the same time, stage 2 reads in N/8 data output by stage 1, which is controlled by control signal II. In the next N/8 cycles, butterfly II of stage 2 works and control signal II equals one. The data flow analysis is shown in Figure 5.

Fig. 5. The operation modes of the block
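The schedule of the first common controller block can be sketched in software. The following Python sketch is an assumption-laden illustration, not the authors' controller: it derives both control signals from one counter by integer division, ignores pipeline register latency between the two stages, and uses the hypothetical name `controller_block_signals`.

```python
def controller_block_signals(n_fft=128):
    """One period of (count, ctrl_i, ctrl_ii) for the first common controller block.

    Assumption: both signals are derived from a single counter, and a value of 0
    means "read data in" while 1 means "butterfly active".
    """
    fill_1 = n_fft // 4          # cycles stage 1 spends reading data in
    fill_2 = n_fft // 8          # cycles stage 2 spends reading data in
    rows = []
    for count in range(2 * fill_1):
        ctrl_i = (count // fill_1) % 2
        ctrl_ii = (count // fill_2) % 2
        rows.append((count, ctrl_i, ctrl_ii))
    return rows


rows = controller_block_signals(128)
# While stage 1 runs its butterfly (ctrl_i == 1), stage 2 first reads N/8 data, then computes.
print([r for r in rows if r[0] in (0, 31, 32, 47, 48, 63)])
# prints: [(0, 0, 0), (31, 0, 1), (32, 1, 0), (47, 1, 0), (48, 1, 1), (63, 1, 1)]
```

With N = 128, control signal I is zero for 32 cycles and one for the next 32, and control signal II toggles every 16 cycles, which matches the N/4 and N/8 intervals described for the first block.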

The last block includes only the seventh stage. Because the odd and even data need to be commutated, two demultiplexers seem to be required to switch the data, as shown in Figure 3. However, this can be improved by analyzing the scheduling of the last stage. Only one butterfly is working per clock cycle, and the first output of the even path is processed with the first output of the odd path of the 6th stage. As long as the timing is matched, the even path outputs are processed with the corresponding odd path outputs. Therefore, the two demultiplexers are not necessary, and only one butterfly is required in the last stage to process the data. The modified structure of the last stage and its interface with the previous stage are shown in Figure 6.

Fig. 6. The improved version of the 7th stage

IV. IMPLEMENTATION AND RESULT ANALYSIS

A. FPGA Implementation

The proposed design is synthesized and implemented with Xilinx ISE, targeting a Xilinx Virtex4 FPGA. The arithmetic blocks are directly mapped to DSP48 components in the Xilinx Virtex4. Table I gives the performance of the proposed implementation and a comparison with [7]. The table clearly shows the reduced resource count of the proposed design compared with the implementation in [7]. The reason is that the proposed design employs far fewer memory blocks and complex multipliers.

TABLE I
THE COMPARISON WITH [7]

                                  [7]      Proposed
Word length (bits)                11       10
Total Number Slice Registers      7390     717
*Number used as Flip Flops        3860     457
Total Number of 4 input LUTs      12749    2230
Number of DSP48s                  48       20

The word length used is lower than in [7]. However, even when the word length of the proposed design is increased to 15, the total equivalent gate count is still much lower than that of [7]. At 15 bits, the numbers of slice registers, 4-input LUTs, and DSP48s of the proposed design are 1052, 3600, and 20, respectively.

B. ASIC targeted results

The proposed design is also synthesized with Synopsys Design Compiler, targeting ASIC implementation. The synthesis library is the Faraday 90 nm standard cell library [11], which is tailored for the UMC 90 nm logic LL-RVT (Low-K) process. During the implementation stage of our processor, [8] was published, which employs a similar parallel structure. However, there are some key differences between the two architectures. The most important differences are in the first and last stages, where the proposed design reduces the number of shift registers and the latency of the processor. Table II gives the performance of the proposed implementation and a comparison with other state-of-the-art designs. The table shows that the number of gates used by the proposed design is only 55% of that of [8]. If 180 nm technology is scaled linearly to 90 nm, the area is reduced by a factor of 4. Hence, the design of [12] in 180 nm would correspond to an area of 616595.5 μm^2 in 90 nm technology, which is still much larger than that of the proposed design.
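The two comparisons above are simple arithmetic on the Table II figures. The following Python sketch reproduces them, assuming ideal scaling in which area shrinks with the square of the feature-size ratio. The variable and function names are hypothetical.

```python
# Figures taken from Table II.
gates_proposed = 38540
gates_ref8 = 70000
area_proposed_um2 = 181140
area_ref12_180nm_um2 = 2466382


def scale_area(area_um2, from_nm, to_nm):
    """Ideal scaling assumption: area scales with the square of the feature-size ratio."""
    return area_um2 * (to_nm / from_nm) ** 2


gate_ratio = gates_proposed / gates_ref8
area_ref12_90nm_um2 = scale_area(area_ref12_180nm_um2, 180, 90)

print(round(gate_ratio * 100))                           # prints: 55
print(area_ref12_90nm_um2)                               # prints: 616595.5
print(round(area_ref12_90nm_um2 / area_proposed_um2, 1))  # prints: 3.4
```

Even after scaling, the area of [12] remains roughly 3.4 times that of the proposed design under this idealized assumption.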

TABLE II
THE ASIC IMPLEMENTATION COMPARISON

                                 Proposed implementation   [8]               [12]
Technology                       90 nm, 1 V                0.18 μm, 1.8 V    0.18 μm, 1.8 V
Clock frequency (MHz)            264                       450               250
Parallel data format             2 data-path               2 data-path       4 data-path
Algorithm                        Radix 2^2                 Radix 2^4         Mixed Radix
Word length (bits)               10                        10                10
Complex multipliers              5                         2+0.41            2+2.48
Registers                        128                       190               -
Gates                            38540                     70000             -
Area (μm^2)                      181140                    -                 2466382
Area (μm^2) scaled for 90 nm     181140                    -                 616595.5

V. CONCLUSION

In this paper, a novel parallel pipeline FFT processor is designed for the ECMA-368 standard. Our architecture is based on a revised version of the Radix 2^2 algorithm. Our revision amounts to restructuring the associated signal flow graph into an even and an odd part. As a result, it achieves not only the low multiplier count of the standard Radix 2^2 algorithm, but also a 50% reduction of the clock frequency and the lowest circular buffer count compared with traditional SDF architectures. Both FPGA and ASIC targeted synthesis results of this architecture are presented. The results show that the required area is dramatically reduced by the proposed design.

REFERENCES

[1] Intel, “Ultra-wideband (UWB) technology,” http://www.intel.com/technology/comms/uwb/.

[2] A. Batra et al., “Multi-band OFDM physical layer proposal for IEEE 802.15 Task Group 3a,” Tech. Rep., IEEE P.802.15-04/0493r0, 2004.

[3] Standard ECMA-368: High Rate Ultra Wideband PHY and MAC Standard, 2nd Edition.

[4] A. Batra, J. Balakrishnan, G. Aiello, J. Foerster, and A. Dabak, “Design of a multiband OFDM system for realistic UWB channel environments,” Microwave Theory and Techniques, IEEE Transactions on, vol. 52, no. 9, pp. 2123–2138, Sept. 2004.

[5] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” Solid-State Circuits, IEEE Journal of, vol. 40, no. 8, pp. 1726–1735, Aug. 2005.

[6] R. Chidambaram, “A scalable and high-performance FFT processor, optimized for UWB-OFDM,” Master’s thesis, Delft University of Technology, 2005.

[7] N. Rodrigues, H. Neto, and H. Sarmento, “A OFDM module for a MB-OFDM receiver,” Design & Technology of Integrated Systems in Nanoscale Era, 2007. DTIS. International Conference on, pp. 25–29, Sept. 2007.

[8] J. Lee and H. Lee, “A High-Speed Two-Parallel Radix-2^4 FFT/IFFT Processor for MB-OFDM UWB Systems,” IEICE Trans Fundamentals, vol. E91-A, no. 4, pp. 1206–1211, 2008.

[9] E. Saberinia, K. C. Chang, G. Sobelman, and A. H. Tewfik, “Implementation of a Multi-band Pulsed-OFDM Transceiver,” J. VLSI Signal Process. Syst., vol. 43, no. 1, pp. 73–88, 2006.

[10] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” Parallel Processing Symposium, 1996., Proceedings of IPPS ’96, The 10th International, pp. 766–770, Apr 1996.

[11] FARADAY, FSD0A A 90 nm Logic SP-RVT (Low-K) Process. FARADAY Technology Corporation, 2006.

[12] T. Chakraborty and S. Chakrabarti, “A reduced area 1 GSPS FFT design using MRMDF architecture for UWB communication,” in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on, 30 2008-Dec. 3 2008, pp. 1128–1131.
