A Radix 2^2 Based Parallel Pipeline FFT Processor for MB-OFDM UWB System
Nuo Li and N.P. van der Meijs
Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)
Delft University of Technology, Delft, Netherlands
Email: [email protected]
Abstract—This paper presents a novel parallel pipeline FFT processor tailored to the Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM) Ultra Wideband (UWB) system defined by ECMA International. The proposed Radix 2^2 Parallel Pipeline processor, which combines a two-parallel-data-path Radix 2^2 algorithm with a single-path delay feedback (SDF) pipeline architecture, is a small-area and low-power solution for MB-OFDM UWB systems. Synthesis results are presented for both a Xilinx Virtex4 FPGA and a 90 nm ASIC technology with a 1 V supply voltage. The results show that, owing to the revised algorithm and the novel architecture, a clock frequency of 264 MHz suffices to meet the ECMA requirement. The design requires 39000 gates without the testing block, and the corresponding area is 181140 μm^2.
I. INTRODUCTION
Ultra-Wideband (UWB) technology brings the convenience
and mobility of wireless communications to high-speed
interconnects in devices throughout the digital home and office
[1]. Multiband OFDM is one approach to UWB technology.
A proposal for a Multiband OFDM UWB standard
was published by the IEEE 802.15.3a study group [2]. In December
2007, the second edition of Standard ECMA-368 was
released, which specifies the physical layer (PHY) and medium
access control layer (MAC) of UWB technology based on
Multiband OFDM [3].
Several key issues must be solved to design a CMOS-based
Multiband OFDM UWB solution that meets the low-power
requirement. One of them concerns the FFT (Fast
Fourier Transform) block, which accounts for 25% of the design
complexity of the total digital baseband transceiver [4]. Although many
results have been published in this research area over the
past few years [5], [6], [7], the area and power consumption
of the FFT block still need to be improved, since the system
targets wireless portable devices. Therefore, this paper
focuses on reducing area and power consumption
under the ECMA-368 standard requirements. Section II
describes the requirements for the FFT block and the algorithm
on which the proposed design is based. Section III
presents the proposed FFT solution at the algorithm,
architecture, and implementation levels. Section
IV shows the synthesis results for both FPGA and
ASIC implementation, together with a comparison with other
published implementations.
II. BACKGROUND
A. The Requirements of FFT for Multiband OFDM System
According to ECMA-368, the required sampling frequency
is f_s = 528 MHz and the total number of subcarriers, which
determines the FFT size, is 128. The time available for
the IFFT and FFT is 242.42 ns, which is the FFT size multiplied
by the sampling period (T_FFT = 128 × (1/f_s)). There
are 37 zero-padded suffix samples, which take 70.08 ns. The
total symbol interval is therefore 312.5 ns (T_SYM = T_FFT + T_ZPS).
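The timing figures above can be reproduced with a few lines of arithmetic. The sketch below is only a check of the quoted numbers; the variable names are illustrative, and the last line assumes that a two-path design consumes two samples per clock cycle, which is how the 264 MHz clock quoted in the abstract relates to the 528 MHz sampling rate.

```python
# Timing budget implied by the ECMA-368 figures quoted in the text.
FS_HZ = 528e6   # sampling frequency
N_FFT = 128     # FFT size (number of subcarriers)
N_ZPS = 37      # zero-padded suffix samples

t_fft_ns = N_FFT / FS_HZ * 1e9   # time available for one FFT/IFFT
t_zps_ns = N_ZPS / FS_HZ * 1e9   # duration of the zero-padded suffix
t_sym_ns = t_fft_ns + t_zps_ns   # total symbol interval

# Assumption: two parallel data paths, i.e. two samples per clock cycle.
clock_mhz = FS_HZ / 2 / 1e6

print(f"T_FFT = {t_fft_ns:.2f} ns")   # 242.42 ns
print(f"T_ZPS = {t_zps_ns:.2f} ns")   # 70.08 ns
print(f"T_SYM = {t_sym_ns:.2f} ns")   # 312.50 ns
print(f"clock = {clock_mhz:.0f} MHz") # 264 MHz
```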
The choice of word length is a critical issue in FFT processor
design. It is determined directly by the trade-off between chip
area and signal-to-quantization-noise ratio (SQNR).
Based on the analysis in [5] and [8], the word
length is chosen to be 10 bits in this paper, both for simulation and
for comparison with those designs.
B. The Selection of FFT Algorithms
The traditional radix 2 FFT algorithms have a simple structure
and a clear data flow, which make them easy to implement and
suitable for generic FFT implementations. Nevertheless, these
algorithms need a large memory to store data at the inner stages,
which costs considerable power and area. Currently,
there are two trends in FFT implementation for OFDM systems:
mixed radix algorithms, such as [7], and pipeline
structure based algorithms, such as [9]. Based on extensive
algorithm analysis and selection, the proposed design employs
the Radix 2^2 algorithm developed by He and Torkelson [10],
which integrates the twiddle factor decomposition every two
stages. The Radix 2^2 algorithm has the same multiplicative
complexity as the radix 4 algorithm but retains the butterfly
structure of the radix 2 algorithm, which makes it very suitable for ASIC
implementation.
The detailed derivation of the algorithm can be found in [10]. Its
application to an 8 point FFT, shown in Figure 1, is used here
to explain the algorithm briefly. In this application the
Radix 2^2 algorithm is used only once, for the first two stages,
because an 8 point DFT can be decomposed only once by radix
4. For the last stage, the normal radix 2 DIF algorithm is used. With
the Radix 2^2 algorithm, the complex multiplication by the twiddle
factor in the first stage is replaced by a multiplication by (−j).
Therefore, in a pipeline structure, one complex multiplier can
be saved for the 8 point FFT.
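The decomposition can be illustrated with a short software model. The following is a minimal recursive sketch of a Radix 2^2 DIF FFT written from the index mapping in [10]; it is not the hardware data flow of this paper. The function names are hypothetical, and the output is placed in natural order for easy checking, whereas a pipeline produces bit reversed order. Note that the multiplication by (−j) is done by swapping the real and imaginary parts and negating one of them, so only the twiddles after every second stage need a real multiplier.

```python
import cmath


def mul_neg_j(z):
    """Multiply by -j without a multiplier: (a + jb)(-j) = b - ja."""
    return complex(z.imag, -z.real)


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def fft_radix22(x):
    """Recursive Radix 2^2 DIF FFT for power-of-two lengths (natural order)."""
    x = [complex(v) for v in x]
    n = len(x)
    if n == 1:
        return x
    if n == 2:  # leftover plain radix 2 stage, e.g. the third stage for N = 8
        return [x[0] + x[1], x[0] - x[1]]
    h, q = n // 2, n // 4
    # First butterfly stage: pairs at distance N/2.
    top = [x[i] + x[i + h] for i in range(h)]
    bot = [x[i] - x[i + h] for i in range(h)]
    # Trivial twiddle: lower half of the difference branch is multiplied by -j.
    bot[q:] = [mul_neg_j(v) for v in bot[q:]]
    out = [0j] * n
    for k1, branch in ((0, top), (1, bot)):
        # Second butterfly stage: pairs at distance N/4.
        for k2, sign in ((0, 1), (1, -1)):
            m = k1 + 2 * k2
            # Full twiddles W_N^(n3 * (k1 + 2*k2)) appear only after two stages.
            sub = [(branch[n3] + sign * branch[n3 + q])
                   * cmath.exp(-2j * cmath.pi * n3 * m / n) for n3 in range(q)]
            out[m::4] = fft_radix22(sub)  # X[k1 + 2*k2 + 4*k3]
    return out


x8 = [complex(i + 1, -0.5 * i) for i in range(8)]
err8 = max(abs(a - b) for a, b in zip(fft_radix22(x8), dft(x8)))
print(err8 < 1e-9)
```

For N = 8 the recursion performs the two Radix 2^2 stages once and then falls through to the two point case, which matches the description of Figure 1.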
Fig. 1. Radix 2^2 based parallel FFT algorithm data flow
III. THE PROPOSED PROCESSOR
The proposed processor is described at the algorithm,
architecture, and implementation levels in turn.
A. The Revision in the Algorithm Level
Analysis of the normal Radix 2^2 algorithm shows
that the input data can be separated into odd and
even parts, and that these parts are not mixed until the
last stage. This is one of the key points of the proposed processor:
the architecture can exploit it
to reduce both the working frequency and the number of registers.
The eight point FFT data flow in Figure 1 is used again to illustrate
this property. The dashed lines
show the odd input data flow, while the solid lines show the
even input data flow. In the first and second stages, the dashed
and solid lines never cross, which means that
the even and odd input data can be processed separately in
these stages. Only in the last stage do the dashed
and solid lines cross, which means that the even
and odd data must be combined there.
The 128 point parallel algorithm data flow, with the twiddle
factor positions, is shown in Figure 2. The horizontal lines
are omitted. The input data and twiddle factors are
separated into even and odd parts, which are processed
separately through the first six stages and combined only
in the final, seventh stage. Note that the output data are
produced in bit reversed order.
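The even/odd separation can be checked in software. The sketch below is an illustration under simplifying assumptions rather than the processor itself: it uses the plain radix 2 DIF twiddle placement, which has the same butterfly connections as the Radix 2^2 flow graph, and the names `dif_stage` and `split_path_fft` are hypothetical. It runs all six early stages on the even positions first and only then on the odd positions, which would give a wrong result if the two paths depended on each other before the last stage.

```python
import cmath


def dif_stage(v, d, parity=None):
    """One in-place radix 2 DIF stage with butterfly distance d.

    If parity is 0 or 1, only butterflies on even or odd positions are computed.
    """
    for start in range(0, len(v), 2 * d):
        for i in range(d):
            top = start + i
            if parity is not None and top % 2 != parity:
                continue
            a, b = v[top], v[top + d]
            v[top] = a + b
            v[top + d] = (a - b) * cmath.exp(-2j * cmath.pi * i / (2 * d))


def split_path_fft(x):
    """DIF FFT that finishes the whole even path before starting the odd path."""
    v = [complex(t) for t in x]
    dists = []
    d = len(v) // 2
    while d >= 2:            # all stages whose butterfly distance is even
        dists.append(d)
        d //= 2
    for parity in (0, 1):    # even path first, then odd path
        for d in dists:
            dif_stage(v, d, parity)
    dif_stage(v, 1)          # final stage: even and odd positions meet
    return v                 # result is in bit reversed order


def bit_reverse(k, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def dft(x):
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


n = 128
x = [complex((3 * i) % 7 - 3, (5 * i) % 11 - 5) for i in range(n)]
y = split_path_fft(x)
ref = dft(x)
bits = n.bit_length() - 1
max_err = max(abs(y[bit_reverse(k, bits)] - ref[k]) for k in range(n))
print(max_err < 1e-9)
```

For N = 128 the list of even butterfly distances is 64, 32, 16, 8, 4, 2, that is, exactly the first six stages; only the seventh stage, with distance 1, pairs an even position with an odd one.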
B. Architecture Level
The previous analysis shows that two-path parallelism
in the first six stages suits the structure design, because
these six stages can process the even and odd input data
separately and only the last, seventh stage needs to combine
them. Nevertheless, this architecture brings some extra
requirements. First, a demultiplexer
is required to separate the input data into even and odd
parts. Second, although the controller can be shared by the
even and odd paths, special care must be taken to generate
the right control signals for the last stage, so that the even
and odd parts are combined correctly.
Fig. 2. The 128 point parallel Radix 2^2 based algorithm data flow
Figure 3 shows the proposed parallel pipeline architecture.
It has seven stages and consists of demultiplexers, circular
buffers, ROMs, complex multipliers, and butterfly units.
BF1 denotes butterfly type 1, which consists of four 2-to-1
multiplexers and four adders. BF2 denotes butterfly type 2,
which additionally swaps the real and imaginary parts and
inverts the sign, as required by the (−j) multiplication of the Radix
2^2 algorithm. The input data are first streamed in and split
by the demultiplexer. They are then processed in the even and odd
parts of the architecture, where the dashed arrows show
the flow of the odd data and the solid arrows show the flow of the even
data. In each of the odd and even parts, a single-path delay feedback
(SDF) pipeline structure [10] processes the data separately.
Three controllers produce the control signals
and the addresses for reading the twiddle factors from the ROMs.
The even and odd parts of each stage share the same controller.
The architecture contains five complex multipliers. In
the sixth stage, the even part outputs need neither multiplication
nor twiddle factor storage, as can be seen in Figure 3.
The reason is that, after the twiddle factor separation in this stage,
all twiddle factors in the even part become the constant 1.
Therefore, no multiplication is required.
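The multiplier count can be cross-checked from the twiddle exponents alone. The sketch below assumes the standard Radix 2^2 twiddle set W_M^(n3·(k1+2·k2)) from [10] after the second stage of each two-stage block, with sub-FFT size M = 128, 32, 8, and assumes that the even and odd paths correspond to even and odd n3. The function name is hypothetical. A path is taken to need a complex multiplier only if at least one of its twiddles differs from 1.

```python
def multiplier_map(n_fft=128):
    """For each two-stage block, report whether each path needs a multiplier."""
    result = {}
    m, block = n_fft, 1
    while m >= 4:
        need = {"even": False, "odd": False}
        for n3 in range(m // 4):
            path = "even" if n3 % 2 == 0 else "odd"
            for k1 in (0, 1):
                for k2 in (0, 1):
                    # W_m^(n3*(k1+2*k2)) equals 1 iff the exponent is 0 mod m.
                    if (n3 * (k1 + 2 * k2)) % m != 0:
                        need[path] = True
        result[block] = need
        m //= 4
        block += 1
    return result


mm = multiplier_map(128)
total = sum(flag for need in mm.values() for flag in need.values())
print(mm[3])   # {'even': False, 'odd': True}
print(total)   # 5
```

In the third block only n3 = 0 and n3 = 1 remain, so the even path sees the exponent 0 exclusively, which is consistent with the missing multiplier and ROM in the even part of the sixth stage.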
C. Implementation Level
As can be seen from Figure 3, there are seven stages. Given
the required control, it is advantageous to combine
stages 1 and 2, stages 3 and 4, and stages 5 and 6 into three common
controller blocks. These common controller blocks all have the
structure shown in Figure 4. The whole parallel
architecture can therefore be divided into the three common
controller blocks, the last block, and the arithmetic blocks. The
arithmetic blocks are composed of five ROMs and complex
multipliers.
Fig. 3. The parallel Radix 2^2 based pipeline architecture
Fig. 4. The common controller block
The basic idea of the data flow in these common controller
blocks is that the first stage of a block repeats after processing
N/2^(2r) data and the second stage repeats after processing
N/2^(2r+1) data, where r (r = 1, 2, 3) is the index of the common
controller block and N is the FFT size. Only one counter is used
to produce control signals I and II for the two stages. In the
first common controller block, control signal I is first set to
zero so that N/4 data are read into stage 1; in the
following N/4 cycles, control signal I is set to one to enable
the butterfly function in stage 1. At the same time, stage 2
reads in N/8 outputs of stage 1 under the control of
control signal II. In the next N/8 cycles, control signal II
equals one and butterfly II of stage 2 operates. The data flow
analysis is shown in Figure 5.
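The control scheme can be modelled with a single counter. The sketch below rests on assumptions: it takes the periods of the two stages of block r as N/2^(2r) and N/2^(2r+1), which agrees with the N/4 and N/8 given for the first block, derives both control signals as bits of one free-running counter, and ignores pipeline register latency. The function name `control_signals` is hypothetical.

```python
def control_signals(n_fft, r, cycles):
    """Return (control I, control II) per clock cycle for controller block r."""
    d1 = n_fft // 2 ** (2 * r)       # read/butterfly period of the first stage
    d2 = n_fft // 2 ** (2 * r + 1)   # read/butterfly period of the second stage
    # 0 = fill the feedback buffer, 1 = butterfly active.
    return [((count // d1) % 2, (count // d2) % 2) for count in range(cycles)]


sig = control_signals(128, 1, 64)
for count in (0, 32, 47, 48, 63):
    print(count, sig[count])
# 0  (0, 0)  stage 1 reads its first N/4 = 32 data
# 32 (1, 0)  stage 1 butterfly active, stage 2 reads N/8 = 16 data
# 47 (1, 0)
# 48 (1, 1)  stage 2 butterfly active for the next N/8 cycles
# 63 (1, 1)
```

Because both signals are taken from the same counter, control signal II also toggles while stage 1 is still filling its buffer; in this model those cycles simply do not carry valid data for the current frame.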
Fig. 5. The operation modes of the block
The last block includes only the seventh stage. Because the
odd and even data need to be combined, two demultiplexers
seem to be required to switch the data, as shown in
Figure 3. However, this can be improved by analyzing the
scheduling of the last stage. Only one
butterfly is working per clock cycle, and the first output
of the even path of the 6th stage is processed together with the first
output of the odd path. As long as the timing is matched,
each even path output is processed with the corresponding odd path
output. Therefore, the two demultiplexers are
unnecessary and only one butterfly is required in the last stage
to process the data. The modified structure of the last stage
and its interface with the previous stage are shown in Figure 6.
Fig. 6. The improved version of the 7th stage
IV. IMPLEMENTATION AND RESULT ANALYSIS
A. FPGA Implementation
The proposed design is synthesized and implemented with
Xilinx ISE, targeting a Xilinx Virtex4 FPGA.
The arithmetic blocks are mapped directly to
DSP48 components of the Xilinx Virtex4. Table I lists the
performance of the proposed implementation and compares
it with [7]. The table clearly shows the reduced resource count
of the proposed design compared with the implementation in
[7]. The reason is that the proposed design employs far fewer
memory blocks and complex multipliers.
TABLE I
THE COMPARISON WITH [7]
                                  [7]      proposed
Word length (bits)                11       10
Total Number Slice Registers      7390     717*
Number used as Flip Flops         3860     457
Total Number of 4 input LUTS      12749    2230
Number of DSP48s                  48       20
The word length used is shorter than in [7]. However, even when
the word length of the proposed design is increased to 15 bits, the total
equivalent gate count is still much lower than in [7]. At 15 bits,
the total numbers of slice registers, 4 input LUTs, and DSP48s of
the proposed design are 1052, 3600, and 20 respectively.
B. ASIC targeted results
The proposed design is also synthesized with Synopsys Design
Compiler, targeting an ASIC implementation.
The synthesis library is the Faraday 90 nm standard cell library
[11], which is tailored for the UMC 90 nm logic LL-RVT (lowK)
process. During the implementation stage of our processor, [8]
was published, which employs a similar parallel structure.
However, there are some key differences between the two
architectures. The most important differences are in the first
and last stages, where the proposed design reduces the number
of shift registers and the latency of the processor. Table II
lists the performance of the proposed implementation and
compares it with other state-of-the-art designs. The table shows
that the number of gates used by the proposed design is only
55% of that in [8]. If a 180 nm design is scaled linearly to 90
nm, its area shrinks by a factor of 4. Hence, the design of
[12] in 180 nm would correspond to an area of 616595.5 μm^2 in
90 nm technology, which is still much larger than that of the proposed
design.
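The scaling argument is simple arithmetic and can be reproduced as follows. The numbers are taken from Table II; the variable names are illustrative, and ideal linear scaling of both dimensions is the assumption stated in the text, not a property of any real process migration.

```python
# Figures from Table II.
AREA_REF12_180NM_UM2 = 2466382   # area of [12] in 0.18 um technology
AREA_PROPOSED_UM2 = 181140       # area of the proposed design in 90 nm
GATES_PROPOSED = 38540
GATES_REF8 = 70000

# Ideal linear shrink from 180 nm to 90 nm: both dimensions halve.
area_factor = (180 / 90) ** 2
area_ref12_scaled = AREA_REF12_180NM_UM2 / area_factor
gate_ratio = GATES_PROPOSED / GATES_REF8
area_ratio = area_ref12_scaled / AREA_PROPOSED_UM2

print(area_ref12_scaled)             # 616595.5
print(f"{gate_ratio:.0%}")           # 55%
print(f"{area_ratio:.1f}x larger")   # 3.4x larger
```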
TABLE II
THE ASIC IMPLEMENTATION COMPARISON
                                  proposed implementation   [8]               [12]
Technology                        90 nm, 1 V                0.18 μm, 1.8 V    0.18 μm, 1.8 V
Clock frequency (MHz)             264                       450               250
Parallel data format              2 data-path               2 data-path       4 data-path
Algorithm                         Radix 2^2                 Radix 2^4         Mixed Radix
Word length (bits)                10                        10                10
Complex multipliers               5                         2+0.41            2+2.48
Registers                         128                       190               -
Gates                             38540                     70000             -
Area (μm^2)                       181140                    -                 2466382
Area (μm^2) scaled for 90 nm      181140                    -                 616595.5
V. CONCLUSION
In this paper, a novel parallel pipeline FFT processor is
designed for the ECMA-368 standard. Our architecture is
based on a revised version of the Radix 2^2 algorithm. Our
revision amounts to restructuring the associated signal flow
graph into an even and an odd part. As such, it achieves not only
the low multiplier count of the standard Radix 2^2 algorithm but
also a 50% reduction of the clock frequency and the lowest
circular buffer count compared with traditional SDF
architectures. Both FPGA and ASIC targeted synthesis results of this
architecture are presented. The results show that the proposed
design dramatically reduces the required area.
REFERENCES
[1] INTEL, “Ultra-wideband (UWB) technology,” http://www.intel.com/technology/comms/uwb/.
[2] A. Batra et al., “Multi-band OFDM physical layer proposal for IEEE 802.15 Task Group 3a,” Tech. Rep., IEEE P.802.15-04/0493r0, 2004.
[3] Standard ECMA-368: High Rate Ultra Wideband PHY and MAC Standard, 2nd Edition.
[4] A. Batra, J. Balakrishnan, G. Aiello, J. Foerster, and A. Dabak, “Design of a multiband OFDM system for realistic UWB channel environments,” Microwave Theory and Techniques, IEEE Transactions on, vol. 52, no. 9, pp. 2123–2138, Sept. 2004.
[5] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” Solid-State Circuits, IEEE Journal of, vol. 40, no. 8, pp. 1726–1735, Aug. 2005.
[6] R. Chidambaram, “A scalable and high-performance FFT processor, optimized for UWB-OFDM,” Master’s thesis, Delft University of Technology, 2005.
[7] N. Rodrigues, H. Neto, and H. Sarmento, “A OFDM module for a MB-OFDM receiver,” Design & Technology of Integrated Systems in Nanoscale Era, 2007. DTIS. International Conference on, pp. 25–29, Sept. 2007.
[8] J. Lee and H. Lee, “A High-Speed Two-Parallel Radix-2^4 FFT/IFFT Processor for MB-OFDM UWB Systems,” IEICE Trans Fundamentals, vol. E91-A, no. 4, pp. 1206–1211, 2008.
[9] E. Saberinia, K. C. Chang, G. Sobelman, and A. H. Tewfik, “Implementation of a Multi-band Pulsed-OFDM Transceiver,” J. VLSI Signal Process. Syst., vol. 43, no. 1, pp. 73–88, 2006.
[10] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” Parallel Processing Symposium, 1996., Proceedings of IPPS ’96, The 10th International, pp. 766–770, Apr 1996.
[11] FARADAY, FSD0A A 90 nm Logic SP-RVT(Low-K) Process. FARADAY Technology Corporation, 2006.
[12] T. Chakraborty and S. Chakrabarti, “A reduced area 1 GSPS FFT design using MRMDF architecture for UWB communication,” in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on, Nov. 30-Dec. 3, 2008, pp. 1128–1131.