Page 1: [IEEE 2007 IEEE International Conference on Electro/Information Technology - Chicago, IL, USA (2007.05.17-2007.05.20)] 2007 IEEE International Conference on Electro/Information Technology

IEEE EIT 2007 Proceedings 345.

1-4244-0941-1/07/$25.00 c©2007 IEEE

Efficient Memory-Based FFT Processors for OFDM Applications

Chin-Long Wey, Shin-Yo Lin, and Wei-Chien Tang
Department of Electrical Engineering, National Central University, Jhongli, Taiwan

e-mail: [email protected]; URL: www.ee.ncu.edu.tw/~clwey

Abstract- This paper presents radix-2 memory-based FFT (MBFFT) processors. Taking advantage of the low hardware cost of MBFFT architectures, this study improves their speed performance. The improvement is achieved by an efficient memory retrieval scheme that reduces the control complexity and a clock scheme with parallel structures that reduces the cycle time and latency. Instead of using dual-port memories for data storage and retrieval, our designs use single-port memories with pre-fetch registers to reduce the hardware cost. Based on the pre-layout simulation results, the core area of the developed MBFFT is 2.04 mm² with a maximal work frequency of 198 MHz for N=8192 points (24 bits per word).

I. Introduction

OFDM (Orthogonal Frequency Division Multiplexing) [1,2], a form of multi-carrier modulation technology, is a special case of multi-carrier transmission in which a single data stream is transmitted over a number of lower-rate sub-carriers. The OFDM technique has been widely implemented in high-speed digital communications to increase the robustness against frequency-selective fading and narrowband interference. It is also used for wideband data communications over mobile radio FM channels, xDSL, DAB, and DVB-T/H. In these applications, efficient FFT (Fast Fourier Transform) processors are required for real-time operation.

FFT architectures can be classified into two categories: (1) pipelined architectures [3-8]; and (2) memory-based architectures [9-13]. Taking advantage of structural regularity in VLSI implementation, the pipelined architecture employs more Processing Elements (PEs) to achieve higher performance than its counterpart. On the other hand, the memory-based architecture requires only one butterfly PE, as well as some memory blocks for storing input and intermediate data, to perform the real-time operation. Because of their high speed and low control complexity, pipelined architectures are commonly used in many applications at the cost of increased chip area.

Taking advantage of the low hardware cost of memory-based FFT (MBFFT) architectures, this study aims to improve their speed performance. The improvement is achieved by an efficient memory retrieval scheme that reduces the control complexity and a clock scheme with parallel structures that reduces the cycle time and latency. Instead of using dual-port memories for data storage and retrieval, our designs use single-port memories with pre-fetch registers to reduce the hardware cost. The use of single-ported memories also contributes to the cycle time reduction. In general, latency is also an important parameter for OFDM applications. This study demonstrates the improvement in latency obtained with multiple PEs and also addresses the design issues.

II. Memory-Based FFT Processors

A. Basic FFT Operation

The DFT (Discrete Fourier Transform) of a finite-length sequence of length N is

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

where $W_N^{kn} = e^{-j(2\pi/N)nk}$. Note that X[k] and x[n] may be complex numbers. Consider the implementation of the radix-2 decimation-in-frequency (DIF) FFT process. For any integer r = 0, 1, ..., (N/2)-1,

$$X[2r] = \sum_{n=0}^{(N/2)-1} \left( x[n] + x[n + N/2] \right) W_{N/2}^{nr} \tag{2}$$

and

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \left( x[n] - x[n + N/2] \right) W_N^{n}\, W_{N/2}^{nr} \tag{3}$$

A butterfly unit is used to compute the summands of both (2) and (3) from x[n] and x[(N/2)+n] as

$$y[n] = x[n] + x[(N/2)+n] \tag{4}$$

$$y[(N/2)+n] = \left( x[n] - x[(N/2)+n] \right) W_N^{n} \tag{5}$$

Figure 1 shows the signal flow graph of the 8-point FFT.

Figure 1. Signal Flow Graph for Radix-2 FFT with N=8.

B. Memory-Based FFT (MBFFT) Architectures

A typical MBFFT architecture is shown in Figure 2, in which five dual-ported (N/4)-word RAMs are employed. Note that RAM5 is mainly used as a buffer for temporarily storing the computed data. Thus, the architecture requires a total memory size of 1.25N words, with a latency of (N/2)+(N/2)log₂N cycles for an N-point FFT operation [11].

FFT operation with N=8 (Figure 3(b)):

| CLK | Input | RAM1 read | RAM2 read | Operations | RAM1 write | RAM2 write | Outputs |
|---|---|---|---|---|---|---|---|
| 1 | x0 | | | | x0 | | |
| 2 | x1 | | | | x1 | | |
| 3 | x2 | | | | x2 | | |
| 4 | x3 | | | | x3 | | |
| 5 | x4 | x0 | | b0=x0+x4; b4=(x0-x4)*w0 | b0 | b4 | |
| 6 | x5 | x1 | | b1=x1+x5; b5=(x1-x5)*w1 | b1 | b5 | |
| 7 | x6 | x2 | | b2=x2+x6; b6=(x2-x6)*w2 | b6 | b2 | |
| 8 | x7 | x3 | | b3=x3+x7; b7=(x3-x7)*w3 | b7 | b3 | |
| 9 | | b0 | b2 | a0=b0+b2; a2=(b0-b2)*w0 | a0 | a2 | |
| 10 | | b1 | b3 | a1=b1+b3; a3=(b1-b3)*w2 | a3 | a1 | |
| 11 | | b6 | b4 | a4=b4+b6; a6=(b4-b6)*w0 | a6 | a4 | |
| 12 | | b7 | b5 | a5=b5+b7; a7=(b5-b7)*w2 | a5 | a7 | |
| 13 | y0 | a0 | a1 | z0=a0+a1; z1=(a0-a1)*w0 | y0 | | z0, z1 |
| 14 | y1 | a3 | a2 | z2=a2+a3; z3=(a2-a3)*w0 | y1 | | z3, z2 |
| 15 | y2 | a6 | a7 | z6=a6+a7; z7=(a6-a7)*w0 | y2 | | z6, z7 |
| 16 | y3 | a5 | a4 | z4=a4+a5; z5=(a4-a5)*w0 | y3 | | z5, z4 |

Since the memories in an MBFFT processor dominate the entire chip area, where the area ratio can be as high as 85%, we developed an alternative MBFFT processor [12] with a total memory size of N words and a latency of (N/4)+(N/2)log₂N cycles for an N-point FFT operation, as shown in Figure 3(a). Note that the architecture employs two (N/2)-word memories, namely, RAM1 and RAM2. Basically, the first N/2 input data, x[0], x[1], ..., x[N/2-1], are loaded and stored in RAM1. Then the data x[N/2+k], k=0,1,...,(N/2)-1, loaded from the input, and x[k], read from RAM1 at address “k”, undergo the following operations:

$$b[k] = x[k] + x[N/2+k]; \qquad b[N/2+k] = \left( x[k] - x[N/2+k] \right) W_N^{k};$$

where b[k] and b[N/2+k] are stored at address “k” of RAM1 and RAM2, respectively, as shown in Figure 3(b) for N=8.
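The butterfly operations above can be checked numerically. The following is a minimal software sketch, not the hardware datapath: it applies the butterfly of (4) and (5) stage by stage and compares the result with a direct evaluation of (1). The function names `butterfly`, `dif_fft`, and `naive_dft` are illustrative choices and do not come from the paper.

```python
import cmath


def butterfly(x0, x1, w):
    """Radix-2 DIF butterfly: sum on one branch, twiddled difference on the other."""
    return x0 + x1, (x0 - x1) * w


def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT; returns X[k] in natural order."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    data = [complex(v) for v in x]
    span = n // 2  # distance between the two butterfly operands
    while span >= 1:
        block = 2 * span
        for start in range(0, n, block):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / block)  # W_block^k
                i, j = start + k, start + k + span
                data[i], data[j] = butterfly(data[i], data[j], w)
        span //= 2
    # DIF leaves the results in bit-reversed order; put them back in order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = data[i]
    return out


def naive_dft(x):
    """Direct evaluation of the DFT definition, used only as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, -1, -2, -3]
    error = max(abs(a - b) for a, b in zip(dif_fft(samples), naive_dft(samples)))
    print(f"max deviation from direct DFT: {error:.2e}")
```

In the first pass of the loop, `block` equals N, so the twiddle factor is W_N^k and the butterfly computes exactly b[k] and b[N/2+k] as defined above.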

For example, in clock #5, “b0=x0+x4; b4=(x0-x4)*w0”, the data x4 is loaded from the input and x0 is read from RAM1 (R1) at address “00”. Then both data are processed and the resultant values, b0 and b4, are stored into R1 and R2 at address “00”, respectively. In clock #10, “a1=b1+b3; a3=(b1-b3)*w2”, data b1 and b3 are loaded from R1 at “01” and R2 at “11”, and the resultant values, a3 and a1, are stored into R1 at “01” and R2 at “11”, respectively. In this implementation, the processed data may be stored back to the address from which it was loaded, or swapped to the location from which the other datum was loaded. In other words, in each clock cycle, the two data to be processed are located in different RAMs, and only one address in each RAM is enabled.

Figures 4(a) and 4(b) present the control signals for the MUXs and the memory access, respectively. In Figure 4(b), RA/WA stands for the address to be accessed for a read (RA) or a write (WA). There are two columns under RA/WA, where “1” and “2” represent RAM1 (R1) and RAM2 (R2), respectively. Note that DO and DI are the data output and data input of the memories, respectively.

Figure 4(c) shows the data flow, where two RAMs are used to store the processed data and each RAM contains only four words, at locations 00, 01, 10, and 11. The upper half describes the data retrieval in RAM1, where the memory addresses are given right after the inputs. The two data to be processed are located in different RAMs in Stage i, and the resultant data are stored at locations in different RAMs in Stage (i+1). One can easily extend the patterns to any N.
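The property that the two operands of every butterfly sit in different RAMs can be checked mechanically for N=8. The sketch below is only an illustration: the two layouts are transcribed from the write columns of the N=8 schedule in Figure 3(b), while the helper names (`butterfly_pairs`, `locate`, `access_plan`) are hypothetical and not part of the paper.

```python
# Data index held at addresses 00, 01, 10, 11 of each RAM, transcribed from
# the N=8 schedule of Figure 3(b) (after clocks 5-8 and after clocks 9-12).
LAYOUT_AFTER_STAGE1 = {"RAM1": [0, 1, 6, 7], "RAM2": [4, 5, 2, 3]}
LAYOUT_AFTER_STAGE2 = {"RAM1": [0, 3, 6, 5], "RAM2": [4, 7, 2, 1]}


def butterfly_pairs(n, span):
    """Index pairs (i, i + span) combined by one radix-2 DIF stage."""
    return [(s + k, s + k + span) for s in range(0, n, 2 * span) for k in range(span)]


def locate(layout, index):
    """Return (ram_name, address) of the word holding the given data index."""
    for ram, contents in layout.items():
        if index in contents:
            return ram, contents.index(index)
    raise KeyError(index)


def access_plan(layout, pairs):
    """For each butterfly, return (RAM1 address, RAM2 address) to be read.

    Raises AssertionError if both operands would need the same RAM.
    """
    plan = []
    for i, j in pairs:
        (ram_i, addr_i), (ram_j, addr_j) = locate(layout, i), locate(layout, j)
        assert ram_i != ram_j, f"operands {i} and {j} collide in {ram_i}"
        reads = {ram_i: addr_i, ram_j: addr_j}
        plan.append((reads["RAM1"], reads["RAM2"]))
    return plan


stage2 = access_plan(LAYOUT_AFTER_STAGE1, butterfly_pairs(8, 2))
stage3 = access_plan(LAYOUT_AFTER_STAGE2, butterfly_pairs(8, 1))

if __name__ == "__main__":
    for name, plan in (("stage 2", stage2), ("stage 3", stage3)):
        print(name, [(format(a, "02b"), format(b, "02b")) for a, b in plan])
```

The stage-2 address pairs come out in the same order as the RA/WA entries for clocks 9 to 12. For stage 3 the same address pairs appear, but this sketch enumerates butterflies by data index, whereas the schedule in the paper walks through the RAM1 addresses in counter order.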

In this implementation, a period of N/2 clock cycles is defined as a phase. For N=8, each phase contains 4 clock cycles, and a two-bit counter (w1w0) is used to count the clock cycles in each phase. Interestingly, the control signals in each phase can be represented in terms of the counter bits, i.e., w1 and w0. Note that the 16 cycles in Figure 4(a) can be divided into four phases, and the control signals for the MUXs, RAMs, and ROM in each phase can be derived as shown in Figures 4(d) and 4(e). Thus, the control circuitry is built around the 2-bit counter. Note that the control signals can be synthesized using random logic. However, for structural simplicity and regularity, we use a simple finite state machine (FSM) to realize the control signals. With that implementation, the FSM can be easily extended to FFT operations with any N. This is one of the salient features of the developed MBFFT processor. Figure 5 shows the control signals of both the MUXs and the memory access for N=2048, where there are 12 phases, each phase contains 1024 cycles, and a 10-bit counter (w9w8...w1w0) is employed.

Figure 2. A Typical MBFFT Architecture [11].

Figure 3. An MBFFT Architecture [12]: (a) Schematic; and (b) FFT Operation with N=8.

(a) Control signals for MUXs, Counter=(w9w8w7w6w5w4w3w2w1w0):

| CLK | Phase | ma | m3 | m4 | m1 | m2 | mb |
|---|---|---|---|---|---|---|---|
| 1-1024 | 0 | 0 | 00 | x | x | xx | 0 |
| 1025-2048 | 1 | 1 | w9w9 | w9 | 0 | 10 | 0 |
| 2049-3072 | 2 | x | w8w8 | w8 | w9 | 0w9 | 0 |
| 3073-4096 | 3 | x | w7w7 | w7 | w8 | 0w8 | 0 |
| 4097-5120 | 4 | x | w6w6 | w6 | w7 | 0w7 | 0 |
| 5121-6144 | 5 | x | w5w5 | w5 | w6 | 0w6 | 0 |
| 6145-7168 | 6 | x | w4w4 | w4 | w5 | 0w5 | 0 |
| 7169-8192 | 7 | x | w3w3 | w3 | w4 | 0w4 | 0 |
| 8193-9216 | 8 | x | w2w2 | w2 | w3 | 0w3 | 0 |
| 9217-10240 | 9 | x | w1w1 | w1 | w2 | 0w2 | 0 |
| 10241-11264 | 10 | x | w0w0 | w0 | w1 | 0w1 | 0 |
| 11265-12288 | 11 | 0 | 00 | 0 | w0 | 0w0 | 1 |

(b) Control signals for memory access (one row per phase):

| Phase | RAM1 Address | RAM2 Address | R_en R1 | R_en R2 | ROM R-en | ROM Addr |
|---|---|---|---|---|---|---|
| 0 | w9..w1w0 | xxxxxxxxxx | 1 | 1 | 1 | xxxxxxxxxx |
| 1 | w9..w1w0 | w9w8w7w6w5...w0 | 0 | 1 | 0 | w9w8...w3w2w1w0 |
| 2 | w9..w1w0 | w9w8w7w6w5...w0 | 0 | 0 | 0 | w8w7...w2w1w0 0 |
| 3 | w9..w1w0 | w9w8w7w6w5...w0 | 0 | 0 | 0 | w7w6...w2w1w0 00 |
| 4 | w9..w1w0 | w9w8w7w6w5...w0 | 0 | 0 | 0 | w6w5...w1w0 000 |
| 5 | w9..w1w0 | w9w8w7w6w5..w0 | 0 | 0 | 0 | w5w4...w1w0 0000 |
| 6 | w9..w1w0 | w9w8...w5w4....w0 | 0 | 0 | 0 | w4w3...w0 00000 |
| 7 | w9..w1w0 | w9w8...w4w3....w0 | 0 | 0 | 0 | w3w2..w0 000000 |
| 8 | w9..w1w0 | w9w8...w3w2w1w0 | 0 | 0 | 0 | w2w1w0 0000000 |
| 9 | w9..w1w0 | w9w8w7...w2w1w0 | 0 | 0 | 0 | w1w0 00000000 |
| 10 | w9..w1w0 | w9w8w7...w2w1w0 | 0 | 0 | 0 | w0 000000000 |
| 11 | w9..w1w0 | w9w8w7...w2w1w0 | 0 | 0 | 0 | 0000000000 |

Figure 5(a), (b). Control Signals of MBFFT with N=2048 for MUXs and for Memory Access.
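The phase and counter structure described above can be sketched in a few lines. This is a hedged illustration rather than the authors' FSM: the function names are hypothetical, and reading the ROM "Addr" column as the counter shifted left by (phase-1) bits, with zeros filled in and the result truncated to the counter width, is an interpretation of the control tables, not a statement taken from the paper.

```python
def phase_count(n):
    """Number of phases for an N-point transform (N a power of two):
    one load phase plus log2(N) butterfly phases."""
    assert n >= 2 and n & (n - 1) == 0, "N must be a power of two"
    return n.bit_length()  # log2(N) + 1


def decode_cycle(clk, n):
    """Map a 1-based clock number to (phase, counter value).

    Each phase lasts N/2 cycles and the counter runs from 0 to N/2 - 1.
    """
    half = n // 2
    return (clk - 1) // half, (clk - 1) % half


def rom_address(phase, counter, n):
    """Twiddle ROM address under the assumed shift pattern.

    Phase 0 only loads input data, so the address is a don't-care (None).
    """
    if phase == 0:
        return None
    return (counter << (phase - 1)) & (n // 2 - 1)


if __name__ == "__main__":
    n = 8
    for clk in range(1, 2 * n + 1):
        phase, counter = decode_cycle(clk, n)
        print(clk, phase, format(counter, "02b"), rom_address(phase, counter, n))
```

For N=8 this reproduces the twiddle indices of the schedule: w0 to w3 in clocks 5 to 8, the sequence w0, w2, w0, w2 in clocks 9 to 12, and w0 throughout clocks 13 to 16. For N=2048 it gives 12 phases of 1024 cycles each, in line with the text.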

Our simulation results show that the developed MBFFT processor takes (N/2)+(N/2)log₂N cycles to complete the first set of data. Because of the data overlapping shown in Figure 3(b), the MBFFT in Figure 3(a) takes (N/2)log₂N cycles to complete each following set of data. Thus, the average latency over two data sets is (N/4)+(N/2)log₂N.

Figure 5(c) shows the layout view of the MBFFT with N=8192, where the memories and circuitry are synthesized by the Artisan tool and Design Compiler, respectively, using the TSMC 0.18 µm 1P6M digital CMOS process. Note that two 4K dual-ported RAMs are employed, where each word contains 24 bits, i.e., 12 bits for each of the real and imaginary parts of a complex number. Experimental results show that the maximum work frequency is approximately 117 MHz and the core area is 4.12 mm².

Note that the total area of the two 4K dual-ported RAMs in Figure 5(c) is approximately 82.03% of the total area, or 3.38 mm². The ROM takes approximately 10.25%. On the other hand, the PE and control logic (including MUXs) require only 2.13% and 0.71%, respectively. The data show that the control logic is very simple for this implementation, while the area of the PE is not significant.

The goal of the present paper is to develop a low-cost FFT processor for DVB-T applications. Higher work frequency and smaller area were given the highest priority.

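(a) Control signals for MUXs:

| CLK | ma | m1 | m2 | mb | m3 | m4 |
|---|---|---|---|---|---|---|
| 1 | 0 | x | xx | 0 | 10 | x |
| 2 | 0 | x | xx | 0 | 10 | x |
| 3 | 0 | x | xx | 0 | 10 | x |
| 4 | 0 | x | xx | 0 | 10 | x |
| 5 | 1 | 0 | 10 | 0 | 01 | 0 |
| 6 | 1 | 0 | 10 | 0 | 01 | 0 |
| 7 | 1 | 0 | 10 | 0 | 00 | 1 |
| 8 | 1 | 0 | 10 | 0 | 00 | 1 |
| 9 | x | 0 | 00 | 0 | 01 | 0 |
| 10 | x | 0 | 00 | 0 | 00 | 1 |
| 11 | x | 1 | 01 | 0 | 01 | 0 |
| 12 | x | 1 | 01 | 0 | 00 | 1 |
| 13 | 0 | 0 | 00 | 1 | 10 | x |
| 14 | 0 | 1 | 01 | 1 | 10 | x |
| 15 | 0 | 0 | 00 | 1 | 10 | x |
| 16 | 0 | 1 | 01 | 1 | 10 | x |

(b) Memory access:

| CLK | Input | RA/WA 1 | RA/WA 2 | R1 DO | R2 DO | R1 DI | R2 DI |
|---|---|---|---|---|---|---|---|
| 1 | x0 | 00 | 00 | | | x0 | |
| 2 | x1 | 01 | 01 | | | x1 | |
| 3 | x2 | 10 | 10 | | | x2 | |
| 4 | x3 | 11 | 11 | | | x3 | |
| 5 | x4 | 00 | 00 | x0 | | b0 | b4 |
| 6 | x5 | 01 | 01 | x1 | | b1 | b5 |
| 7 | x6 | 10 | 10 | x2 | | b6 | b2 |
| 8 | x7 | 11 | 11 | x3 | | b7 | b3 |
| 9 | | 00 | 10 | b0 | b2 | a0 | a2 |
| 10 | | 01 | 11 | b1 | b3 | a3 | a1 |
| 11 | | 10 | 00 | b6 | b4 | a6 | a4 |
| 12 | | 11 | 01 | b7 | b5 | a5 | a7 |
| 13 | y0 | 00 | 11 | a0 | a1 | | |
| 14 | y1 | 01 | 10 | a3 | a2 | | |
| 15 | y2 | 10 | 01 | a6 | a7 | | |
| 16 | y3 | 11 | 00 | a5 | a4 | | |

(c) Signal flow graph: data index order per stage, 01234567, 0123, 01674523, 03652147, 03651274, across RAM1 and RAM2 at addresses 00, 01, 10, 11.

(d) Simplified control signals for MUXs, Counter=(w1w0):

| CLK | Phase | ma | m3 | m4 | m1 | m2 | mb |
|---|---|---|---|---|---|---|---|
| 1-4 | 0 | 0 | 00 | x | x | xx | 0 |
| 5-8 | 1 | 1 | w1w1 | w1 | 0 | 10 | 0 |
| 9-12 | 2 | x | w0w0 | w0 | w1 | 0w1 | 0 |
| 13-16 | 3 | 0 | 00 | 0 | w0 | 0w0 | 1 |

(e) Simplified control signals for memory access (one row per phase):

| Phase | R1 Address | R2 Address | R_en R1 | R_en R2 | ROM R-en | ROM Addr |
|---|---|---|---|---|---|---|
| 0 | w1w0 | xx | 1 | 1 | 1 | xx |
| 1 | w1w0 | w1w0 | 0 | 1 | 0 | w1w0 |
| 2 | w1w0 | w1w0 | 0 | 0 | 0 | w0 0 |
| 3 | w1w0 | w1w0 | 0 | 0 | 0 | 0 0 |

Figure 4. MBFFT [11]: (a) Control Signals for MUXs; (b) for Memory Access; (c) Signal Flow Graph; and (d)&(e) Simplified Version of Control Signals.

Figure 5(c). Layout View of MBFFT with N=8192.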


III. Proposed FFT Processors

One way to reduce the area of the MBFFT processor in Figure 5(c) is to use single-port memory. Table 1 compares the synthesized results obtained with the Artisan tool for both single-ported and dual-ported RAMs.

Table 1: Comparison for Dual-port vs. Single-port(24-bits/word)

Clearly, single-port memory performs better in both area and speed. However, in each cycle of the FFT operation in Figure 3(a), we need to both read data from the memories and write data to the memories. Both operations can be achieved by a dual-ported memory within one cycle, whereas a single-ported memory may take two cycles to perform the same operations. In order to reduce the cycle time, we develop an alternative design that employs single-port memory with pre-fetch registers to accomplish the memory access within one cycle.

Figure 6(a) shows the proposed MBFFT processor. Two pre-fetch registers are inserted. Basically, the FFT operation is similar to that in Figure 3(a). The processor can be divided into two stages: (a) data storage; and (b) data processing. The right-hand side of Figure 6(b) is the data storage stage, while the left-hand side is the data processing stage. In fact, both stages can be performed in parallel. Figure 6(c) shows the clock control signals of the FFT operation. In this implementation, the FFT control signal is also used for synchronizing the data processing stage. On the other hand, the FFT clock signal is divided into two sub-cycles, as shown in Figure 6(d). The first sub-cycle controls the operation that stores the processed data to the memory, where the memory is triggered by the positive edge of the control signal. At the same time, the data stored in the pre-fetch register is available for the PE to use. At the positive edge of the control signal in the second sub-cycle, data is read from memory and is ready for the pre-fetch register to fetch.

Based on the proposed clock scheme, single-ported memory with pre-fetch registers can accomplish the same task as dual-ported memory. At the same time, the clock cycle is reduced significantly, since the worst-case delay is reduced to that of the data processing stage.
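The two-sub-cycle idea can be illustrated with a small behavioral model. This is a simplified sketch under stated assumptions, not the authors' circuit: it models a single RAM and a single pre-fetch register, treats the PE as an arbitrary function, and uses hypothetical names (`SinglePortRAM`, `run_stream`). The model raises an error if two accesses ever land in the same sub-cycle, which is the constraint a single port imposes.

```python
class SinglePortRAM:
    """Memory with one port: at most one read or write per sub-cycle."""

    def __init__(self, values):
        self.cells = list(values)
        self.log = []  # entries are (cycle, sub_cycle, operation, address)

    def _claim(self, cycle, sub_cycle, operation, address):
        if any(c == cycle and s == sub_cycle for c, s, _, _ in self.log):
            raise RuntimeError(f"port conflict in cycle {cycle}, sub-cycle {sub_cycle}")
        self.log.append((cycle, sub_cycle, operation, address))

    def read(self, cycle, sub_cycle, address):
        self._claim(cycle, sub_cycle, "R", address)
        return self.cells[address]

    def write(self, cycle, sub_cycle, address, value):
        self._claim(cycle, sub_cycle, "W", address)
        self.cells[address] = value


def run_stream(ram, addresses, process):
    """Read-process-write every address in turn, one operand per cycle.

    Sub-cycle 1: write back the previous result while the PE consumes the
    pre-fetch register. Sub-cycle 2: read the next operand into the register.
    Returns the number of cycles used, including pipeline fill and drain.
    """
    prefetch = None  # (address, value) captured in the previous cycle
    pending = None   # (address, result) produced by the PE, not yet written
    total_cycles = len(addresses) + 2
    for cycle in range(total_cycles):
        # Sub-cycle 1: store the processed data, PE uses the pre-fetched data.
        if pending is not None:
            ram.write(cycle, 1, pending[0], pending[1])
        if prefetch is not None:
            pending = (prefetch[0], process(prefetch[1]))
        else:
            pending = None
        # Sub-cycle 2: read the next operand into the pre-fetch register.
        if cycle < len(addresses):
            address = addresses[cycle]
            prefetch = (address, ram.read(cycle, 2, address))
        else:
            prefetch = None
    return total_cycles


if __name__ == "__main__":
    ram = SinglePortRAM([1, 2, 3, 4])
    cycles = run_stream(ram, [0, 1, 2, 3], lambda v: 2 * v)
    print(cycles, ram.cells)
    for entry in ram.log:
        print(entry)
```

In steady state the log shows one write in the first sub-cycle and one read in the second sub-cycle of every cycle, so one operand is consumed per cycle even though the memory has a single port.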

The MBFFT in Figure 6(a) has also been developed and its layout view is given in Figure 7.

Table 1. Area and delay of dual-port and single-port RAMs (24 bits/word):

| Area (mm²) | 8K | 2K | 1K | 512 | 256 |
|---|---|---|---|---|---|
| Dual-port | 2.044 | 1.439 | 1.390 | 1.295 | 1.278 |
| Single-port | 1.635 | 1.237 | 1.205 | 1.204 | 1.143 |

| Delay (ns) | 8K | 2K | 1K | 512 | 256 |
|---|---|---|---|---|---|
| Dual-port | 2.75 | 0.84 | 0.48 | 0.31 | 0.20 |
| Single-port | 1.20 | 0.37 | 0.23 | 0.14 | 0.09 |

Figure 6. MBFFT with Single-port Memories: (a) Schematic; (b) Parallel Structure; and (c) & (d) Timing and Clock Control Signals. Timing labels: W (Write Processed Data to Memory), R (Read Data from Memory), PE (Process Data), Buffer Data is Available.

Figure 7. Layout View of the Proposed MBFFT.

Experimental results show that the area is reduced from 4.12 mm² for the 8K MBFFT with dual-ported memories to 2.04 mm², a reduction of almost 50%. The maximum working frequency is increased from 115 MHz in Figure 5(c) to 198 MHz. The performance improvement is significant.

For OFDM applications, the speed performance of the FFT processor is determined by the work frequency and the latency. In the proposed architecture, the latency is (N/4)+(N/2)log₂N cycles for an N-point FFT operation. In general, the latency can be improved by using more PEs.

As mentioned, the area of a PE in Figure 5(c) is approximately 2.13% of the overall core area, which is not significant compared with the area of the memories. Thus, we may use two PEs as shown in Figure 7(a); the associated signal flow is given in Figure 7(b) for N=16. The circuit in Figure 7(a) has also been implemented. The experimental results show that the core area is 2.33 mm² for N=8192 points (24 bits per word), and the maximal work frequency is 185 MHz. The area of the circuit increases by approximately 14.2% compared with the MBFFT with one PE. In this implementation, for N=8192, four 2K RAMs are employed. Note that both approaches employ a total of 8K words of RAM. However, the total area of four 2K RAMs is approximately 10% more than that of two 4K RAMs. The area increase due to the PE and control logic is not significant. Moreover, the latency of the MBFFT with 2 PEs is reduced to (3N/4)+(N/4)log₂N. In other words, the latency is reduced approximately by a half. Further, for the MBFFT with 4 PEs, the core area is approximately 3.37 mm², the working frequency is 162 MHz, and the latency is N+(N/8)log₂N.

In summary, for the MBFFT with N=8192 points (24 bits per word), the experimental data obtained from the layout are tabulated in Table 2.
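The three latency expressions above can be evaluated directly for N=8192. The sketch below does so; the conversion to microseconds at the reported maximum work frequency is a derived figure added here for illustration and is not reported in the paper, and the function and dictionary names are hypothetical.

```python
def latency_cycles(n, num_pe):
    """Latency in clock cycles for an N-point MBFFT with 1, 2, or 4 PEs,
    using the expressions quoted in the text (log taken to base 2)."""
    assert n >= 8 and n & (n - 1) == 0, "N must be a power of two"
    stages = n.bit_length() - 1  # log2(N)
    if num_pe == 1:
        return n // 4 + (n // 2) * stages
    if num_pe == 2:
        return 3 * n // 4 + (n // 4) * stages
    if num_pe == 4:
        return n + (n // 8) * stages
    raise ValueError("only 1, 2, or 4 PEs are described")


# Maximum work frequencies in MHz for N=8192, as quoted in the text.
MAX_FREQ_MHZ = {1: 198, 2: 185, 4: 162}


def latency_microseconds(n, num_pe):
    """Derived quantity: cycles divided by the clock frequency in MHz."""
    return latency_cycles(n, num_pe) / MAX_FREQ_MHZ[num_pe]


if __name__ == "__main__":
    for pes in (1, 2, 4):
        cycles = latency_cycles(8192, pes)
        print(f"{pes} PE(s): {cycles} cycles, "
              f"{latency_microseconds(8192, pes):.1f} us at {MAX_FREQ_MHZ[pes]} MHz")
```

The cycle counts agree with the latency column of Table 2. Because the work frequency drops as PEs are added, the gain in wall-clock time is somewhat smaller than the gain in cycle count.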

IV. Conclusion

This paper compares MBFFT designs using dual-ported memories and single-ported memories. Results show that the area when using dual-ported memory is about double that when using single-ported memory. Thus, a clock scheme is proposed that uses single-ported memory with pre-fetch registers to reduce the hardware cost. Because the area of the PE in an MBFFT processor is insignificant, this paper also demonstrates the area, maximal work frequency, and latency of MBFFTs with one PE, two PEs, and four PEs. As listed in Table 2, the MBFFT processor with one PE has the smallest area and the highest work frequency, but also the longest latency. In many OFDM applications, the choice of FFT architecture depends upon the speed, latency, and power consumption. The higher the work frequency, the higher the power consumption; however, more data can be processed.

This paper demonstrates the design tradeoffs and assumes an FFT with the simple radix-2 structure. In fact, increasing the number of PEs and/or using a higher radix may improve the design performance at the cost of increased design complexity.

Acknowledgment

This work was supported in part by the Taiwan National Science Council under grant numbers NSC94-2220-E-008-001 and NSC94-2220-E-008-008, and in part by Elan Microelectronics Corp., Taiwan.

Table 2. Performance of Proposed MBFFT (N=8192, 24 bits per word)

| Configuration | Area (mm²) | Max. Work Freq. (MHz) | Latency (cycles) |
|---|---|---|---|
| with 1 PE | 2.04 | 198 | 55,296 |
| with 2 PEs | 2.33 | 185 | 32,768 |
| with 4 PEs | 3.37 | 162 | 21,504 |

Figure 7. Proposed MBFFT with 2 PEs: (a) Schematic; (b) Signal Path Flow with N=16; and (c) Layout View of MBFFT with N=8192.


References

1. R. van Nee and R. Prasad, OFDM for Wireless Multimedia Communications, Artech House, 2000.

2. W.-Y. Zou and Y. Wu, “COFDM: an Overview,” IEEE Trans. on Broadcasting, pp.1-8, March 1995.

3. G. Bi and E.V. Jones, “A Pipelined FFT Processor for Word-Sequence Data,” IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol.37, pp.1982-1985, December 1989.

4. S. He and M. Torkelson, “Designing Pipeline FFT Processor for OFDM (de)modulation,” Proc. of International Symposium on Signals, Systems, and Electronics (ISSSE), pp.257-262, 1998.

5. Y.N. Chang and K.K. Parhi, “An Efficient Pipelined FFT Architecture,” IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, Vol.50, pp.322-325, June 2003.

6. L. Jia, Y. Gao, J. Isoaho, and H. Tenhunen, “A New VLSI-oriented FFT Algorithm and Implementation,” Proc. of 11th Annu. IEEE International ASIC Conference, pp.337-341, September 1998.

7. E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, “A Fast Single-chip Implementation of 8192 Complex Points FFT,” IEEE Journal of Solid-State Circuits, Vol.20, pp.205-300, March 1995.

8. J. Lee, H. Lee, S-I Cho, and S.-S. Choi, “A High-speed Low-Complexity Radix-2^4 FFT Processor for MB-OFDM UWB Systems,” Proc. of IEEE International Symp. on Circuits and Systems, pp.4719-4722, May 2006.

9. Y.-W. Lin, H.-Y. Liu, and C.-Y Lee, “A Dynamic Scaling FFT Processor for DVB-T Applications,” IEEE Journal of Solid-State Circuits, Vol.39, pp.2005-2013, November 2004.

10. C.M. Wu, M.D. Shieh, H.F. Lo, and M.H. Hu, “Implementation of Channel Demodulator for DAB Systems,” 2003 IEEE International Symposium on Circuits and Systems, Vol.2, pp.25-28, May 2003.

11. C.-K. Chang, C.-P. Hung, and S.-G. Chen, “An Efficient Memory-based FFT Architecture,” Proc. of IEEE International Symposium on Circuits and Systems, pp.129-132, 2003.

12. C.L. Wey, W.-C. Tang, and S.Y. Lin, “Efficient Memory-Based FFT Architectures for Digital Video Broadcasting (DVB-T/H),” Proc. of International Symp. on VLSI Design, Automation, and Test (VLSI-DAT), Hsinchu, Taiwan, April 2007.

13. C.L. Wey, S.-Y. Lin, and W.-C. Tang, “Efficient VLSI Implementation of Memory-Based FFT Processors for DVB-T Applications,” Proc. of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Porto Alegre, Brazil, May 2007.

