
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS

Efficient VLSI Architecture for Decimation-in-Time Fast Fourier Transform of Real-Valued Data

Pramod K. Meher, Senior Member, IEEE, Basant K. Mohanty, Senior Member, IEEE, Sujit K. Patel, Soumya Ganguly, Thambipillai Srikanthan, Senior Member, IEEE

Abstract: The decimation-in-time (DIT) fast Fourier transform (FFT) very often has an advantage over the decimation-in-frequency (DIF) FFT for most real-valued applications, such as speech/image/video processing, biomedical signal processing, and time-series analysis, since it does not require any output reordering. Besides, the DIT FFT butterfly involves less computation time than its DIF counterpart. In this paper, we present an efficient architecture for the radix-2 DIT real-valued FFT (RFFT). We present the necessary mathematical formulation for removing the redundancies in the radix-2 DIT RFFT, and a formulation to regularize its flow graph to facilitate folded computation with a simple control unit. We propose a register-based storage design which involves significantly less area at the cost of slightly higher latency compared with the conventional RAM-based storage. The address generation for folded in-place DIT RFFT computation with register-based storage is challenging since both read and write operations are performed in the same clock cycle at different locations. Therefore, we present a simple formulation of address generation for the proposed radix-2 DIT RFFT structure. The proposed structure involves ∼61% less area and ∼40% less power consumption than those of [8], on average, for FFT sizes 16, 32, 64, and 128. It involves ∼70% less area-delay product and ∼57% less energy per sample than those of [8], on average, for the same FFT sizes.

Index Terms: Fast Fourier transform, FFT, in-place computation, real-valued FFT, decimation-in-time FFT

I. INTRODUCTION

THE fast Fourier transform (FFT) algorithm is frequently encountered in almost every application area of digital signal processing. There are several applications, such as speech, audio, image, and video processing, where the FFT is very often performed on real-valued signals. Efficient realization of the FFT of real-valued signals has received further attention recently due to the emergence of biomedical signal processing and the wide application of real-valued time-series analysis.

Manuscript submitted on 08 May 2015, revised August 9, 2015, and September 20, 2015. This paper was recommended by Associate Editor Hakan Johansson.

P. K. Meher and T. Srikanthan are with the School of Computer Engineering, Nanyang Technological University, Singapore, Email: {aspkmeher, astsrikan}@ntu.edu.sg.

B. K. Mohanty and S. K. Patel are with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226, Email: [email protected], [email protected].

S. Ganguly was with the EEE Department, BITS, Hyderabad, Andhra Pradesh, India. He worked in the School of Computer Engineering, Nanyang Technological University, Singapore, from June 2014 to November 2014. Email: [email protected]

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

Fig. 1. The key differences in the DIT and DIF FFT computation. (a) DIT FFT processing. (b) DIF FFT processing. (c) DIT FFT butterfly. (d) DIF FFT butterfly. m is a power-of-2 integer.

The FFT of a real-valued signal exhibits conjugate symmetry, which renders half of the FFT outputs redundant [1]. The conjugate symmetry of the real-valued FFT (RFFT) is used to compute the FFTs of a pair of N-point real-valued sequences from the N-point FFT of one complex-valued sequence¹. There has been a continued effort for several decades on reducing the hardware cost of VLSI implementation of the FFT and improving its performance [2]. Some efforts have been made for efficient VLSI implementation of the RFFT as well.
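The pairing of two real sequences into one complex transform can be illustrated with a minimal Python sketch. It assumes a direct O(N²) DFT as a stand-in for any FFT routine; the function names `dft` and `two_real_dfts` are illustrative and do not come from the paper or from any library.

```python
import cmath

def dft(z):
    """Direct O(N^2) DFT, used here only as a reference transform."""
    n = len(z)
    return [sum(z[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def two_real_dfts(x1, x2):
    """DFTs of two real sequences from one complex DFT of z = x1 + j*x2."""
    n = len(x1)
    z = dft([complex(a, b) for a, b in zip(x1, x2)])
    X1, X2 = [], []
    for k in range(n):
        zc = z[(n - k) % n].conjugate()   # Z*(N-k), index taken modulo N
        X1.append((z[k] + zc) / 2)        # conjugate-symmetric part -> DFT of x1
        X2.append((z[k] - zc) / 2j)       # conjugate-antisymmetric part -> DFT of x2
    return X1, X2
```

The separation works only because each real sequence has a conjugate-symmetric spectrum, X(k) = X*(N-k), which is the same redundancy the RFFT formulation exploits.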

Some folded pipeline architectures have been proposed for the computation of RFFT [3], [4], where butterfly operations are multiplexed into a small logic unit. The structures in [3] and [4] could provide adequate throughput for some applications, but the storage complexity of those structures continues to be very high. A few in-place architectures have also been proposed for RFFT using specialized packing algorithms [5], [6]. Memory conflict for read/write operations is found to be the major challenge in the design of algorithms and architectures for in-place computation [7]. Recently, an in-place architecture and a conflict-free memory addressing scheme have been proposed for continuous processing of RFFT [8].

The FFT algorithms are classified into two broad categories, namely, the decimation-in-time (DIT) and the decimation-in-frequency (DIF) algorithms. The key differences between the two are shown in Fig. 1. In case of the DIF algorithm (Fig. 1(a)), the input samples are fed to the computing structure in their

¹This is sometimes referred to as the 'Bach and Rock' approach, since one can find the FFTs of two independent streams of real-valued data, e.g., a classical track by J. S. Bach and the music of a rock group, using this approach. Open-source FFT libraries based on this approach have been developed by the Center for Astronomy Signal Processing and Electronics Research (CASPER) of UC Berkeley (https://casper.berkeley.edu/).


TABLE I
COMPUTATIONAL DELAY OF MULT-ADD AND ADD-MULT OPERATIONS BASED ON 65 NM CMOS TECHNOLOGY LIBRARY (IN NS)

Input size   TMA    TAM    TAM - TMA   % Difference
8-bit        1.24   1.30   0.06        4.84%
16-bit       2.29   2.48   0.19        8.30%
24-bit       3.52   3.82   0.30        8.52%

TMA is the computation time of multiplication followed by addition. TAM is the computation time of addition followed by multiplication.

natural order, while the outputs are generated in bit-reversed order. On the other hand, in case of the DIT algorithm (Fig. 1(b)), the input samples need bit-reversal reordering before being processed, while the output FFT coefficients are generated in natural order. In different RFFT applications, such as image and video processing, biomedical signal processing, and time-series analysis, the complete input sequence is generally available at the same time for the FFT computation. The DIT RFFT has an advantage over the DIF form for these applications, since a DIT RFFT structure need not wait for the arrival of input samples but can produce the outputs as soon as they are computed.
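The bit-reversal reordering mentioned above can be sketched in a few lines of Python. This is only an illustration of the standard permutation for a power-of-2 length; the helper names are hypothetical.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the next low bit of i into r
        i >>= 1
    return r

def bit_reversed_order(n):
    """Bit-reversed permutation of range(n); n must be a power of 2."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of 2")
    return [bit_reverse(i, bits) for i in range(n)]

print(bit_reversed_order(16))
# [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
```

For N = 16, a DIT structure therefore consumes x(0), x(8), x(4), x(12), and so on, while a DIF structure would emit its outputs in that order.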

As shown in Figs. 1(c) and 1(d), the DIF butterfly involves an addition followed by a multiplication, while the DIT butterfly involves a multiplication followed by additions. As shown in Table I, the computation time of a fixed-point multiplication followed by an addition is less than that of an addition followed by a multiplication. The DIT-based RFFT butterfly thus involves less propagation delay than the DIF-based RFFT butterfly, although both butterflies involve the same number of multipliers and adders. Therefore, the choice of the DIT algorithm to derive the RFFT structure has an advantage over the DIF algorithm. In this paper, we present an efficient architecture for the DIT radix-2 RFFT algorithm. The main contributions of this paper, as discussed in the next three sections, are:

1) Mathematical formulation of the radix-2 DIT RFFT algorithm using real-valued arithmetic.

2) Derivation of a regularized flow graph for folded computation of radix-2 DIT RFFT using a simple control.

3) Formulation of address generation for the folded in-place radix-2 DIT RFFT algorithm and derivation of the proposed RFFT structure using register-based storage.

Hardware and time complexities along with a performance comparison are presented in Section V. Conclusions are presented in Section VI.

II. REAL-VALUED MATHEMATICAL FORMULATION AND DERIVATION OF FLOW GRAPH FOR RADIX-2 RFFT

A. Radix-2 DIT RFFT using Real-valued Arithmetic

The DFT of an N -point sequence {x(n)} is given by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \tag{1}$$

for k, n = 0, 1, ..., N − 1. When N is an even integer, the input sequence {x(n)} can be decomposed into two sub-sequences $\{x_0(m)\}$ and $\{x_1(m)\}$ of size (N/2) each, comprised of the even-indexed and odd-indexed elements of {x(n)}, respectively, such that $x_0(m) = x(2m)$ and $x_1(m) = x(2m+1)$ for 0 ≤ m ≤ (N/2) − 1. Using the time-decimated sub-sequences $\{x_0(m)\}$ and $\{x_1(m)\}$, (1) can be written as

$$X(k) = \sum_{m=0}^{(N/2)-1} \left[ x_0(m)\, e^{-j 2\pi k (2m)/N} + x_1(m)\, e^{-j 2\pi k (2m+1)/N} \right] \tag{2}$$

for k = 0, 1, ..., N − 1. Equation (2) can then be split into real and imaginary parts as

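$$X(k) = [A_0(k) + D(k)] - j[B_0(k) + E(k)] \tag{3a}$$
$$X(k + N/2) = [A_0(k) - D(k)] - j[B_0(k) - E(k)] \tag{3b}$$

where

$$D(k) = A_1(k)\cos(2\pi k/N) - B_1(k)\sin(2\pi k/N) \tag{4a}$$
$$E(k) = B_1(k)\cos(2\pi k/N) + A_1(k)\sin(2\pi k/N) \tag{4b}$$
$$A_i(k) = \sum_{m=0}^{(N/2)-1} x_i(m)\cos\left(\frac{2\pi k m}{N/2}\right) \tag{4c}$$
$$B_i(k) = \sum_{m=0}^{(N/2)-1} x_i(m)\sin\left(\frac{2\pi k m}{N/2}\right) \tag{4d}$$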

for k = 0, 1, 2, ..., (N/2) − 1. Note that $A_i(k)$ and $B_i(k)$ are, respectively, the real and negative imaginary parts of the (N/2)-point DFT of the sub-sequence $\{x_i(m)\}$, for i = 0 and 1. When N/2 is again an even integer, each of the sub-sequences $\{x_0(m)\}$ and $\{x_1(m)\}$ can be further decomposed into two sub-sequences corresponding to their even- and odd-indexed elements. When N is a power-of-2 integer, such decomposition can continue recursively to compute an N-point RFFT using a DIT flow graph of $P = \log_2 N$ stages of butterfly (BF) computation. The real and imaginary parts of the p-th stage (for 1 ≤ p ≤ P) of the DIT RFFT, $\{A^p(k), B^p(k)\}$, can be computed from the real and imaginary parts of the (p − 1)-th stage of the BF output using the following relations:

$$\begin{aligned}
A_0^p(k) &= A_0^q(k) + D_0^p(k), & B_0^p(k) &= B_0^q(k) + E_0^p(k),\\
A_0^p(k') &= A_0^q(k) - D_0^p(k), & B_0^p(k') &= B_0^q(k) - E_0^p(k),\\
A_1^p(k) &= A_0^q(k') + D_1^p(k), & B_1^p(k) &= B_0^q(k') + E_1^p(k),\\
A_1^p(k') &= A_0^q(k') - D_1^p(k), & B_1^p(k') &= B_0^q(k') - E_1^p(k),\\
D_0^p(k) &= A_1^q(k)\,c_k - B_1^q(k)\,s_k, & E_0^p(k) &= B_1^q(k)\,c_k + A_1^q(k)\,s_k,\\
D_1^p(k) &= A_1^q(k')\,c_k - B_1^q(k')\,s_k, & E_1^p(k) &= B_1^q(k')\,c_k + A_1^q(k')\,s_k,
\end{aligned} \tag{5}$$

where $q = p - 1$, $k' = k + (N/2^{p'})$, $c_k = \cos(2^{p'}\pi k/N)$, $s_k = \sin(2^{p'}\pi k/N)$, for $0 \le k \le (N/2^{p'}) - 1$, $p' = P - p + 1$, $1 \le p \le P$, and $P = \log_2 N$. Note that $A_0^P(k)$ and $B_0^P(k)$ are, respectively, the real and the negative imaginary parts of the N-point DFT sequence {X(k)}. The radix-2 RFFT given by (5) is optimized further using the symmetry property of the real-valued DFT, as discussed in the following subsection.
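The single-level split of (3a)-(4d) can be checked numerically with a short Python sketch. It is only a verification aid under stated assumptions: the two (N/2)-point sub-transforms are evaluated by direct summation rather than recursively, and the function names are illustrative.

```python
import math

def real_dft_parts(x):
    """Return (A, B): real and negative-imaginary parts of the DFT of real x."""
    n = len(x)
    A = [sum(x[m] * math.cos(2 * math.pi * k * m / n) for m in range(n))
         for k in range(n)]
    B = [sum(x[m] * math.sin(2 * math.pi * k * m / n) for m in range(n))
         for k in range(n)]
    return A, B

def dit_stage(x):
    """Combine the (N/2)-point parts of even/odd samples using (3a)-(4b)."""
    n = len(x)
    half = n // 2
    A0, B0 = real_dft_parts(x[0::2])   # even-indexed sub-sequence x0(m)
    A1, B1 = real_dft_parts(x[1::2])   # odd-indexed sub-sequence x1(m)
    A, B = [0.0] * n, [0.0] * n
    for k in range(half):
        c = math.cos(2 * math.pi * k / n)
        s = math.sin(2 * math.pi * k / n)
        D = A1[k] * c - B1[k] * s      # (4a)
        E = B1[k] * c + A1[k] * s      # (4b)
        A[k], B[k] = A0[k] + D, B0[k] + E                # (3a)
        A[k + half], B[k + half] = A0[k] - D, B0[k] - E  # (3b)
    return A, B
```

With X(k) = A(k) − jB(k), the result of `dit_stage(x)` should agree with `real_dft_parts(x)` up to rounding, which confirms that only real multiplications and additions are needed in the butterfly.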


B. Optimized Radix-2 DIT Real-valued FFT

For a real-valued N-point sequence {x(n)}, X(0) and X(N/2) are real-valued since their imaginary parts are zero, and $X^*(N - k) = X(k)$. Therefore, A(k) = A(N − k) and B(k) = −B(N − k). Due to the redundancies resulting from the symmetric and anti-symmetric behaviour of A(k) and B(k), only half the number of DFT components need to be computed at each stage of BF computation. Further, the sine and the cosine functions of (5) are, respectively, symmetric and anti-symmetric for k and $[(N/2^{p'}) - k]$, where $0 \le k \le (N/2^{p'}) - 1$, for $p' = P - p + 1$. Using this feature, the radix-2 RFFT given by (5) is optimized, and generalized expressions for the computation of the real parts $\{A^p(k)\}$ and the negative imaginary parts $\{B^p(k)\}$ of the p-th BF stage of the RFFT are obtained as follows:

$$\begin{aligned}
A_0^p(k) &= A_0^q(k) + D_0^p(k), & A_0^p(M_1) &= A_0^q(k) - D_0^p(k),\\
B_0^p(k) &= B_0^q(k) + E_0^p(k), & B_0^p(M_1) &= E_0^p(k) - B_0^q(k),\\
A_0^p(M/2) &= A_0^q(M/2), & B_0^p(M/2) &= A_1^q(M/2),\\
A_0^p(0) &= A_0^q(0) + A_1^q(0), & A_0^p(M) &= A_0^q(0) - A_1^q(0),\\
A_1^p(k) &= A_0^q(M_2) + D_1^p(k), & A_1^p(M_1) &= A_0^q(M_2) - D_1^p(k),\\
B_1^p(k) &= B_1^q(M_2) + E_1^p(k), & B_1^p(M_1) &= E_1^p(k) - B_1^q(M_2),\\
A_1^p(M/2) &= A_0^q(M + M/2), & B_1^p(M/2) &= A_1^q(M + M/2),\\
A_1^p(0) &= A_0^q(M) + A_1^q(M), & A_1^p(M) &= A_0^q(M) - A_1^q(M),\\
B_0^p(0) &= B_0^p(M) = B_1^p(0) = B_1^p(M) = 0.
\end{aligned} \tag{6}$$

for $1 \le k \le M - 1$ except $k = M/2$, where $q = p - 1$, $M = N/2^{p'}$, $M_1 = M - k$, $M_2 = M + k$, $p' = P - p + 1$, $1 \le p \le P$, and $P = \log_2 N$. We may define the real-valued sequence $\{x(n)\} = \{A_0^0(k)\} = \{x_0^P(m)\}$; $\{A_1^0(k)\} = \{x_1^P(m)\}$, $B_0^0(k) = 0$, $B_1^0(k) = 0$, for $0 \le (m, k) \le (N/2) - 1$, where $\{x_0^P(m)\}$ and $\{x_1^P(m)\}$ are, respectively, the even- and the odd-indexed sub-sequences obtained after the P-th time-decimation of the sequence {x(n)}. Using (6), the flow graph for the 16-point real-valued FFT is derived and shown in Fig. 2. As shown in Fig. 2, the flow graph corresponding to the optimized RFFT (referred to as the optimized flow graph) involves 25 BFs (2-point), while the original flow graph involves 49 such BFs. In general, the number of 2-point BFs of the original flow graph (α) and the optimized flow graph (β) of an N-point real-valued FFT can be calculated using the following relations.

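$$\alpha = N\left[\frac{5}{4} + \frac{7}{8} + \frac{15}{16} + \cdots + \frac{2^{P} - 1}{2^{P}}\right] \tag{6a}$$

$$\beta = N\left[\frac{3}{4} + \frac{3}{8} + \frac{7}{16} + \cdots + \frac{2^{P-1} - 1}{2^{P}}\right] \tag{6b}$$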
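The two counts can be evaluated with a small Python sketch. It assumes one reading of the elided terms in (6a) and (6b): after the leading 5/4 (respectively 3/4) term, the j-th term is (2^j − 1)/2^j (respectively (2^(j−1) − 1)/2^j) for j = 3, ..., P. The function name is hypothetical.

```python
def butterfly_counts(n):
    """Return (alpha, beta) of (6a)-(6b) for a power-of-2 size n >= 8."""
    p = n.bit_length() - 1
    if n < 8 or (1 << p) != n:
        raise ValueError("n must be a power of 2 and at least 8")
    # n is divisible by 2**j for every j <= p, so integer division is exact.
    alpha = 5 * n // 4 + sum(n * (2**j - 1) // 2**j for j in range(3, p + 1))
    beta = 3 * n // 4 + sum(n * (2**(j - 1) - 1) // 2**j for j in range(3, p + 1))
    return alpha, beta

print(butterfly_counts(16))   # (49, 25)
```

Under this reading, N = 16 reproduces the 49 and 25 butterflies quoted for the original and optimized flow graphs.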

The optimized flow graph is not regular, and the numbers of BFs involved in different stages of computation are not the same. For example, stage-1, stage-2, stage-3, and stage-4 in Fig. 2 involve (8, 4, 6, and 7) BFs, respectively. From the optimized flow graphs of DIT RFFT of various sizes, we observe that the intermediate results E from stage-2 onwards pass through type-II BFs, whereas the rest of the inputs and intermediate results are processed by type-I BFs. The irregular distribution of type-I and type-II BFs of the optimized flow graph requires a complex control unit, particularly when the BF operations

Fig. 2. Optimized flow graph of 16-point DIT real-valued FFT. $c_l = \cos(2\pi l/16)$ and $s_l = \sin(2\pi l/16)$, where l = 1, 2, and 3.

Fig. 3. (a) Type-I butterfly operation. (b) Type-II butterfly operation. (c) 4-point butterfly operation.

are mapped onto a folded structure. Moreover, conflict-free memory access is another issue associated with in-place computation by a folded structure. The optimized flow graph therefore requires highly complex control due to its irregular data-flow. To resolve these issues, we present here a modified flow graph (MFG), which leads to a simplified control structure for conflict-free memory access during in-place computation by a folded structure.

III. REGULARIZATION OF FLOW GRAPH AND CONTROL DESIGN FOR RFFT COMPUTATION

From the optimized flow graph of the 16-point RFFT (Fig. 2), we find that stage-1, stage-2, stage-3, and stage-4, respectively, involve (8, 4, 4, and 4) type-I BFs and (0, 0, 2, and 3) type-II BFs. Similarly, stage-1, stage-2, stage-3, stage-4, and stage-5 of the optimized flow graph of the 32-point DIT RFFT, respectively, involve (16, 8, 8, 8, and 8) type-I BFs and (0, 0, 4, 6, and 7) type-II BFs. Interestingly, the numbers of type-I and type-II BFs are nearly equal from stage-3 onwards. Therefore, we use a 4-input hybrid BF comprised of a type-I BF and a type-II BF as the basic building block to derive a regular flow graph. The input pattern of the original flow graph (Fig. 2) is modified to match the input pattern of the 4-input BF [shown in Fig. 3(c)]. The BFs of stage-3 and stage-4 are also reordered accordingly. A few redundant arithmetic operations are introduced to transform the original flow graph to a regular form without changing the intermediate signal values. The MFG of the 16-point DIT RFFT is shown in Fig. 4. The dashed lines in Fig. 4 represent the absence of a signal path. As shown in Fig. 4, each stage of the MFG involves four 4-point BFs.
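The per-stage counts quoted above follow a simple pattern, sketched below in Python. The closed form is an extrapolation fitted to the two sizes given in the text (N = 16 and N = 32), not a formula stated by the authors, and the function name is hypothetical.

```python
def bf_counts_per_stage(n):
    """Per-stage (type-I, type-II) 2-point BF counts for a power-of-2 size n >= 4.

    Assumed pattern: N/2 type-I BFs in stage-1 and N/4 afterwards;
    no type-II BFs in stages 1-2 and N/4 - N/2**p in stage p >= 3.
    """
    stages = n.bit_length() - 1
    if n < 4 or (1 << stages) != n:
        raise ValueError("n must be a power of 2 and at least 4")
    type1 = [n // 2] + [n // 4] * (stages - 1)
    type2 = [0, 0] + [n // 4 - n // 2**p for p in range(3, stages + 1)]
    return type1, type2

print(bf_counts_per_stage(16))   # ([8, 4, 4, 4], [0, 0, 2, 3])
print(bf_counts_per_stage(32))   # ([16, 8, 8, 8, 8], [0, 0, 4, 6, 7])
```

Because the type-II count never exceeds N/4, pairing one type-I and one type-II BF into a 4-input hybrid BF gives at most N/4 hybrid BFs per stage, which is the regular shape of the MFG; the unused type-II slots correspond to the redundant operations mentioned above.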

A. Regularization of Flow graph for Real-valued FFT

The data-flow of the MFG for in-place computation of RFFT is shown in Fig. 5. The index of the input sequence {x(n)}


Fig. 4. Modified flow graph (MFG) of 16-point DIT RFFT. The dashed lines represent the absence of a signal path.

Fig. 5. DFG for in-place computation of 16-point decimation-in-time real-valued FFT.

is in natural order, and the indices of all the intermediate and output signals match the requirement of folded in-place computation. The horizontal and vertical dashed lines of Fig. 5 represent the possible folding of the MFG. Each small rectangular box of Fig. 5 represents a 4-point BF. The MFG of the 16-point RFFT involves sixteen 4-point hybrid BFs. The computation of those BFs is scheduled in 16 successive clock cycles such that the 4 BFs of each stage are scheduled in 4 successive clock cycles. The 16 BF operations of Fig. 5 are numbered according to their schedule in a folded structure. Therefore, the computations of stage-1, stage-2, stage-3, and stage-4 of the 16-point RFFT are folded and scheduled during clock cycles 1-4, 5-8, 9-12, and 13-16 to be processed in a 4-point BF processing unit. In general, an N-point RFFT involves $[(N/4)\log_2 N]$ folded 4-point BF operations. The intermediate coefficients v(k) and w(k) are reordered before they are scheduled to be processed during the next stage of computation. This reordering requires changing the storage access order in case of in-place computation and does not involve any extra clock cycle. Therefore, the entire in-place computation of the 16-point RFFT MFG can be completed in 16 clock cycles using one 4-input BF processing unit.
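The folding described above amounts to a simple mapping from clock cycle to BF stage. A minimal Python sketch, assuming one 4-point BF per cycle and N/4 cycles per stage as stated in the text (the function name is illustrative):

```python
def folded_schedule(n):
    """Map each clock cycle (1-based) to (stage, BF index within the stage)."""
    stages = n.bit_length() - 1          # log2(N) for a power-of-2 N
    per_stage = n // 4                   # 4-point BFs (and cycles) per stage
    return {cc: ((cc - 1) // per_stage + 1, (cc - 1) % per_stage + 1)
            for cc in range(1, stages * per_stage + 1)}

schedule = folded_schedule(16)
print(len(schedule))   # 16 cycles = (N/4) * log2(N)
print(schedule[5])     # (2, 1): first BF of stage-2
```

For N = 16 this places stage-1 in cycles 1-4, stage-2 in 5-8, stage-3 in 9-12, and stage-4 in 13-16.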

The MFG of the optimized 32-point radix-2 RFFT is shown in Fig. 6. We find from this figure that from stage-4 onwards the order of occurrence of the twiddle factors $\{c_l, s_l\}$ deviates from the natural order of their index l. In case of the 32-point RFFT, the order of occurrence of the twiddle factors in terms of their index l in stage-5 is {0, 1, 2, 3, 4, 7, 6, 5}, where $\{c_l, s_l\} := \{\cos(2\pi l/32), \sin(2\pi l/32)\}$. Similarly, the order of the twiddle factors in stage-6 in case of the 64-point RFFT is {0, 1, 2, 3, 4, 7, 6, 5, 8, 15, 14, 13, 12, 9, 10, 11}, where $\{c_l, s_l\} := \{\cos(2\pi l/64), \sin(2\pi l/64)\}$. After close observation of the order of occurrence of the twiddle factors in different stages of the MFG of RFFT of sizes N = 8, 16, 32, 64, 128, and 256, we find that the twiddle factor indices satisfy a periodic property. Using that periodic property, the order of occurrence of the twiddle factors for the p-th stage can be obtained easily from that of the (p − 1)-th stage. The order of the twiddle factors of a given BF stage of the MFG is obtained iteratively using the following relations:

$$t_i^p = t_i^{p-1} \quad \text{and} \quad t_{2^{p-3}+i}^p = 2^{p-2} - t_i^{p-1} \tag{7}$$

for $1 \le i \le 2^{p-3} - 1$, $1 \le p \le \log_2 N$, $t_0^p = t_0^{p-1}$, and $t_{2^{p-3}}^p = 2^{p-3}$. Using these relations, we can find the order of occurrence of the twiddle factors in different BF stages of RFFT of different sizes, as shown in Table II. The sequence of occurrence of the twiddle factors of Table II can be used to


Fig. 6. Modified DFG of 32-point decimation-in-time real-valued FFT.

TABLE II
TWIDDLE FACTOR INDEX SEQUENCE USED IN THE MODIFIED DFG

BF stage            Twiddle factor index sequence
stage-3 (N = 8)     0,1
stage-4 (N = 16)    0,1,2,3
stage-5 (N = 32)    0,1,2,3,4,7,6,5
stage-6 (N = 64)    0,1,2,3,4,7,6,5,8,15,14,13,12,9,10,11
stage-7 (N = 128)   0,1,2,3,4,7,6,5,8,15,14,13,12,9,10,11,
                    16,31,30,29,28,25,26,27,24,17,18,19,20,23,22,21
stage-8 (N = 256)   0,1,2,3,4,7,6,5,8,15,14,13,12,9,10,11,
                    16,31,30,29,28,25,26,27,24,17,18,19,20,23,22,21,
                    32,63,62,61,60,57,58,59,56,49,50,51,52,55,54,53,
                    48,33,34,35,36,39,38,37,40,47,46,45,44,41,42,43

derive the MFG of the DIT RFFT of different lengths.
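The iteration in (7) can be sketched in Python as follows. The sketch starts from the stage-3 sequence {0, 1} listed in Table II and is restricted to p ≥ 3; the function name is hypothetical.

```python
def twiddle_index_sequence(p):
    """Twiddle-factor index order for BF stage p (p >= 3), built with (7)."""
    if p < 3:
        raise ValueError("this sketch starts from the stage-3 sequence")
    seq = [0, 1]                          # stage-3 sequence from Table II
    for stage in range(4, p + 1):
        half = 2 ** (stage - 3)           # length of the previous sequence
        top = 2 ** (stage - 2)
        # First half: copy of the previous stage. Entry `half` is fixed to
        # `half`; the remaining entries are top - (previous entries 1..half-1).
        seq = seq + [half] + [top - seq[i] for i in range(1, half)]
    return seq

print(twiddle_index_sequence(5))   # [0, 1, 2, 3, 4, 7, 6, 5]
```

Each stage doubles the length of the sequence, so the order for stage p can be produced from that of stage p − 1 with one subtraction per new entry.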

B. The Control Design and Storage Structure

In the proposed structure, the in-place RFFT computation based on the MFG of Fig. 4 is performed by one 4-point BF processing unit and a data storage unit. The data storage unit stores the inputs as well as the intermediate results produced after the computation of different stages. In every clock cycle, 4 input samples/intermediate results are read from the data storage unit, and four result values are written into the same storage locations. As shown in Fig. 4, the data-input pattern changes from BF stage-2 onwards. The change in input pattern increases the complexity of the address generation unit, which must be handled to perform appropriate read/write operations with the register banks. From the analysis of the MFG of Fig. 4, we find that only one pair of input values out of the two pairs in each 4-point BF of stage-3 or higher stages needs to be reordered, while the other pair of input samples is fed directly from the corresponding BF of the previous stage. However, the reordered input pair could be either of the two pairs. To resolve the addressing issue, we swap the input pairs before writing them into the register banks. The corresponding pairs


TABLE III
READ AND WRITE OPERATIONS FOR DIFFERENT STAGES OF BUTTERFLY OPERATIONS OF 16-POINT RFFT

Clock cycle   Reading for BF stage   Writing after BF stage
1-4           S1                     S2
5-8           S2                     S3
9-12          S3                     S4
13-16         S4                     S1

Si, for i = 1, 2, 3, and 4, represents the i-th butterfly stage.

of data, when read from the register banks, also need to be reverse-swapped to feed appropriate data to the BF processing unit. Therefore, by introducing two data-swapping operations, the memory conflict for in-place computation of DIT RFFT is completely resolved.

The data storage unit is comprised of four register banks (RB1, RB2, RB3, and RB4), so that a block of 4 words can be read/stored from/in the four banks in each cycle using a common address for reference. Besides, the data storage unit consists of two data-selectors (DS1 and DS2) for swapping of data before and after the write and read operations, respectively. The internal structure of the data storage unit is shown in Fig. 7. We implement each register bank by (N/4) registers where both read and write operations can be performed in the same clock cycle. We discuss here the control design for the proposed register-based storage unit of the 16-point DIT RFFT. A similar control design, however, can easily be derived for higher FFT sizes as well.

For the 16-point RFFT, each register bank (shown in Fig. 8) consists of (N/4) = 4 registers, one 2-to-4 line decoder, and one 4-to-1 line multiplexer (word-level). During each clock cycle of a set of 4 clock cycles, the content of one register of

Fig. 7. Data storage unit of 16-point RFFT for folded in-place computation.

Fig. 8. (a) Register-bank for N = 16. (b) Structure of data-selector (DS).

TABLE IV
REGISTER BANK LOADING OF IN-PLACE 16-POINT RFFT

CC       RB1     RB2     RB3     RB4
CC0.1    x(0)    x(12)   x(8)    x(4)
CC0.2    x(2)    x(14)   x(10)   x(6)
CC0.3    x(1)    x(13)   x(9)    x(5)
CC0.4    x(3)    x(15)   x(11)   x(7)
CC1.1    u(0)    u(8)    u(4)    u(12)
CC1.2    u(2)    u(10)   u(6)    u(14)
CC1.3    u(1)    u(9)    u(5)    u(13)
CC1.4    u(3)    u(11)   u(7)    u(15)
CC1.5    v(0)    v(4)    v(8)    v(12)
CC1.6    v(10)   v(14)   v(2)    v(6)
CC1.7    v(1)    v(5)    v(9)    v(13)
CC1.8    v(11)   v(15)   v(3)    v(7)
CC1.9    w(0)    w(2)    w(10)   w(14)
CC1.10   w(8)    w(12)   w(4)    w(6)
CC1.11   w(5)    w(7)    w(9)    w(13)
CC1.12   w(11)   w(15)   w(1)    w(3)
CC1.13   x(16)   x(28)   x(24)   x(20)
CC1.14   x(18)   x(30)   x(26)   x(22)
CC1.15   x(17)   x(29)   x(25)   x(21)
CC1.16   x(19)   x(31)   x(27)   x(23)

each bank is read according to the address available at the MUX select line, and a new value is written during the next clock transition in the same register, such that each register of a particular register bank is accessed (for read followed by write) once in every four cycles. During each clock cycle, four data values are read from the four register banks to perform the BF operation of the i-th BF stage, and the outputs of the BF operations are written back into the same locations (registers) during the next clock transition for the (i + 1)-th stage, for $1 \le i \le \log_2 N$. For continuous processing, the write operations with the register banks for the first stage of the next set of N input samples are performed concurrently with the read operations of the $(\log_2 N)$-th stage (the final stage) of BF operation of the current set of inputs. The read and write operations with the register banks for different stages of BF operation of the 16-point RFFT are shown in Table III for a set of 16 clock cycles, which is repeated for the next set of 16 clock cycles to process the next 16-point input sequence. Based on the read/write pattern of Table III and the necessary data reordering for conflict-free continuous read/write operations with the register banks, it is possible to perform the in-place computation of RFFT. The register-bank loading for RFFT computation of two successive 16-point inputs x1 = {x(0), x(1), ..., x(15)} and x2 = {x(16), x(17), ..., x(31)} is given in Table IV for 20 clock cycles, where clock cycles CC0.1 to CC0.4 are the four clock cycles during which the input vector x1 is loaded into the register banks. During the 16 clock cycles CC1.1 to CC1.16, the FFT of input x1 is computed. The 16-point intermediate data vectors u1, v1, and w1 corresponding to BF stage-1, stage-2, and stage-3 of the 16-point RFFT are loaded into the register banks during CC1.1 to CC1.4, CC1.5 to CC1.8, and CC1.9 to CC1.12, respectively. During CC1.13 to CC1.16, the FFT output of x1 is delivered, and during the same period the next input sequence x2 is loaded into the register banks. One can find from Table IV and Fig. 5 that the necessary data blocks can be read and written continuously from and to the four register banks without any conflict during different BF stages. The


TABLE VI
READ ADDRESS (RA) AND WRITE ADDRESS (WA) OF MEMORY BANKS AND CONTROL SIGNALS FOR DATA-SELECTORS

Clock cycle   RA1/RA2   RA3/RA4   WA1/WA2   WA3/WA4   ctr1   ctr2
1             00        00        00        11        0      0
2             01        01        01        10        0      0
3             10        10        10        01        0      0
4             11        11        11        00        0      0
5             00        00        00        00        0      0
6             01        01        01        01        1      0
7             10        10        10        10        0      0
8             11        11        11        11        1      0
9             00        01        00        00        0      0
10            01        00        01        01        0      1
11            10        11        10        10        1      0
12            11        10        11        11        1      1
13            00        11        00        01        0      0
14            01        10        01        00        0      0
15            10        01        10        11        0      1
16            11        00        11        10        0      1

input and output data-flow of DS1 and DS2 are given in Table V for the 16 processing cycles of x1. It may be noted that the loading of each subsequent input sequence happens during the last four clock cycles of the 16-cycle processing period of the previous input sequence. Using Table IV and Table V, the necessary read-addresses (RA) and write-addresses (WA) for the register-banks and the control signals ctr1 and ctr2 for DS1 and DS2, respectively, are derived and shown in Table VI. The required address and control signals can be easily generated by a 2-bit counter and a few additional gates.
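The regularity of Table VI can be checked with a short script. The following Python sketch is only illustrative: it transcribes Table VI for N = 16 and tests three observations that can be read off Tables V and VI, namely that RA1/RA2 and WA1/WA2 follow the two least-significant bits of the cycle counter, that WA3/WA4 in a given cycle equals RA3/RA4 four cycles earlier, and that a data-selector with control '1' exchanges the two halves of its 4-sample block. The names `TABLE_VI`, `counter_low_bits`, `previous_stage_cycle`, and `data_selector` are hypothetical and do not appear in the paper.

```python
# Table VI transcribed for N = 16.
# clock cycle -> (RA1/RA2, RA3/RA4, WA1/WA2, WA3/WA4, ctr1, ctr2)
TABLE_VI = {
    1:  (0b00, 0b00, 0b00, 0b11, 0, 0),
    2:  (0b01, 0b01, 0b01, 0b10, 0, 0),
    3:  (0b10, 0b10, 0b10, 0b01, 0, 0),
    4:  (0b11, 0b11, 0b11, 0b00, 0, 0),
    5:  (0b00, 0b00, 0b00, 0b00, 0, 0),
    6:  (0b01, 0b01, 0b01, 0b01, 1, 0),
    7:  (0b10, 0b10, 0b10, 0b10, 0, 0),
    8:  (0b11, 0b11, 0b11, 0b11, 1, 0),
    9:  (0b00, 0b01, 0b00, 0b00, 0, 0),
    10: (0b01, 0b00, 0b01, 0b01, 0, 1),
    11: (0b10, 0b11, 0b10, 0b10, 1, 0),
    12: (0b11, 0b10, 0b11, 0b11, 1, 1),
    13: (0b00, 0b11, 0b00, 0b01, 0, 0),
    14: (0b01, 0b10, 0b01, 0b00, 0, 0),
    15: (0b10, 0b01, 0b10, 0b11, 0, 1),
    16: (0b11, 0b00, 0b11, 0b10, 0, 1),
}


def counter_low_bits(cc):
    """Two LSBs of a counter that holds 0 in clock cycle 1 (assumed convention)."""
    return (cc - 1) & 0b11


def previous_stage_cycle(cc, period=16, stage_len=4):
    """Clock cycle that lies one BF stage (four cycles) earlier, with wrap-around."""
    return (cc - 1 - stage_len) % period + 1


def data_selector(block, ctrl):
    """Pass a 4-sample block unchanged (ctrl = 0) or swap its halves (ctrl = 1)."""
    a, b, c, d = block
    return (c, d, a, b) if ctrl else (a, b, c, d)
```

For example, `data_selector(("v2", "v6", "v10", "v14"), TABLE_VI[6][4])` reproduces the DS1 output block listed for CC1.6 in Table V.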

IV. PROPOSED STRUCTURE

The proposed structure for N-point in-place DIT RFFT computation is shown in Fig.9. It consists of one arithmetic unit (AU), one data storage-unit (DSU), one twiddle-factor storage unit (TFSU), and one control-unit (CU). During every cycle, the AU receives a block of 4 samples from the DSU and a pair of twiddle factors {cl, sl} from the TFSU. During the

TABLE VII
SELECT SIGNAL STATUS FOR 16-POINT RFFT

Select signal   clock cycles 1-4   clock cycles 5-8   clock cycles 9-12   clock cycles 13-16
sel1            {1, 1, 1, 1}       {0, 0, 0, 0}       {0, 0, 0, 0}        {0, 0, 0, 0}
sel2            {1, 1, 1, 1}       {1, 1, 1, 1}       {1, 0, 1, 0}        {1, 0, 0, 0}
sel3            {0, 0, 0, 0}       {0, 0, 0, 0}       {0, 0, 0, 0}        {1, 1, 1, 1}

When sel1/sel2 = '1', then I1 → O2 and I2 → O1 in LC1 and LC2 of the AU (Fig.10), respectively. In the case of the MUX-array, when sel3 = '1', then {x1 → O1, x2 → O2, x3 → O3, x4 → O4}.

TABLE VIII
LUT ADDRESSES FOR TWIDDLE FACTOR

cc        1     2     3     4     5     6     7     8
Address   000   000   000   000   001   001   001   001

cc        9     10    11    12    13    14    15    16
Address   010   011   010   011   100   101   110   111

[Figure: block diagram with the arithmetic unit, data storage unit, twiddle-factor storage unit, and control unit; input vector in, output vector out.]

Fig. 9. Proposed structure for in-place computation of DIT RFFT.

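[Figure: AU datapath with line-changers LC1 and LC2, adders/subtractors, and a MUX array; inputs x1 to x4 and outputs y1 to y4, data to and from the storage unit, twiddle factors {ck, sk} from the TFSU, and select signals sel1, sel2, sel3.]

Fig. 10. Structure of arithmetic unit (AU).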

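[Figure: address generator driven by counter bits q0 to q3 and CLK, feeding the address port of a ROM whose read-data output is {cl, sl}. ROM words as listed in the figure: {c0, s0}, {c0, s0}, {c0, s0}, {c2, s2}, {c0, s0}, {c1, s1}, {c2, s2}, {c3, s3}.]

Fig. 11. Internal structure of twiddle-factor storage unit (TFSU).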

first 12 clock cycles of a period of 16 clock cycles, it performs a 4-point BF operation in every clock cycle to produce a 4-point intermediate-output data block, which is written back into the DSU. During the last 4 clock cycles of a period of 16 clock cycles, the FFT coefficients are delivered as the output of the AU. The structure and function of the DSU are described in Section III-B (Fig.7). The internal structure of the AU is shown in Fig.10. The AU uses two line-changers (LC1 and LC2) to steer the BF outputs along the desired path according to the MFG shown in Fig.4. Both LC1 and LC2 move the input data from ports I1 and I2 to ports O1 and O2, respectively, when the control signal is ‘0’; otherwise they steer the input data from ports I1 and I2 to ports O2 and O1, respectively. Both LC1 and LC2 are implemented by a pair of 2-to-1 line MUXes. The MUX array receives the samples {x(n)} from the input port as well as from the AU. It uses signal sel3 to select the input from the AU during the first 12 clock cycles and from the input-port during the last 4 clock cycles of each set of 16 clock cycles. The select control for the 16 clock cycles of the 16-point RFFT computation is given in Table VII.
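The steering behaviour described above can be modelled in a few lines. The following Python sketch is a behavioural illustration only, not the authors' VHDL: `line_changer` and `mux_array` are hypothetical names, and the three lists transcribe the select-signal pattern of Table VII for N = 16, one entry per clock cycle.

```python
def line_changer(i1, i2, sel):
    """Return (O1, O2): straight-through when sel = 0, crossed when sel = 1."""
    return (i2, i1) if sel else (i1, i2)


def mux_array(au_outputs, input_samples, sel3):
    """Forward new input samples when sel3 = 1, otherwise the AU outputs."""
    return tuple(input_samples) if sel3 else tuple(au_outputs)


# Select signals of Table VII, indexed by (clock cycle - 1).
SEL1 = [1, 1, 1, 1,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0]
SEL2 = [1, 1, 1, 1,  1, 1, 1, 1,  1, 0, 1, 0,  1, 0, 0, 0]
SEL3 = [0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  1, 1, 1, 1]

# Clock cycles in which the MUX array takes samples from the input port.
INPUT_CYCLES = [cc for cc in range(1, 17) if SEL3[cc - 1]]
```

With this pattern, `INPUT_CYCLES` covers only the last four cycles of the 16-cycle period, which agrees with the statement that new samples enter while the final-stage outputs are delivered.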

The twiddle factors of the 16-point RFFT are stored in a ROM look-up-table (LUT) of 8 words, as shown in Fig.11. It has an address generation unit to produce appropriate addresses


TABLE V
INPUT-OUTPUT DATA-FLOW OF DATA-SELECTORS

clock cycle no.   DS1 Input-block                DS1 Output-block               DS2 Input-block                DS2 Output-block
CC1.1             {u(0), u(8), u(4), u(12)}      {u(0), u(8), u(4), u(12)}      {x(0), x(12), x(8), x(4)}      {x(0), x(12), x(8), x(4)}
CC1.2             {u(2), u(10), u(6), u(14)}     {u(2), u(10), u(6), u(14)}     {x(2), x(14), x(10), x(6)}     {x(2), x(14), x(10), x(6)}
CC1.3             {u(1), u(9), u(5), u(13)}      {u(1), u(9), u(5), u(13)}      {x(1), x(13), x(9), x(5)}      {x(1), x(13), x(9), x(5)}
CC1.4             {u(3), u(11), u(7), u(15)}     {u(3), u(11), u(7), u(15)}     {x(3), x(15), x(11), x(7)}     {x(3), x(15), x(11), x(7)}
CC1.5             {v(0), v(4), v(8), v(12)}      {v(0), v(4), v(8), v(12)}      {u(0), u(8), u(4), u(12)}      {u(0), u(8), u(4), u(12)}
CC1.6             {v(2), v(6), v(10), v(14)}     {v(10), v(14), v(2), v(6)}     {u(2), u(10), u(6), u(14)}     {u(2), u(10), u(6), u(14)}
CC1.7             {v(1), v(5), v(9), v(13)}      {v(1), v(5), v(9), v(13)}      {u(1), u(9), u(5), u(13)}      {u(1), u(9), u(5), u(13)}
CC1.8             {v(3), v(7), v(11), v(15)}     {v(11), v(15), v(3), v(7)}     {u(3), u(11), u(7), u(15)}     {u(3), u(11), u(7), u(15)}
CC1.9             {w(0), w(2), w(4), w(6)}       {w(0), w(2), w(4), w(6)}       {v(0), v(4), v(2), v(6)}       {v(0), v(4), v(2), v(6)}
CC1.10            {w(8), w(12), w(10), w(14)}    {w(8), w(12), w(10), w(14)}    {v(10), v(14), v(8), v(12)}    {v(8), v(12), v(10), v(14)}
CC1.11            {w(1), w(3), w(5), w(7)}       {w(5), w(7), w(1), w(3)}       {v(1), v(5), v(3), v(7)}       {v(1), v(5), v(3), v(7)}
CC1.12            {w(9), w(13), w(11), w(15)}    {w(11), w(15), w(9), w(13)}    {v(11), v(15), v(9), v(13)}    {v(9), v(13), v(11), v(15)}
CC1.13            {x(16), x(28), x(27), x(23)}   {x(16), x(28), x(27), x(23)}   {w(0), w(2), w(1), w(3)}       {w(0), w(2), w(1), w(3)}
CC1.14            {x(18), x(30), x(25), x(21)}   {x(18), x(30), x(25), x(21)}   {w(8), w(12), w(9), w(13)}     {w(8), w(12), w(9), w(13)}
CC1.15            {x(17), x(29), x(26), x(22)}   {x(17), x(29), x(26), x(22)}   {w(5), w(7), w(4), w(6)}       {w(4), w(6), w(5), w(7)}
CC1.16            {x(19), x(31), x(24), x(20)}   {x(19), x(31), x(24), x(20)}   {w(11), w(15), w(10), w(14)}   {w(10), w(14), w(11), w(15)}

[Figure: 4-bit counter with outputs q3 to q0 and CLK, 2-bit and 1-bit MUXes with stage select, AND cells (AC) and D flip-flops, generating RA/WA (r0, r1, w0, w1) for the register banks and the signals ctr1, ctr2, sel1, sel2, and sel3.]

Fig. 12. Control unit (CU). AC stands for AND cell.

during different clock cycles (shown in Table VIII for the 16-point RFFT). The logic circuit at the input of the 2-bit MUX of the address generation unit produces the upper 2 bits of the address for each BF stage. The four BF stages of the 16-point FFT are encoded by a pair of bits {q2, q3}, which is used as the control of the MUX to select the 2-bit address of a particular BF stage. The MSB of the address word is directly obtained by ANDing {q2, q3}. In general, a ROM LUT consisting of N/2 words is required for an N-point RFFT, where a log2N-to-1 MUX (log2(N/4)-bit) is used to produce the upper log2(N/2) address bits. The logic design of the CU is shown in Fig.12. It produces the control signals for the TFSU, the RA/WA of the RBs, the control signals for DS1 and DS2 of the DSU, and the select signals of the AU to be used for the implementation of different BF stages.
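One way to see how little logic the address generator needs is to reproduce Table VIII from the counter bits. The Python sketch below is a minimal illustration for N = 16 under stated assumptions: the counter holds 0 in clock cycle 1, the stage is selected by {q3, q2}, the MSB is q3 AND q2 as stated in the text, and the per-stage MUX inputs are inferred from Table VIII rather than taken from Fig.11. The MUX output is placed below the MSB here because that placement reproduces the table. The function name `tfsu_address` is hypothetical.

```python
# Table VIII transcribed for N = 16, indexed by (clock cycle - 1).
TABLE_VIII = [
    0b000, 0b000, 0b000, 0b000,
    0b001, 0b001, 0b001, 0b001,
    0b010, 0b011, 0b010, 0b011,
    0b100, 0b101, 0b110, 0b111,
]


def tfsu_address(cc):
    """3-bit ROM address for clock cycle cc (1..16) of a 16-point RFFT."""
    count = cc - 1                      # assumed: 4-bit counter holds 0 in cycle 1
    q0 = count & 1
    q1 = (count >> 1) & 1
    q2 = (count >> 2) & 1
    q3 = (count >> 3) & 1
    stage = (q3 << 1) | q2              # BF stage index 0..3 from {q3, q2}
    msb = q3 & q2                       # set only during the final BF stage
    mux_inputs = [
        0b00,                           # stage 1: a single twiddle word
        0b01,                           # stage 2: a single twiddle word
        0b10 | q0,                      # stage 3: two words, alternating with q0
        (q1 << 1) | q0,                 # stage 4: four words, indexed by q1 q0
    ]
    return (msb << 2) | mux_inputs[stage]
```

Under these assumptions, `tfsu_address(cc)` matches every entry of Table VIII for cc = 1 to 16.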

The AU receives a block of 4 samples from the DSU in every clock cycle to perform the BF operations. The BF outputs are sent back to the DSU except during clock cycles [(N/4)(log2N − 1) + 1] to [(N/4)(log2N)]. During this period, the AU computes the outputs of the final BF stage of the N-point RFFT, and (N/4) 4-point input-blocks corresponding to a new N-point input-vector are loaded into the register banks. Therefore, during clock cycles [(N/4)(log2N − 1) + 1] to [(N/4)(log2N)] of every set of (N/4)(log2N) clock cycles, N FFT coefficients {X(k)} are computed by the AU and N input samples of a new sequence {x(n)} are loaded into the register banks. The FFT computation of an N-point sequence is completed by the proposed structure in [(N/4)(log2N)] clock cycles.
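The cycle counts above are easy to tabulate for the FFT sizes considered in this paper. The following Python sketch simply evaluates the expressions (N/4) log2N and (N/4)(log2N − 1) + 1; the function name `rfft_schedule` is hypothetical.

```python
def rfft_schedule(n):
    """Return (total cycles, first output cycle, last output cycle) for an N-point RFFT.

    The final BF stage, during which outputs are delivered and the next
    input vector is loaded, occupies the last N/4 cycles of the period.
    """
    stages = n.bit_length() - 1          # log2(N) for a power of two
    if n < 4 or (1 << stages) != n:
        raise ValueError("N must be a power of two not smaller than 4")
    cycles_per_stage = n // 4            # one 4-point block per clock cycle
    total = cycles_per_stage * stages
    first_output = cycles_per_stage * (stages - 1) + 1
    return total, first_output, total


for size in (16, 32, 64, 128):
    print(size, rfft_schedule(size))
```

For N = 16 this gives a 16-cycle period with outputs in cycles 13 to 16, which agrees with Table V; for N = 32 the period is 40 cycles with outputs in cycles 33 to 40.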

V. COMPLEXITIES AND PERFORMANCE COMPARISON

The proposed structure consists of one DSU, one AU, one TFSU, and one CU. The DSU consists of four register-banks and two data-selectors, where each register-bank is comprised of (N/4) registers of width w bits, one log2(N/4):(N/4) decoder, and one (N/4):1 line MUX. Each data-selector is comprised of four 2:1 MUXes. The AU involves 4 real multipliers, 6 real adders, 2 LCs, and one MUX-array, where each LC is comprised of two 2:1 MUXes. The MUX-array is comprised of four 2:1 MUXes. The TFSU involves one ROM unit of (N/2) words of 2w-bit width and an address generation unit. The CU and the address generation unit of the TFSU involve only a few gates. The complexities of the CU, the ROM address generation unit, and the decoders are negligible compared to the other components, so we have not included them in the theoretical estimation (Table IX). The duration of the clock period is T = TMA + TFA + 5TMX + TRM, where TMA, TFA, TMX, and TRM are, respectively, the computation time of multiplication followed by addition, the full-adder delay, the MUX delay, and the register-bank access delay. TRM = TAND + TD + TMX log2(N/4), where TD and TAND are, respectively, the flip-flop delay and the AND-gate delay. We have estimated the hardware and time complexities of the proposed structure accordingly, and listed them in Table IX along with those of the structures of [8] and [9]. The structure of [8] uses two memory units (memory-1 and memory-2). Each memory unit is comprised of 4 banks (for the 1 PE case), where each bank


TABLE IX
COMPARISON OF HARDWARE AND TIME COMPLEXITIES

Structures          Multiplier   Adder   REG   RAM words*   MUX/DMUX   Minimum clock period       Computation time (cc)
Jo et al [9]        12           22      0     8N           38         TMA + 2TA + 8TMX + 2TR     (N/4) log4N or (N/4)(log4(N/2) + 1)
Ayinala et al [8]   4            6       0     4N           38         TMA + TA + 8TMX + 2TR      (N/4) log2N
Proposed            4            6       N     0            N + 12     TMA + TFA + 5TMX + TRM     (N/4) log2N

*Single-port w-bit RAM, where w is the wordlength. TMA: computation time of multiplication followed by addition, TA: real adder delay, TFA: full-adder delay, TMX: 2:1 MUX delay, TR: RAM access delay, TRM: register-based memory-bank access delay.

TABLE X
SYNTHESIS RESULTS OF PROPOSED STRUCTURES AND STRUCTURE OF [8] USING TSMC 65NM CMOS STANDARD CELL LIBRARY

                                  Data Storage Unit          Complete Design
Design               N     MCP (ns)   Area (um2)   Power (mW)   Area (um2)   Power (mW)   ADP (um2.us)   EPS (pJ)
Structure of [8]     16    1.66       8926.6       1.8          17933.4      2.7          476.3          4.48
                     32    1.67       15793.6      3.1          24481.4      4.2          1308.3         8.76
                     64    1.7        29430.7      5.6          38530.8      7.5          6288.2         19.02
                     128   1.77       56917.8      10.6         65804.7      13.9         26090.3        43.03
Proposed structure   16    1.1        2257.9       0.9          8028.72      2.2          143.9          2.49
                     32    1.17       4374.0       1.7          10513.44     2.9          492.0          4.29
                     64    1.32       7937.3       2.6          13879.80     3.3          1758.8         6.59
                     128   1.61       15130.4      4.9          21267.35     6.1          6478.9         14.63

Power is estimated at the MCP for both designs; area-delay product (ADP) = area × MCP × (N/4) log2N; energy per sample (EPS) = power × MCP × [(N/4) log2N]/N.

is of depth (N/4) and needs to perform simultaneous read-write operations. To support simultaneous read-write operations, the banks are required to be implemented as dual-port RAM. An N-word dual-port RAM can be implemented using two single-port RAMs of depth N words and width w bits, where w is the word-length of the input/output data. The structure of [8] therefore involves 4N single-port RAM words of w-bit width or 2N dual-port RAM words of 2w-bit width.

As shown in Table IX, the proposed structure involves the same number of multipliers and adders as that of [8], but it involves N registers and (N − 26) extra MUXes against 4N single-port RAM words of the other. The proposed structure has a shorter clock period than that of [8], as TR and TRM are nearly the same for smaller values of N. For higher values of N, TRM may be marginally higher than TR, but the other components of the proposed structure, except the register-bank, have less delay than those of [8]. Consequently, the overall clock period of the proposed structure is expected to be less than or nearly equal to that of [8] for higher values of N. Due to the shorter clock period, the proposed structure has less computation time than the structure of [8], as both structures involve the same number of clock cycles to complete the computation of an N-point RFFT. Compared with the structure of [9], the proposed structure involves 3 times fewer multipliers (4 against 12), about 3.7 times fewer adders (6 against 22), N extra registers and (N − 26) extra MUXes against 8N RAM words, and offers nearly half the throughput. The proposed register-based DSU requires nearly half of the area of the RAM-based storage unit and offers higher area saving for higher values of N. The folded FFT is used when the length of the input sequence is not very large, generally less than 512. Moreover, for small FFT size N, a RAM-based storage unit is more expensive than a register-based storage unit. Therefore, the proposed structure could offer better area-delay efficiency and power efficiency than the existing structures for most practical applications, which involve FFTs of length N = 256 or less.
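The MUX count and clock-period expressions of Table IX can be traced back to the component list given earlier in this section. The Python sketch below does so under one assumption that is not stated in the paper: each (N/4):1 line MUX of a register-bank is counted as (N/4 − 1) two-input MUXes. The function names are hypothetical, and the delay arguments are symbolic inputs, not values from any cell library.

```python
from math import log2


def proposed_mux_count(n):
    """2:1 MUX count of the proposed structure, built up component by component."""
    register_bank_muxes = 4 * (n // 4 - 1)   # four (N/4):1 MUXes (assumed 2:1 tree)
    data_selector_muxes = 2 * 4              # two data-selectors, four 2:1 MUXes each
    line_changer_muxes = 2 * 2               # LC1 and LC2, two 2:1 MUXes each
    mux_array_muxes = 4                      # MUX-array of the AU
    return (register_bank_muxes + data_selector_muxes
            + line_changer_muxes + mux_array_muxes)


def register_bank_access_delay(n, t_and, t_d, t_mx):
    """T_RM = T_AND + T_D + T_MX * log2(N/4)."""
    return t_and + t_d + t_mx * log2(n / 4)


def proposed_clock_period(n, t_ma, t_fa, t_mx, t_and, t_d):
    """T = T_MA + T_FA + 5*T_MX + T_RM."""
    return t_ma + t_fa + 5 * t_mx + register_bank_access_delay(n, t_and, t_d, t_mx)


MUX_COUNT_OF_REF_8 = 38                      # MUX/DMUX entry of [8] in Table IX
extra_muxes_n64 = proposed_mux_count(64) - MUX_COUNT_OF_REF_8
```

With this counting convention the total is N + 12 for every power-of-two N, and the difference from the 38 MUXes of [8] is N − 26, as quoted in the comparison above.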

We have coded the proposed design in VHDL for RFFT of length 16, 32, 64, and 128. We have also coded the structure of [8] for the same FFT sizes using one processing element (2-butterfly), since that is the most efficient amongst the existing structures. We have used the single-port RAM generated by Synopsys DesignWare for the implementation of the data storage unit of [8], whereas delay flip-flops (D-FF) are used for the implementation of the proposed register-based DSU. We have taken 8-bit word-length for the input and the output for all stages of computation. We have synthesized the proposed structure and the structure of [8], including its data storage unit, in Synopsys Design Compiler (DC) and IC Compiler (ICC) using the TSMC 65nm CMOS standard cell library. The area, minimum cycle period (MCP), and power consumption reported by ICC are listed in Table X. Power consumption is estimated at the MCP for both designs. As per the theoretical estimates given in Table IX, the MCP of the proposed structure increases marginally when N increases. As shown in Table X, the proposed register-based DSU involves ∼ 73% less area and consumes ∼ 50% less power than the RAM-based DSU of [8] on average for different FFT sizes. The proposed structure involves ∼ 61% less area and ∼ 40% less power consumption than those of [8] on average for different FFT sizes. As shown in Table X, the proposed structure involves ∼ 70% less area-delay product (ADP) and ∼ 57% less energy per sample (EPS)² than those of [8] on average for different FFT sizes.

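² $\mathrm{ADP} = \mathrm{Area} \times \mathrm{MCP} \times \frac{N}{4}\log_2 N$, $\quad \mathrm{EPS} = \frac{1}{N}\left[\text{Power Consumption} \times \mathrm{MCP} \times \frac{N}{4}\log_2 N\right]$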
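The average savings quoted above follow directly from the complete-design columns of Table X. The Python sketch below transcribes those columns and averages the per-size savings; it also evaluates the ADP and EPS definitions of the footnote for one entry. The dictionary and function names are hypothetical, and values recomputed from the formulas can differ slightly from the tabulated ones because the MCP and power entries in Table X are rounded.

```python
SIZES = [16, 32, 64, 128]

# Complete-design columns of Table X (area in um2, power in mW,
# ADP in um2.us, EPS in pJ), in the order of SIZES.
REF_8 = {
    "area":  [17933.4, 24481.4, 38530.8, 65804.7],
    "power": [2.7, 4.2, 7.5, 13.9],
    "adp":   [476.3, 1308.3, 6288.2, 26090.3],
    "eps":   [4.48, 8.76, 19.02, 43.03],
}
PROPOSED = {
    "area":  [8028.72, 10513.44, 13879.80, 21267.35],
    "power": [2.2, 2.9, 3.3, 6.1],
    "adp":   [143.9, 492.0, 1758.8, 6478.9],
    "eps":   [2.49, 4.29, 6.59, 14.63],
}


def cycles(n):
    """(N/4) * log2(N) clock cycles per N-point RFFT."""
    return (n // 4) * (n.bit_length() - 1)


def adp_um2_us(area_um2, mcp_ns, n):
    """Area-delay product; the division by 1000 converts um2.ns to um2.us."""
    return area_um2 * mcp_ns * cycles(n) / 1000.0


def eps_pj(power_mw, mcp_ns, n):
    """Energy per sample; mW times ns gives pJ."""
    return power_mw * mcp_ns * cycles(n) / n


def average_saving(metric):
    """Mean over SIZES of the fractional reduction relative to [8]."""
    ratios = [p / r for p, r in zip(PROPOSED[metric], REF_8[metric])]
    return sum(1.0 - ratio for ratio in ratios) / len(ratios)


for name in ("area", "power", "adp", "eps"):
    print(f"{name}: {100 * average_saving(name):.1f}% lower on average")
```

Averaging the per-size savings in this way, rather than comparing sums over all sizes, appears to be the convention behind the percentages reported in the text.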


VI. CONCLUSIONS

In this paper, we present an area-efficient and energy-efficient architecture for the radix-2 DIT real-valued FFT. Besides, we have proposed a register-based storage design which involves significantly less storage area than a RAM-based storage unit at the cost of a marginal increase in latency. The address generation for folded in-place DIT RFFT with a register-based storage-unit is challenging, since both read and write are performed in the same clock cycle at multiple different locations. Therefore, we have regularized the flow graph of the RFFT and presented a recursive formulation of the necessary address generation. The proposed structure involves significantly less area-delay product and less energy per sample than the existing folded structures for RFFT implementation.

REFERENCES

[1] H. Sorensen, D. Jones, M. Heideman, and C. Burrus, "Real-valued fast Fourier transform algorithms," IEEE Transactions on Acoust., Speech, Signal Process., vol. 35, no. 6, pp. 849-863, Jun. 1987.

[2] Shen-Fu Hsiao and Wei-Ren Shiue, "Design of Low-Cost and High-Throughput Linear Arrays for DFT Computations: Algorithms, Architectures and Implementations," IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing, vol. 47, no. 11, pp. 1188-1203, Nov. 2000.

[3] M. Garrido, K. K. Parhi, and J. Grajal, "A pipelined FFT architecture for real-valued signals," IEEE Transactions on Circuits and Systems-I, Regular Papers, vol. 56, no. 12, pp. 2634-2643, Dec. 2009.

[4] M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined parallel FFT architectures via folding transformation," IEEE Transactions on Very Large Scale Integration Systems, vol. 20, no. 6, pp. 1068-1081, June 2012.

[5] A. Wang and A. P. Chandrakasan, "Energy-aware architectures for a real valued FFT implementation," in Proc. Int. Symp. Low Power Electron. Design, Aug. 2003, pp. 360-365.

[6] H. Chi and Z. Lai, "A cost-effective memory-based real-valued FFT and Hermitian symmetric IFFT processor for DMT-based wire-line transmission systems," in Proc. IEEE Int. Symp. Circuits Syst., May 2005, vol. 6, pp. 6006-6009.

[7] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE Transactions on Circuits and Systems-II, Analog Digit. Signal Process., vol. 39, no. 5, pp. 312-316, May 1992.

[8] M. Ayinala, Y. Lao, and K. K. Parhi, "An in-place FFT architecture for real-valued signals," IEEE Transaction on Circuits and System-II, Express Briefs, vol. 60, no. 10, pp. 652-656, Oct. 2013.

[9] B. G. Jo and M. H. Sunwoo, "New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy," IEEE Transaction on Circuits and System-I, Regular Papers, vol. 52, no. 5, pp. 911-919, May 2005.

Pramod Kumar Meher (SM’03) received his Ph.D. degree in science in the year 1996 from Sambalpur University (India), and currently he is a Senior Research Scientist with Nanyang Technological University, Singapore. He has contributed more than 200 technical papers listed in http://www.ntu.edu.sg/home/ASPKMeher/List.pdf. Dr. Meher has served as a speaker for the Distinguished Lecturer Program of IEEE Circuits and Systems Society during 2011 and 2012. He has also served as Associate Editor of IEEE Transactions on Circuits Systems-II: Express Briefs, IEEE Transactions on Circuits & Systems-I: Regular Papers, and IEEE Transactions on VLSI Systems, during 2008-2011, 2012-2013, and 2009-2014, respectively.

Basant Kumar Mohanty (M’06, SM’11) received the Ph.D. degree in 2000 and is currently a full professor at Jaypee University of Engineering and Technology, Guna, Madhya Pradesh. His research interests include the design and implementation of low-power and high-performance systems for digital signal processing applications, secured communication, and reconfigurable architectures. He has published nearly 60 technical papers. Dr. Mohanty is serving as Associate Editor for the Journal of Circuits, Systems, and Signal Processing.

Sujit Kumar Patel received the B.E. degree in electronics and communication engineering from Jabalpur Engineering College, Jabalpur, M.P., India and the M.Tech. degree from the DA-IICT, Gandhinagar, Gujarat, India in 2006 and 2009, respectively. Currently he is an Assistant Professor with the ECE department, Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, India, where he is pursuing his Ph.D. degree. His research interests are in algorithms and concurrent architectures for adaptive filters.

Soumya Ganguly received the B.E. (Hons.) degree in Electrical and Electronics Engineering from Birla Institute of Technology and Science, Hyderabad Campus, India, in 2014. His research interests include VLSI Architectures for Signal Processing applications, and Computer Arithmetic. Mr. Ganguly has received the Silver Leaf Certificate at the 2012 IEEE Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics, and the VLSI Society of India Fellowship in 2014.

Thambipillai Srikanthan (SM’92) is a full Professor in the School of Computer Engineering in Nanyang Technological University, Singapore. He is the Chair of the same school since February 2010. He is the Director of a 100-strong Centre for High Performance Embedded Systems (CHiPES), which he founded in 1998. Previously he was also the Director of the Intelligent Devices and Systems (IDeAS) Cluster at NTU. His research interests include design methodologies for high-productivity embedded systems, architectural translations of compute-intensive algorithms, computer arithmetic and high-speed techniques for vision-enabled systems. He has published more than 400 technical papers including 100 papers in various IEEE Transactions, IEE proceedings and other reputed international journals.
