Introduction to FFT Processors
Chih-Wei Liu
VLSI Signal Processing Lab
Department of Electronics Engineering
National Chiao-Tung University
FFT Design
FFT
• Consists of a series of complex additions and complex multiplications
Algorithm
• Cooley-Tukey decomposition for power-of-two length FFT
Architecture
• Systematic mapping procedure
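Algorithm Level
Cooley-Tukey decomposition
Radix-2, decimation-in-frequency

$$A_{2k} = \sum_{n=0}^{N-1} x_n W_N^{2nk} = \sum_{n=0}^{N/2-1} (x_n + x_{n+N/2})\, W_{N/2}^{nk}$$

$$A_{2k+1} = \sum_{n=0}^{N-1} x_n W_N^{n(2k+1)} = \sum_{n=0}^{N/2-1} (x_n - x_{n+N/2})\, W_N^{n}\, W_{N/2}^{nk}$$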
Variants based on CT algorithm
Fixed radix: Radix-2, Radix-4, Radix-8, Radix-2^2
Mixed radix: Split-radix, Radix-2/8, Radix-2/4/8
Number of additions
• Same for any mixed-radix or fixed-radix algorithm.
Number of multiplications
• Depends on the reduction of trivial multiplications.
[Figure: radix-2 DIF butterfly. Inputs x_n and x_{n+N/2}; outputs A_{2k} (sum) and A_{2k+1} (difference, through the -1 branch and the twiddle factor W_N^n).]
Hence, increase additions
FFT Algorithms
Review of Radix-2^r algorithms
• DIF (decimation in frequency) and DIT (decimation in time) versions
• Radix-2 algorithm
• Radix-4 and Radix-2^2 algorithms
• Radix-8 and Radix-2^3 algorithms
• Split-radix 2/4 and Split-radix 2/8
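FFT Algorithms: DFT

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N} = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1.$$

[Figure: the twiddle factors W_N^0, W_N^{N/8}, W_N^{N/4}, W_N^{3N/8}, W_N^{N/2}, W_N^{5N/8}, W_N^{3N/4}, W_N^{7N/8} on the unit circle.]

$$W_N^{0} = -W_N^{N/2} = 1$$
$$W_N^{N/4} = -W_N^{3N/4} = -j$$
$$W_N^{N/8} = -W_N^{5N/8} = \frac{\sqrt{2}}{2}(1 - j)$$
$$W_N^{3N/8} = -W_N^{7N/8} = -\frac{\sqrt{2}}{2}(1 + j)$$

$$(a + jb)\, W_8^{1} = \frac{\sqrt{2}}{2}\,[(a + b) + j(b - a)]$$
$$(a + jb)\, W_8^{3} = \frac{\sqrt{2}}{2}\,[(b - a) - j(a + b)]$$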
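The DFT definition and the W_8 identities can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names `twiddle`, `dft`, `mul_w8_1`, and `mul_w8_3` are hypothetical and chosen only for illustration.

```python
import cmath
import math

def twiddle(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

def mul_w8_1(z):
    """(a + jb) * W_8^1 with two real additions and two scalings by sqrt(2)/2."""
    a, b = z.real, z.imag
    s = math.sqrt(2) / 2
    return complex(s * (a + b), s * (b - a))

def mul_w8_3(z):
    """(a + jb) * W_8^3 with two real additions and two scalings by sqrt(2)/2."""
    a, b = z.real, z.imag
    s = math.sqrt(2) / 2
    return complex(s * (b - a), -s * (a + b))
```

Comparing `mul_w8_1(z)` with `z * twiddle(8, 1)` for any complex `z` confirms that a multiplication by W_8^1 needs no general complex multiplier, which is why such factors are counted separately as constant multiplications.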
FFT Algorithms: Radix-2 Algorithm
DIF Radix-2 Algorithm

$$X(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(n + N/2)]\, W_{N/2}^{nk}$$
$$X(2k+1) = \sum_{n=0}^{N/2-1} [x(n) - x(n + N/2)]\, W_N^{n}\, W_{N/2}^{nk}$$

$$k = 0, 1, \ldots, N/2 - 1.$$

[Figure: Butterfly of Radix-2 Algorithm, DIF form.]
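The radix-2 DIF decomposition maps directly onto a recursive routine: the sums feed an N/2-point FFT that produces the even outputs, and the twiddled differences feed another that produces the odd outputs. The sketch below is an illustration only; the name `fft_radix2_dif` is hypothetical, and the routine returns the outputs in natural order rather than the bit-reversed order of an in-place implementation.

```python
import cmath

def fft_radix2_dif(x):
    """Recursive radix-2 DIF FFT. len(x) must be a power of two.

    Returns X(0), X(1), ..., X(N-1) in natural order.
    """
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Butterfly: sums go to the even outputs X(2k).
    even_in = [x[n] + x[n + half] for n in range(half)]
    # Differences times W_N^n go to the odd outputs X(2k+1).
    odd_in = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
              for n in range(half)]
    X = [0j] * N
    X[0::2] = fft_radix2_dif(even_in)
    X[1::2] = fft_radix2_dif(odd_in)
    return X
```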
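FFT Algorithms: Radix-4 Algorithm

$$X(4k_1 + l) = \sum_{n=0}^{N/4-1} \left[ x(n) + x(n + \tfrac{N}{4})\, W_4^{l} + x(n + \tfrac{N}{2})\, W_4^{2l} + x(n + \tfrac{3N}{4})\, W_4^{3l} \right] W_N^{nl}\, W_{N/4}^{nk_1}$$

$$l = 0, 1, 2, 3; \qquad k_1 = 0 \sim N/4 - 1.$$

Radix-2^2 Algorithm

$$X(4k_1 + 2l_2 + l_1) = \sum_{n=0}^{N/4-1} \left[ x(n) + x(n + \tfrac{N}{4})\, W_4^{(l_1 + 2l_2)} + x(n + \tfrac{N}{2})\, W_4^{(2l_1 + 4l_2)} + x(n + \tfrac{3N}{4})\, W_4^{(3l_1 + 6l_2)} \right] W_N^{n(l_1 + 2l_2)}\, W_{N/4}^{nk_1}$$

$$= \sum_{n=0}^{N/4-1} \left\{ [x(n) + (-1)^{l_1} x(n + \tfrac{N}{2})] + (-j)^{(l_1 + 2l_2)} [x(n + \tfrac{N}{4}) + (-1)^{l_1} x(n + \tfrac{3N}{4})] \right\} W_N^{n(l_1 + 2l_2)}\, W_{N/4}^{nk_1}$$

$$l_1, l_2 = 0, 1; \qquad k_1 = 0 \sim N/4 - 1.$$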
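The radix-2^2 form is the radix-4 bracket regrouped into two cascaded radix-2 butterflies with a trivial -j rotation between them. The following Python sketch, with hypothetical function names and the output index written as l = l_1 + 2 l_2, illustrates that both forms give the same bracket value before the twiddle multiplication.

```python
def radix4_butterfly(x0, x1, x2, x3, l):
    """Bracket of the radix-4 sum for output 4k+l.

    x0..x3 stand for x(n), x(n+N/4), x(n+N/2), x(n+3N/4).
    """
    w = (-1j) ** l  # W_4^l
    return x0 + w * x1 + w ** 2 * x2 + w ** 3 * x3

def radix22_butterfly(x0, x1, x2, x3, l1, l2):
    """Same bracket as two radix-2 stages, with l = l1 + 2*l2."""
    s = (-1) ** l1
    a = x0 + s * x2  # first stage: x(n) +/- x(n+N/2)
    b = x1 + s * x3  # first stage: x(n+N/4) +/- x(n+3N/4)
    # Second stage: (-j)^(l1 + 2*l2) = (-j)^l1 * (-1)^l2.
    return a + (-1j) ** l1 * (-1) ** l2 * b
```

The only multiplications in `radix22_butterfly` are by ±1 and ±j, so the structure keeps the radix-2 butterfly while sharing the nontrivial twiddle pattern of radix-4.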
FFT Algorithms: Butterfly of Radix-4 Algorithm
(Data Ordering: Digit Reversed)
[Figure: radix-4 butterfly. Inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4); outputs a(n), a(n+N/4), a(n+N/2), a(n+3N/4) for l = 0, 1, 2, 3, weighted by the twiddle factors W_N^{0n}, W_N^{n}, W_N^{2n}, W_N^{3n}.]
FFT Algorithms: Data Ordering of Radix-4 (N=16)
Digit-reversed ordering
[Figure: the input x(k_1, k_0) maps to the output X(k_0, k_1), where k_1 and k_0 are base-4 digits (0, 1, 2, 3). The outputs appear in the order X(0), X(4), X(8), X(12), X(1), X(5), X(9), X(13), X(2), X(6), X(10), X(14), X(3), X(7), X(11), X(15).]
FFT Algorithms: Butterfly of Radix-2^2 Algorithm
(Data Ordering: Bit Reversed)
[Figure: radix-2^2 butterfly. Inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4); a first radix-2 stage selected by l_1 = 0, 1, a trivial W_4^1 multiplication, and a second radix-2 stage selected by l_2 = 0, 1; outputs a(n), a(n+N/4), a(n+N/2), a(n+3N/4), weighted by the twiddle factors W_N^{0n}, W_N^{2n}, W_N^{n}, W_N^{3n}.]
FFT Algorithms: Data Ordering of Radix-2^2 (N=16)
Bit-reversed ordering
[Figure: the input x(k_3 k_2 k_1 k_0) maps to the output X(k_0 k_1 k_2 k_3), with bit weights 8, 4, 2, 1. The outputs appear in the order X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15).]
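Both output orderings follow from one rule: write the index in base r and reverse its digits. A short Python sketch, with the hypothetical helper name `digit_reverse`, reproduces the two N=16 orderings listed for radix-4 (digit reversed) and radix-2^2 (bit reversed).

```python
def digit_reverse(i, radix, ndigits):
    """Reverse the base-`radix` digits of i, using `ndigits` digits."""
    r = 0
    for _ in range(ndigits):
        r = r * radix + i % radix  # append the current lowest digit
        i //= radix
    return r

# Radix-4, N = 16: two base-4 digits (k1, k0) -> (k0, k1).
radix4_order = [digit_reverse(i, 4, 2) for i in range(16)]
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]

# Radix-2^2, N = 16: four bits (k3 k2 k1 k0) -> (k0 k1 k2 k3).
radix2_order = [digit_reverse(i, 2, 4) for i in range(16)]
# [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
```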
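FFT Algorithms: DIF Radix-8 Algorithm

$$X(8k + l) = \sum_{n=0}^{N-1} x(n)\, W_N^{n(8k+l)}$$

$$= \sum_{n=0}^{N/8-1} \sum_{m=0}^{7} x(n + \tfrac{mN}{8})\, W_N^{(n + mN/8)(8k+l)}$$

$$= \sum_{n=0}^{N/8-1} \left[ \sum_{m=0}^{7} x(n + \tfrac{mN}{8})\, W_8^{lm} \right] W_N^{nl}\, W_{N/8}^{nk}$$

$$= \sum_{n=0}^{N/8-1} \left\{ [x(n) + x(n + \tfrac{2N}{8})\, W_4^{l} + x(n + \tfrac{4N}{8})\, W_4^{2l} + x(n + \tfrac{6N}{8})\, W_4^{3l}] + W_8^{l}\, [x(n + \tfrac{N}{8}) + x(n + \tfrac{3N}{8})\, W_4^{l} + x(n + \tfrac{5N}{8})\, W_4^{2l} + x(n + \tfrac{7N}{8})\, W_4^{3l}] \right\} W_N^{nl}\, W_{N/8}^{nk}$$

$$l = 0, 1, 2, 3, 4, 5, 6, 7; \qquad k = 0 \sim N/8 - 1.$$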
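FFT Algorithms: DIF Radix-2^3 Algorithm

$$X(8k + 4l_3 + 2l_2 + l_1) = \sum_{n=0}^{N/8-1} \left\{ [x(n) + x(n + \tfrac{2N}{8})\, W_4^{l} + x(n + \tfrac{4N}{8})\, W_4^{2l} + x(n + \tfrac{6N}{8})\, W_4^{3l}] + W_8^{l}\, [x(n + \tfrac{N}{8}) + x(n + \tfrac{3N}{8})\, W_4^{l} + x(n + \tfrac{5N}{8})\, W_4^{2l} + x(n + \tfrac{7N}{8})\, W_4^{3l}] \right\} W_N^{nl}\, W_{N/8}^{nk}, \qquad l = l_1 + 2l_2 + 4l_3$$

$$= \sum_{n=0}^{N/8-1} \left\{ \left[ (x(n) + W_2^{l_1} x(n + \tfrac{4N}{8})) + W_4^{(l_1 + 2l_2)} (x(n + \tfrac{2N}{8}) + W_2^{l_1} x(n + \tfrac{6N}{8})) \right] + W_8^{(l_1 + 2l_2 + 4l_3)} \left[ (x(n + \tfrac{N}{8}) + W_2^{l_1} x(n + \tfrac{5N}{8})) + W_4^{(l_1 + 2l_2)} (x(n + \tfrac{3N}{8}) + W_2^{l_1} x(n + \tfrac{7N}{8})) \right] \right\} W_N^{n(l_1 + 2l_2 + 4l_3)}\, W_{N/8}^{nk}$$

$$l_1, l_2, l_3 = 0, 1; \qquad k = 0 \sim N/8 - 1.$$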
FFT Algorithms: Butterfly of Radix-8 Algorithm
[Figure: radix-8 butterfly. Inputs x(n), x(n+N/8), x(n+2N/8), ..., x(n+7N/8); outputs for l = 0, 1, ..., 7, weighted by the twiddle factors W_N^{0n}, W_N^{1n}, ..., W_N^{7n}.]
FFT Algorithms: Butterfly of Radix-2^3 Algorithm
[Figure: radix-2^3 butterfly. Inputs x(n), x(n+N/8), x(n+2N/8), ..., x(n+7N/8); three radix-2 stages selected by l_1, l_2, l_3 = 0, 1, with the constant multiplications W_4^1, W_8^0, W_8^1, W_8^2, W_8^3 between stages; outputs weighted by the twiddle factors W_N^{0n}, W_N^{4n}, W_N^{2n}, W_N^{6n}, W_N^{1n}, W_N^{5n}, W_N^{3n}, W_N^{7n}.]
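FFT Algorithms: DIF Split-Radix 2/4 Algorithm

$$X(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(n + \tfrac{2N}{4})]\, W_{N/2}^{nk}$$

$$X(4k+1) = \sum_{n=0}^{N/4-1} \left\{ [x(n) - x(n + \tfrac{2N}{4})] - j\,[x(n + \tfrac{N}{4}) - x(n + \tfrac{3N}{4})] \right\} W_N^{n}\, W_{N/4}^{nk}$$

$$X(4k+3) = \sum_{n=0}^{N/4-1} \left\{ [x(n) - x(n + \tfrac{2N}{4})] + j\,[x(n + \tfrac{N}{4}) - x(n + \tfrac{3N}{4})] \right\} W_N^{3n}\, W_{N/4}^{nk}$$

In X(2k), k runs from 0 to N/2-1; in X(4k+1) and X(4k+3), k runs from 0 to N/4-1.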
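The three split-radix 2/4 sums translate into a recursion with one N/2-point and two N/4-point sub-transforms, which is the source of the L-shaped butterfly. The Python sketch below is illustrative only; `fft_split_radix_dif` is a hypothetical name, and the result is returned in natural order for easy checking.

```python
import cmath

def fft_split_radix_dif(x):
    """Recursive DIF split-radix 2/4 FFT. len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    h, q = N // 2, N // 4
    # Even outputs X(2k): radix-2 step.
    even_in = [x[n] + x[n + h] for n in range(h)]
    # Odd outputs X(4k+1) and X(4k+3): radix-4 step.
    in1, in3 = [], []
    for n in range(q):
        d0 = x[n] - x[n + h]
        d1 = x[n + q] - x[n + 3 * q]
        w1 = cmath.exp(-2j * cmath.pi * n / N)      # W_N^n
        w3 = cmath.exp(-2j * cmath.pi * 3 * n / N)  # W_N^(3n)
        in1.append((d0 - 1j * d1) * w1)
        in3.append((d0 + 1j * d1) * w3)
    X = [0j] * N
    X[0::2] = fft_split_radix_dif(even_in)
    X[1::4] = fft_split_radix_dif(in1)
    X[3::4] = fft_split_radix_dif(in3)
    return X
```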
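FFT Algorithms: Butterfly of Split-Radix 2/4 Algorithm
[Figure: split-radix 2/4 butterfly. Inputs x(n), x(n+N/4), x(n+2N/4), x(n+3N/4); a trivial W_4^1 multiplication followed by the twiddle factors W_N^{n} and W_N^{3n}.]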
FFT Algorithms: Advantages of Radix-2/4 Algorithm
• Low computational complexity
• As flexible as the radix-2 algorithm
• Bit-reversed output (for normally ordered input)
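FFT Algorithms: DIF Split-Radix 2/8 Algorithm

$$X(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(n + \tfrac{2N}{4})]\, W_{N/2}^{nk}$$

$$X(8k + l) = \sum_{n=0}^{N/8-1} \left\{ [x(n) + x(n + \tfrac{2N}{8})\, W_4^{l} + x(n + \tfrac{4N}{8})\, W_4^{2l} + x(n + \tfrac{6N}{8})\, W_4^{3l}] + W_8^{l}\, [x(n + \tfrac{N}{8}) + x(n + \tfrac{3N}{8})\, W_4^{l} + x(n + \tfrac{5N}{8})\, W_4^{2l} + x(n + \tfrac{7N}{8})\, W_4^{3l}] \right\} W_N^{nl}\, W_{N/8}^{nk}$$

$$l = 1, 3, 5, 7.$$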
FFT Algorithms: Butterfly of Split-Radix 2/8 Algorithm
[Figure: split-radix 2/8 butterfly. Inputs x(n), x(n+N/8), x(n+2N/8), ..., x(n+7N/8); constant multiplications -j, -j, W_8^1, W_8^3.]
Multiplicative Complexity
Trivial multiplications in FFT
Multiplied by
• Radix-2: ±1 removed
• Radix-4: ±1 and ±j (partially) removed
• Split-radix (2/4): ±1 and ±j removed
• Radix-8: ±1, ±j, (±1±j)/√2 (partially) removed
• Radix-2/8: ±1, ±j, (±1±j)/√2 removed
[Figure: Radix-4 Signal Flow Graph]
[Figure: Split-Radix Signal Flow Graph]
Multiplicative Complexity

N     Radix-2  Radix-4  Split-Radix  Radix-8  Const. Mul  Radix-2/8  Const. Mul
8     2        3        2            0        2           0          2
16    10       8        8            6        4           4          6
32    34       31       26           20       8           16         14
64    98       76       72           48       32          44         38
128   258      215      186          152      64          120        94
256   642      492      456          376      128         308        214
512   1538     1239     1082         824      384         736        494
1024  3586     2732     2504         2104     768         1724       1126
2048  8194     6487     5690         4792     1536        3976       2494
4096  18434    13996    12744        10168    4096        8964       5494
8192  40962    32087    28218        23992    8192        19952      12046
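The Radix-2 and Split-Radix columns of the table can be reproduced under one assumption: multiplications by ±1 and ±j are free, and every other twiddle factor costs one complex multiplication. The Python sketch below uses hypothetical function names; the radix-2 count enumerates the twiddles stage by stage, and the split-radix count uses the recurrence implied by the 2/4 decomposition (one N/2-point and two N/4-point sub-transforms plus N/2 - 2 nontrivial twiddles).

```python
def radix2_nontrivial_mults(N):
    """Count twiddles in a radix-2 DIF FFT that are not +-1 or +-j."""
    count = 0
    size = N  # current sub-transform length
    while size >= 2:
        blocks = N // size  # number of sub-transforms of this length
        for n in range(size // 2):
            # W_size^n is trivial when n is a multiple of size/4.
            if (4 * n) % size != 0:
                count += blocks
        size //= 2
    return count

def split_radix_nontrivial_mults(N):
    """Same count for split-radix 2/4, via M(N) = M(N/2) + 2 M(N/4) + N/2 - 2."""
    if N <= 4:
        return 0
    return (split_radix_nontrivial_mults(N // 2)
            + 2 * split_radix_nontrivial_mults(N // 4)
            + N // 2 - 2)
```

For example, `radix2_nontrivial_mults(64)` gives 98 and `split_radix_nontrivial_mults(64)` gives 72, matching the N=64 row.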
How to obtain regular SR FFT architecture?
Architecture Level
Mapping procedure
Systolic array techniques
• Operation scheduling, resource sharing
Pipeline architecture
• One-dimensional linear array
• Delay-feedback vs. delay-commutator
Single PE architecture
• Shared memory, single processing element (PE)
R2MDC: Radix-2 Multi-Path Delay Commutator
Delay Commutator or Delay-Switch-Delay
[Figure: the 16-point radix-2 DIF signal flow graph (inputs x(0) to x(15); outputs in bit-reversed order X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15); twiddle factors W_16^0 to W_16^7 in the first stage, W_16^0, W_16^2, W_16^4, W_16^6 in the second, and W_16^0, W_16^4 in the third) is vertically projected onto a four-stage pipeline (Stage 1 to Stage 4) built from the delay elements Z^-8, Z^-4, Z^-4, Z^-2, Z^-2, Z^-1, Z^-1 and the switches SW1, SW2, SW3.]
1st and 2nd stages in R2MDC (N=16)
[Figure: data flow through Stage 1 and Stage 2 of the R2MDC for the input x(n) and the first-stage output x_1(n) (nodes A to I, delays Z^-8 and Z^-4, switch SW1), shown as timing snapshots (1) to (15) of samples 0 to 15 moving through the pipeline over the 16-point signal flow graph. The snapshots are annotated "Input pairs: N/2", "Input pairs: N/4", "Input pairs: N/8", and "Input pairs: N/16" across Stage 1 to Stage 4.]
R4MDC: Radix-4 Multi-Path Delay Commutator
[Figure: the 16-point radix-4 signal flow graph (inputs x(0) to x(15); outputs in digit-reversed order X(0), X(4), X(8), X(12), X(1), X(5), X(9), X(13), X(2), X(6), X(10), X(14), X(3), X(7), X(11), X(15); twiddle exponents 0, 1, 2, 3, 0, 2, 4, 6, 0, 3, 6, 9 between the two stages) mapped onto a two-stage pipeline. Each stage has a commutator with control input, a radix-4 butterfly (BF4), and coefficient inputs; the delay lines have lengths 12, 8, 4 and 3, 2, 1.]
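RrMDC
[Figure: (a) general radix-r multi-path delay commutator: an input stage followed by log_r N - 1 and log_r N - 2 further stages of computational elements with coefficient inputs; (b) the k-th stage: outputs from the previous stage pass through delay lines, a commutator with commutator control, and a computational element with coefficient inputs, and go to the next stage. The delay lengths are expressed in terms of N, r, and k.]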
Delay Feedback
• R2SDF
• R4SDF
• R2^2SDF
R2SDF (N=16): Radix-2 Single-Path Delay Feedback
R2SDF (N=16) vs. R4SDF (N=128)
[Figure: the R4SDF pipeline has four BF4 stages, each with three feedback buffers, of lengths 64, 16, 4, and 1.]
Buffer Styles of Pipeline Architecture
• R2 delay-commutator (R2MDC): inefficient (50%) memory usage.
• R2 delay-feedback (R2SDF): 100% memory usage.
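The buffer cost of the two styles can be tallied with a few lines of Python. This is a rough sketch with hypothetical function names: the R2MDC delay lengths follow the Z^-8, Z^-4, Z^-4, Z^-2, Z^-2, Z^-1, Z^-1 pattern of the N=16 figure, while the R2SDF feedback lengths N/2, N/4, ..., 1 are the usual textbook choice and are an assumption here, since the R2SDF figure is not reproduced.

```python
def r2mdc_delay_words(N):
    """Delay lengths of a radix-2 multi-path delay commutator (N a power of two)."""
    stages = N.bit_length() - 1  # log2(N)
    delays = [N // 2]  # input delay that pairs x(n) with x(n + N/2)
    for s in range(2, stages + 1):
        # Two delays of N/2^s around each commutator switch.
        delays += [N >> s, N >> s]
    return delays

def r2sdf_buffer_words(N):
    """Feedback buffer lengths of a radix-2 single-path delay feedback pipeline."""
    return [N >> s for s in range(1, N.bit_length())]

mdc = r2mdc_delay_words(16)   # [8, 4, 4, 2, 2, 1, 1], 22 words in total
sdf = r2sdf_buffer_words(16)  # [8, 4, 2, 1], 15 words in total
```

Under these assumptions the totals are 3N/2 - 2 words for R2MDC and N - 1 words for R2SDF.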
Single PE Architecture
[Figure: radix-2 shared-memory architecture with a single butterfly processing element (BF_PE) connected to one RAM.]
Concluding Remarks
• The split-radix algorithm has lower computational complexity than the fixed-radix algorithms. However, its butterfly operation is irregular (L-shaped).
• A pipeline architecture processes data faster than a single-PE architecture. However, the single-PE architecture is the most area-efficient, especially for long FFT/IFFT applications.
Review: Traditional FFT Design
Steps
1. Given an N-point FFT specification, choose a fixed-radix algorithm.
2. Design the radix-r butterfly, multiplier, etc.
3. Cascade log_r N stages to compute the N-point FFT.
An arbitrary radix can be used, based on the Cooley-Tukey decomposition for any composite number.
Problem of Traditional Approach
• It cannot derive an architecture for a mixed-radix algorithm.
• Processing speed is no longer the critical issue.
• Chip area and power consumption dominate the design quality.
• A reconfigurable FFT/IFFT architecture is necessary for various applications.
• A length-scalable and latency-specified FFT/IFFT core is necessary.
Proposed Solution
We implement the FFT module with a single-PE architecture.
[Figure: a processing element (radix-r butterfly with input and output registers) connected through a pre-fetch buffer to a multiple-port memory.]
Design Issues
• Sufficient performance, chip area, power consumption.
• Scalable processing element.
• Limited storage block(s).
• Efficient memory address generator.
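Algorithm Level
We adopt the split-radix 2/4 algorithm to realize the FFT module.

$$A_{2k} = \sum_{n=0}^{N/2-1} (X_n + X_{n+N/2})\, W_{N/2}^{kn}$$

$$A_{4k+1} = \sum_{n=0}^{N/4-1} (X_n - jX_{n+N/4} - X_{n+N/2} + jX_{n+3N/4})\, W_N^{n}\, W_{N/4}^{kn}$$

$$A_{4k+3} = \sum_{n=0}^{N/4-1} (X_n + jX_{n+N/4} - X_{n+N/2} - jX_{n+3N/4})\, W_N^{3n}\, W_{N/4}^{kn}$$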
The Kernel of the Processing Element
[Figure: 16-point split-radix signal flow graph between the nodes X_0 to X_15 and A_0 to A_15, with unit and j branch weights and the twiddle factors W_8^0, W_8^1, W_8^3 and W_16^0, W_16^1, W_16^2, W_16^3, W_16^6, W_16^9.]
Folded Butterfly Units
Compared with Radix-2/Radix-2^2, it halves the number of memory accesses.
[Figure: two butterfly units with multiplexers on their inputs and outputs and a feedback path.]
Storage Blocks
• We use multiple single-port memory banks to replace the multi-port memory.
• The concept of conflict-free memory (a vertex coloring problem).
[Figure: 8-point example. The data x(0) to x(7) are the vertices 0 to 7 of a conflict graph and are colored into two banks. Bank0 holds x(0), x(3), x(5), x(6); Bank1 holds x(1), x(2), x(4), x(7).]
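The two-bank example is consistent with a simple coloring rule: put x(i) in the bank given by the parity of the number of 1 bits in i. That rule is an inference from the Bank0/Bank1 listing, not a statement from the slides. The Python sketch below, with hypothetical function names, checks that every operand pair of an 8-point radix-2 FFT then falls into two different banks, so both operands can be read from single-port memories in the same cycle.

```python
def bank_of(i):
    """Two-bank assignment: parity of the number of 1 bits in the index."""
    return bin(i).count("1") % 2

def radix2_pairs(N):
    """All operand pairs (i, i + span) used by an in-place radix-2 FFT."""
    pairs = []
    span = N // 2
    while span >= 1:
        pairs += [(i, i + span) for i in range(N) if not i & span]
        span //= 2
    return pairs

bank0 = [i for i in range(8) if bank_of(i) == 0]  # [0, 3, 5, 6]
bank1 = [i for i in range(8) if bank_of(i) == 1]  # [1, 2, 4, 7]
conflict_free = all(bank_of(a) != bank_of(b) for a, b in radix2_pairs(8))
```

The two operands of a radix-2 butterfly differ in exactly one index bit, so their parities always differ.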
Scalable Memory Address Generator
• A solution to this vertex coloring problem must exist.
• The best solution: the proposed Interleave Rotated Data Allocation (IRDA) algorithm.
[Figure: address generators (AG) for 16-length and 64-length FFTs feed an address switcher (AS) and four address translators (AT), which drive RAM-0 to RAM-3; the data path includes a rotator.]
The IRDA Concept
• Conflict-free memory banks.
• Simple and length-scalable design.
• The circular shift rotator.

RAM-A  RAM-B  RAM-C  RAM-D
00     01     02     03
07     04     05     06
10     11     08     09
13     14     15     12
19     16     17     18
22     23     20     21
25     26     27     24
28     29     30     31
34     35     32     33
37     38     39     36
40     41     42     43
47     44     45     46
49     50     51     48
52     53     54     55
59     56     57     58
62     63     60     61
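The 64-entry allocation table matches a rotation by the digit sum: data index i sits at address i // 4, in the bank given by the sum of its base-4 digits modulo 4. This rule is read off the table and is only an assumption about how IRDA is defined; the function names below are hypothetical. The Python sketch rebuilds the table from that rule.

```python
def bank_and_address(i, radix=4):
    """Bank = (sum of base-`radix` digits of i) mod radix; address = i // radix."""
    digit_sum, t = 0, i
    while t:
        digit_sum += t % radix
        t //= radix
    return digit_sum % radix, i // radix

def allocation_table(N=64, radix=4):
    """rows[address][bank] = data index stored at that location."""
    rows = [[None] * radix for _ in range(N // radix)]
    for i in range(N):
        bank, addr = bank_and_address(i, radix)
        rows[addr][bank] = i
    return rows

table = allocation_table()
# table[1] is [7, 4, 5, 6]: row 1 is the natural order rotated by one bank.
```

Two indices that differ in a single base-4 digit always get different digit sums modulo 4, so the four operands of any radix-4 butterfly land in four different banks.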
Length-Scalable FFT/IFFT Core
[Figure: four memory banks (RAM-A to RAM-D) addressed by an address generator (adder and register); data pass through registers, multiplexers, and a rotator into two radix-2 butterfly processing elements, and return through multiplexers, a rotator, and registers.]
Further Performance Improvement
• Multiple-PE architecture, for example two pipelined PEs.

Group 1
RAM 0  RAM 1  RAM 2  RAM 3
00     02     04     06
14     08     10     12
20     22     16     18
26     28     30     24

Group 2
RAM 4  RAM 5  RAM 6  RAM 7
01     03     05     07
15     09     11     13
21     23     17     19
27     29     31     25
The Cached-FFT Algorithm
Overview
1. Input data are loaded into an N-word main memory.
2. C of the N words are loaded into the cache.
3. As many butterflies as possible are computed using the data in the cache.
4. Processed data in the cache are flushed to main memory.
5. Steps 2-4 are repeated until all N words have been processed once.
6. Steps 2-5 are repeated until the FFT has been completed.
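The six steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the slides' implementation: it uses an in-place radix-2 DIF FFT, assumes log2(N) is a multiple of log2(C), and the name `cached_fft` is hypothetical. With N = 64 and C = 8 it runs E = 2 epochs of three passes each. The output is left in bit-reversed order, as for any in-place DIF FFT.

```python
import cmath

def cached_fft(x, C):
    """Radix-2 DIF FFT organised into epochs that each work on C cached words."""
    N = len(x)
    logN, p = N.bit_length() - 1, C.bit_length() - 1
    assert logN % p == 0, "sketch assumes log2(N) is a multiple of log2(C)"
    mem = list(x)                                  # step 1: load main memory
    for epoch in range(logN // p):                 # step 6: one iteration per epoch
        shift = logN - (epoch + 1) * p             # index bits below this epoch's field
        for g in range(N // C):                    # step 5: cover all N words once
            high, low = g >> shift, g & ((1 << shift) - 1)
            addr = [(high << (shift + p)) | (t << shift) | low for t in range(C)]
            cache = [mem[a] for a in addr]         # step 2: load C words
            for q in range(p):                     # step 3: p butterfly passes
                cspan = C >> (q + 1)               # butterfly span inside the cache
                span = cspan << shift              # same span in main-memory indices
                for t in range(C):
                    if t & cspan:
                        continue                   # t is the lower operand; skip
                    a, b = cache[t], cache[t + cspan]
                    w = cmath.exp(-2j * cmath.pi * (addr[t] % span) / (2 * span))
                    cache[t], cache[t + cspan] = a + b, (a - b) * w
            for a, v in zip(addr, cache):          # step 4: flush to main memory
                mem[a] = v
    return mem
```

Each group of C addresses is closed under the butterflies of its epoch, so main memory is touched only at the load and flush steps; all other traffic stays in the cache.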
[Figure: processor, cache, and main memory.]
[Figure: signal flow graph with bit-reversed inputs x_0, x_8, x_4, x_12, x_2, x_10, x_6, x_14, x_1, x_9, x_5, x_13, x_3, x_11, x_7, x_15, outputs X_0 to X_15, and twiddle factors W.]
N=64, E=2, Radix-2 Cached-FFT