FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
14th September, 2007
Supervisors:
Dr. Kolin Paul
Prof. M. Balakrishnan
Overview
• Objective
– To work out strategies for implementing an efficient FFT kernel on multiprocessors and FPGAs
– To identify the bottlenecks
Previous Work (single processor software implementation)
• Examined 3 FFT algorithms
– Radix-4
– Radix-8
– Radix-16
• Compared them with FFTW
• Analysed these on the following parameters
– Execution time
– Number of complex calculations
– Memory references
• Vectorized the code with gcc
Previous Work : Inference
• For smaller input sizes, cache misses are greatest for radix-16 (misses increase linearly from radix-4 to radix-16)
• For large input sizes (>= 4096), radix-8 has the fewest cache misses
• Because of the object-oriented design, creating Complex objects takes the most clock ticks
• After that, most of the time is spent on complex multiplications, followed by complex additions and complex subtractions
Hardware implementation : performance issues
• Circuit area
• Power consumption
• Speed
Algorithms : Cooley Tukey
• Pros:
– Because the Cooley-Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT
• Cons:
– Much hardware required (16-point FFT: 176 add and 72 multiply operations)
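As a minimal sketch of how Cooley-Tukey breaks a DFT into smaller DFTs, the radix-2 decimation-in-time form below splits the input into even- and odd-indexed halves, transforms each half recursively, and merges them with twiddle factors. The function names `fft_radix2` and `dft` are illustrative and not taken from the project code; the naive `dft` is included only as a reference for checking the result.

```python
import cmath


def dft(a):
    """Naive O(n^2) DFT, used only as a reference for checking."""
    n = len(a)
    return [
        sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(a):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft_radix2(a[0::2])  # DFT of the even-indexed samples
    odd = fft_radix2(a[1::2])   # DFT of the odd-indexed samples
    y = [0j] * n
    for k in range(n // 2):
        # One butterfly: a twiddle multiply, then an add and a subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        y[k] = even[k] + t
        y[k + n // 2] = even[k] - t
    return y
```

Because the two recursive calls are ordinary DFTs, either could be replaced by a different small-DFT routine, which is the property the slide lists as an advantage.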
Algorithms : Winograd
• Pros:
– Designed to minimize the number of multiplies
– Much less hardware required than Cooley-Tukey (16-point FFT: 74 add and 18 multiply operations)
• Cons:
– Highly irregular addressing sequence, which makes it very inefficient to perform with a microprocessor
– Awkward to factor for input sizes greater than 16
Guidelines for a suitable algorithm
• Construct larger FFTs from small 4-, 8-, and 16-point FFT kernels
• These smaller kernels can be Winograd
• The 8-point FFT is a very special case: its only non-trivial twiddle factors are (±1 ± j)/√2, so the multiplication can be completely replaced by addition and bit-shift operations
• The 16-point FFT can itself be decomposed into 4-point FFTs, or into 2- and 8-point FFTs
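The guidelines above can be illustrated with a small sketch that builds a 16-point FFT from a 4-point kernel. The kernel here is a plain multiplier-free 4-point butterfly, not a Winograd kernel, and the 4x4 index mapping is one standard Cooley-Tukey decomposition chosen for illustration. The names `fft4`, `fft16` and `dft` are hypothetical.

```python
import cmath


def dft(a):
    """Naive O(n^2) DFT, used only as a reference for checking."""
    n = len(a)
    return [
        sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft4(a):
    """4-point DFT using only additions, subtractions and a real/imag swap."""
    a = [complex(v) for v in a]
    s02, d02 = a[0] + a[2], a[0] - a[2]
    s13, d13 = a[1] + a[3], a[1] - a[3]
    # Multiplying by -j is a swap of real and imaginary parts with one sign flip.
    neg_j_d13 = complex(d13.imag, -d13.real)
    return [s02 + s13, d02 + neg_j_d13, s02 - s13, d02 - neg_j_d13]


def fft16(a):
    """16-point FFT assembled from eight calls to fft4 plus twiddle multiplies."""
    # Stage 1: four 4-point FFTs over the stride-4 sub-sequences a[n1::4].
    cols = [fft4(a[n1::4]) for n1 in range(4)]
    y = [0j] * 16
    for k1 in range(4):
        # Twiddle factors W16^(n1*k1) sit between the two stages.
        v = [
            cmath.exp(-2j * cmath.pi * n1 * k1 / 16) * cols[n1][k1]
            for n1 in range(4)
        ]
        # Stage 2: one more 4-point FFT yields outputs k1, k1+4, k1+8, k1+12.
        out = fft4(v)
        for k2 in range(4):
            y[k1 + 4 * k2] = out[k2]
    return y
```

In this decomposition all complex multiplies are confined to the twiddle stage between the two layers of kernels, which is where a hardware design would spend its multipliers.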
Multiprocessor FFT : Distributing Butterflies
[Figure: 16-point butterfly network from inputs a[0] to a[15] to outputs y[0] to y[15], with the butterflies assigned to different processors]
Distributing the butterflies on different processors would involve more inter-processor communication (IPC)
Distributing Input Space
[Figure: 16-point butterfly network from inputs a[0] to a[15] to outputs y[0] to y[15], with the input space partitioned across processors]
Distributing the input space on different processors would involve less IPC
Distributing FFTs
Bandwidth Measurement
• Data sent between Abhogi and saveri at 2 pm (avg. 5.4 MBps)
• Data sent between jaunpuri and saveri at 11 pm (avg. 5.6 MBps)
Assumptions
• Let T_N denote the time taken to compute the FFT of input size N
• Let the network bandwidth be B (bytes/sec)
• Let the number of processors be p
• Let the time taken to combine two N-point FFTs be KN (K is a constant, so the combine cost is linear in N)
4 processor model
[Figure: tree diagram of the 4-processor model. The N-point input is split into two (N/2)-point halves and then into four (N/4)-point quarters, with a transfer at each split. Processors 1 to 4 each compute FFT(N/4). The results are transferred back and combined pairwise into two (N/2)-point results, then combined into the final N-point result.]
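The split, FFT and combine flow of the 4-processor model can be sketched sequentially as below. This is a sketch under assumptions: the slides do not state how the input is partitioned, so an even/odd (decimation-in-time) split is assumed, and the mapping of steps to processors in the comments is likewise assumed. Real transfers (for example over MPI) are replaced by plain list slicing, and all function names are hypothetical.

```python
import cmath


def dft(a):
    """Naive O(n^2) DFT, used only as a reference for checking."""
    n = len(a)
    return [
        sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def combine(even_fft, odd_fft):
    """Merge two m-point FFTs (even and odd samples) into one 2m-point FFT."""
    m = len(even_fft)
    n = 2 * m
    y = [0j] * n
    for k in range(m):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd_fft[k]
        y[k] = even_fft[k] + t
        y[k + m] = even_fft[k] - t
    return y


def fft(a):
    """Single-processor radix-2 FFT; len(a) must be a power of two."""
    if len(a) == 1:
        return [complex(a[0])]
    return combine(fft(a[0::2]), fft(a[1::2]))


def four_processor_fft(a):
    """Sequential stand-in for the 4-processor model; len(a) a power of two >= 4."""
    # Split phase (each slice stands in for a transfer).
    half_a, half_b = a[0::2], a[1::2]          # assumed: P1 keeps half_a, sends half_b to P3
    q1, q2 = half_a[0::2], half_a[1::2]        # assumed: P1 keeps q1, sends q2 to P2
    q3, q4 = half_b[0::2], half_b[1::2]        # assumed: P3 keeps q3, sends q4 to P4
    # Compute phase: four independent (N/4)-point FFTs, one per processor.
    f1, f2, f3, f4 = fft(q1), fft(q2), fft(q3), fft(q4)
    # Combine phase: two (N/2)-point merges, then the final N-point merge.
    c_a = combine(f1, f2)
    c_b = combine(f3, f4)
    return combine(c_a, c_b)
```

Each `combine` call touches every point of its inputs once, which is consistent with modelling the combine cost as linear in the input size.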
Pipelined structure
[Figure: timeline for processors P1 to P4. Send/Recv pairs distribute the input partitions, all four processors then compute FFT(N/4) in parallel, and further Send/Recv pairs return the partial results for the Combine steps.]
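Stage times: N/(2B) and N/(4B) to send the partitions, T_{N/4} for the FFT, N/(4B) and N/(2B) to return the results, KN/4 and KN/2 for the two combine levels
The execution time :
2(N/(2B) + N/(4B)) + T_{N/4} + KN/4 + KN/2
= 3N/(2B) + T_{N/4} + 3KN/4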
Generalizing this
• For p processors, the total execution time is :
T_{N/p} + (1 – 1/p)(2N/B + KN)
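The general formula can be evaluated numerically with a short sketch. The function name, the constants `C_FFT` and `K_COMBINE`, and the m·log2(m) model for the single-processor FFT time are all assumptions made for illustration; only the bandwidth figure comes from the measurements reported earlier. The formula divides N by B directly, so N is treated here as the data volume in the same units as B.

```python
import math


def parallel_fft_time(n, p, bandwidth, k, t_fft):
    """Estimated time (s) for an n-point FFT on p processors.

    Evaluates T_{n/p} + (1 - 1/p) * (2n/B + K*n).
    t_fft(m) must return the single-processor time for an m-point FFT.
    """
    return t_fft(n / p) + (1.0 - 1.0 / p) * (2.0 * n / bandwidth + k * n)


# Hypothetical parameters, for illustration only.
C_FFT = 1e-8       # assumed seconds per unit of m*log2(m) FFT work
K_COMBINE = 1e-8   # assumed seconds per point of combine work
BANDWIDTH = 5.4e6  # bytes/s, the lower of the two measured averages


def t_fft_model(m):
    """Assumed single-processor cost model: C_FFT * m * log2(m)."""
    return C_FFT * m * math.log2(m) if m > 1 else 0.0


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        print(p, parallel_fft_time(65536, p, BANDWIDTH, K_COMBINE, t_fft_model))
```

With p = 1 the communication and combine terms vanish and the estimate reduces to the single-processor time. As p grows, T_{N/p} shrinks while the factor (1 – 1/p) approaches 1, so the transfer and combine terms set a floor on the achievable time.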
Plot (with real values)
[Figure: time (s) vs. number of processors (p = 1 to 10) for input size 65536; time axis from 0.0185 to 0.0225 s]
[Figure: time (s) vs. number of processors (p = 1 to 10) for input sizes 65536 and 262144; time axis from 0 to 0.18 s]
[Figure: time (s) vs. number of processors (p = 1 to 10) for input sizes 65536, 262144, 1048576, 4194304, 16777216, 67108864 and 268435456; time axis from 0 to 450 s]
Further Work
• Multiprocessor Implementation
– Implement the above model and validate it
• Hardware Implementation
– Pipelining
– Best utilization of the FPGA resources
Thank You