FFT Accelerator Project


Rohit Prakash (2003CS10186)

Anand Silodia (2003CS50210)

14th September, 2007

Supervisors :

Dr. Kolin Paul

Prof. M. Balakrishnan

Overview

• Objective
– To work out strategies for implementing an efficient FFT kernel on multiprocessors and FPGA
– To identify the bottlenecks

Previous Work (single processor software implementation)

• Examined 3 FFT algorithms
– Radix-4
– Radix-16
– Radix-8

• Compared them with FFTW

• Analysed these on the following parameters
– Execution time
– Number of complex calculations
– Memory references

• Vectorized the code with gcc

Previous Work : Inference

• For smaller input sizes, cache misses are greatest for radix-16 (there’s a linear increase in misses from radix-4 to radix-16)

• But for large input sizes (>= 4096), the number of cache misses is lowest for radix-8.

• Because of the object-oriented implementation, Complex (object) creation takes the largest number of clock ticks

• Apart from that, the maximum time is taken by complex multiplications, followed by complex additions and complex subtractions
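The ordering of the arithmetic costs above is consistent with the number of real operations behind each complex operation. A minimal sketch, assuming complex values are held as plain (real, imaginary) float pairs rather than Complex objects; the function names are illustrative and not taken from the project code:

```python
def complex_mul(a_re, a_im, b_re, b_im):
    """(a_re + j*a_im) * (b_re + j*b_im): 4 real multiplies, 2 real add/subtracts."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re


def complex_add(a_re, a_im, b_re, b_im):
    """Complex addition: 2 real additions, no multiplies."""
    return a_re + b_re, a_im + b_im


def complex_sub(a_re, a_im, b_re, b_im):
    """Complex subtraction: 2 real subtractions, no multiplies."""
    return a_re - b_re, a_im - b_im
```

Working on float pairs also avoids allocating a new object for every intermediate result, which is one possible way to reduce the object-creation cost noted above.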

Hardware implementation : performance issues

• Circuit area

• Power consumption

• Speed

Algorithms : Cooley Tukey

• Pros:
– Because the Cooley-Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT.

• Cons:
– Much hardware required (16-point FFT: 176 add and 72 multiply operations)
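The way Cooley-Tukey breaks a DFT into smaller DFTs can be sketched as a recursive radix-2 decimation-in-time FFT. This is a minimal illustration, assuming the input length is a power of two; `fft_radix2` and `dft_naive` are hypothetical names, not the project's code:

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Because each half is just a smaller DFT, the two recursive calls could be replaced by any other DFT routine of the right size.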

Algorithms : Winograd

• Pros:
– Designed to minimize the number of multiplies
– Requires much less hardware than Cooley-Tukey (16-point FFT: 74 add and 18 multiply operations)

• Cons:
– Highly irregular addressing sequence, which makes it very inefficient to perform on a microprocessor
– Awkward to factor for input sizes greater than 16

Guidelines for a suitable algorithm

• Construct larger FFTs from small 4-, 8-, and 16-point FFT kernels

• These smaller kernels can be Winograd kernels

• The 8-point FFT is a very special case, as the multiplications can be completely replaced by addition and bit-shift operations

• The 16-point FFT can itself be decomposed into 4-point FFTs, or into 2- and 8-point FFTs
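Two of the guidelines above can be illustrated with a short sketch. The first part builds a 16-point FFT from 4-point kernels using a plain Cooley-Tukey 4 x 4 split; the kernel shown is the ordinary 4-point DFT, not a Winograd kernel. The second part shows one common fixed-point approximation of the 1/sqrt(2) factor that appears in 8-point twiddles, using only shifts and adds. All function names and the choice of 181/256 as the approximation are assumptions made for illustration:

```python
import cmath


def dft4(a):
    """4-point DFT kernel: additions, subtractions and multiplications by +/-j only."""
    s0, s1 = a[0] + a[2], a[0] - a[2]
    s2, s3 = a[1] + a[3], a[1] - a[3]
    return [s0 + s2, s1 - 1j * s3, s0 - s2, s1 + 1j * s3]


def fft16_from_dft4(x):
    """16-point FFT built from eight 4-point kernels plus one twiddle stage."""
    assert len(x) == 16
    # Stage 1: 4-point DFTs of the subsequences x[r], x[r+4], x[r+8], x[r+12]
    sub = [dft4(x[r::4]) for r in range(4)]
    out = [0j] * 16
    for k1 in range(4):
        # Twiddle factors W16^(r*k1), then a 4-point DFT across the sub-results
        z = [cmath.exp(-2j * cmath.pi * r * k1 / 16) * sub[r][k1] for r in range(4)]
        col = dft4(z)
        for k2 in range(4):
            out[k1 + 4 * k2] = col[k2]
    return out


def mul_inv_sqrt2(v):
    """Approximate v / sqrt(2) for an integer v using shifts and adds only.

    Uses 1/sqrt(2) ~ 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 181/256.
    """
    return (v >> 1) + (v >> 3) + (v >> 4) + (v >> 6) + (v >> 8)
```

The shift-and-add form is exact only up to the precision of the chosen approximation, so the number of terms would depend on the word length available in hardware.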

Multiprocessor FFT : Distributing Butterflies

[Figure: 16-point butterfly network from inputs a[0] to a[15] to outputs y[0] to y[15]]

Distributing the butterflies on different processors would involve more IPC

Distributing Input Space

[Figure: 16-point butterfly network from inputs a[0] to a[15] to outputs y[0] to y[15], with the input space divided among processors]

Distributing the input space on different processors would involve less IPC

Distributing FFTs

Bandwidth Measurement

Data sent between Abhogi and saveri at 2 pm (avg. 5.4 MBps)

Bandwidth Measurement

Data sent between jaunpuri and saveri at 11 pm (avg. 5.6 MBps)

Assumptions

• Let $T_N$ denote the time taken to compute the FFT of input size N

• Let the network bandwidth be B (bytes/sec)

• Let the number of processors be p

• Let the time taken to combine two N-point FFTs be $K_N$

4 processor model

[Figure: the N-point input is split into two (N/2)-point halves and then into four (N/4)-point blocks, one each for Processor 1 to Processor 4. Each processor computes FFT(N/4). The results are transferred and combined pairwise, then transferred and combined once more into the final output.]
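The split, FFT and combine tree of the 4 processor model can be simulated sequentially as a sanity check. This is only a sketch: it assumes the input is split by decimation in time (even and odd samples), which the slides do not state, and it runs in one process with no real message passing. The names `combine`, `local_fft` and `four_processor_fft` are hypothetical:

```python
import cmath


def combine(even, odd):
    """Merge the FFTs of the even- and odd-indexed samples (one butterfly stage)."""
    m = len(even)
    out = [0j] * (2 * m)
    for k in range(m):
        t = cmath.exp(-2j * cmath.pi * k / (2 * m)) * odd[k]
        out[k] = even[k] + t
        out[k + m] = even[k] - t
    return out


def local_fft(x):
    """FFT run by one processor on its own block (power-of-two length)."""
    if len(x) == 1:
        return list(x)
    return combine(local_fft(x[0::2]), local_fft(x[1::2]))


def four_processor_fft(x):
    """Simulate the 4-processor tree: split, FFT(N/4) on each block, two combine levels."""
    # Split N -> 2 x N/2 -> 4 x N/4, one block per simulated processor
    blocks = [x[0::4], x[2::4], x[1::4], x[3::4]]
    partial = [local_fft(b) for b in blocks]    # FFT(N/4) on each processor
    even = combine(partial[0], partial[1])      # first-level combine
    odd = combine(partial[2], partial[3])       # first-level combine
    return combine(even, odd)                   # final combine
```

In a real implementation each `local_fft` call would run on a different machine, and the inputs and outputs of `combine` would be the data moved in the transfer steps.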

Pipelined structure

[Figure: timeline of the Send, Recv, FFT(N/4) and Combine steps on processors P1 to P4. Stage durations: N/(2B) and N/(4B) for the transfers in each direction, $T_{N/4}$ for the FFTs, and $K_{N/4}$ and $K_{N/2}$ for the two combine levels.]

The execution time :

$2\left(\frac{N}{2B} + \frac{N}{4B}\right) + T_{N/4} + K_{N/4} + K_{N/2}$

$= \frac{3N}{2B} + T_{N/4} + K_{N/4} + K_{N/2}$

Generalizing this

• For p processors, the total execution time is :

$T_{N/p} + \left(1 - \frac{1}{p}\right)\left(\frac{2N}{B} + K_N\right)$
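The general expression is easy to evaluate for candidate values of p. The sketch below simply codes the formula; the function and parameter names are illustrative, and the numbers in the usage lines are made-up placeholders, not measurements from this project:

```python
def predicted_time(t_sub, n, p, bandwidth, k_n):
    """Execution-time model: T_{N/p} + (1 - 1/p) * (2N/B + K_N).

    t_sub     -- T_{N/p}, time for one processor to compute an (N/p)-point FFT
    n         -- N, data size in the same units that the bandwidth is measured in
    p         -- number of processors
    bandwidth -- B, network bandwidth
    k_n       -- K_N, time to combine two N-point FFTs
    """
    return t_sub + (1 - 1 / p) * (2 * n / bandwidth + k_n)


# Placeholder values chosen only to exercise the formula
single = predicted_time(t_sub=4.0, n=8e6, p=1, bandwidth=4e6, k_n=2.0)
four = predicted_time(t_sub=1.0, n=8e6, p=4, bandwidth=4e6, k_n=2.0)
```

With p = 1 the communication and combine terms vanish and only the FFT time remains. With p = 4 the communication term becomes 3N/(2B), which matches the transfer cost in the 4 processor model.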

Plot (with real values)

[Plots against No. of Processors (p), p = 1 to 10: one panel for Input Size = 65536 (vertical axis ticks from 0.0185 to 0.0225), one panel for input size 262144, and one panel with curves for input sizes 262144, 1048576, 4194304, 16777216, 67108864 and 268435456.]

Further Work

• Multiprocessor Implementation
– Implement the above model and validate it

• Hardware Implementation
– Pipelining
– Best utilization of the FPGA resources

References

• http://www.embedded.com/columns/technicalinsights/199203914?_requestid=265790

• Hugget, Maharatna, Paul, On the implementation of 128-pt FFT/IFFT for High-Performance WPAN

• Michael J. Quinn, Parallel Programming in C with MPI and OpenMP

Thank You
