FFT Accelerator Project

Page 1: FFT Accelerator  Project

FFT Accelerator Project

Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)

14th September, 2007

Supervisors:
Dr. Kolin Paul
Prof. M. Balakrishnan

Page 2: FFT Accelerator  Project

Overview

• Objective
  – To work out strategies for implementing an efficient FFT kernel on multiprocessors and FPGAs
  – To identify the bottlenecks

Page 3: FFT Accelerator  Project

Previous Work (single processor software implementation)

• Examined 3 FFT algorithms
  – Radix-4
  – Radix-16
  – Radix-8
• Compared them with FFTW
• Analysed them on the following parameters
  – Execution time
  – Number of complex calculations
  – Memory references
• Vectorized the code with gcc

Page 4: FFT Accelerator  Project

Previous Work : Inference

• For smaller input sizes, radix-16 incurs the most cache misses (misses increase linearly from radix-4 to radix-16).
• For large input sizes (>= 4096), radix-8 incurs the fewest cache misses.
• Because the implementation is object-oriented, creating Complex objects takes the largest share of clock ticks.
• After object creation, most of the time goes to complex multiplications, followed by complex additions and complex subtractions.

Page 5: FFT Accelerator  Project

Hardware implementation : performance issues

• Circuit area
• Power consumption
• Speed

Page 6: FFT Accelerator  Project

Algorithms : Cooley Tukey

• Pros:
  – Because the Cooley-Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT.
• Cons:
  – Requires a lot of hardware (16-point FFT: 176 add and 72 multiply operations)
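The way Cooley-Tukey breaks a DFT into smaller DFTs can be sketched as a recursive radix-2 FFT. This is a minimal illustration in Python, not the project's implementation: the names `fft_radix2` and `dft_naive` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) DFT, used here only as a reference for checking."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT.

    Assumption: len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # smaller DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # smaller DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


if __name__ == "__main__":
    signal = [complex(i % 5, 0) for i in range(16)]
    error = max(abs(a - b) for a, b in zip(fft_radix2(signal), dft_naive(signal)))
    print("max deviation from the naive DFT:", error)
```

The two recursive calls are independent of each other, so either could be replaced by a different small-DFT algorithm without changing the combining loop.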

Page 7: FFT Accelerator  Project

Algorithms : Winograd

• Pros:
  – Designed to minimize the number of multiplies
  – Requires much less hardware than Cooley-Tukey (16-point FFT: 74 add and 18 multiply operations)
• Cons:
  – Highly irregular addressing sequence, which makes it very inefficient to run on a microprocessor
  – Awkward to factor for input sizes greater than 16

Page 8: FFT Accelerator  Project

Guidelines for a suitable algorithm

• Construct larger FFTs from small 4-, 8-, and 16-point FFT kernels
• These smaller kernels can be Winograd kernels
• The 8-point FFT is a special case: its multiplications can be completely replaced by addition and bit-shift operations
• The 16-point FFT can itself be decomposed into 4-point FFTs, or into 2- and 8-point FFTs
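As an illustration of building a larger FFT from small kernels, the sketch below assembles a 16-point FFT from 4-point kernels using the Cooley-Tukey 4 x 4 factorization. It is a hedged example, not the project's design: the function names are hypothetical, and the 4-point kernel shown is the plain add/subtract form (its twiddles are only 1, -1, j and -j), not a Winograd kernel.

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) DFT, used only to check the result."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def mul_neg_j(z):
    """Multiply by -j with a swap and a sign change (no multiplier needed)."""
    return complex(z.imag, -z.real)


def dft4(x):
    """4-point DFT kernel that uses additions and subtractions only."""
    s02, d02 = x[0] + x[2], x[0] - x[2]
    s13, d13 = x[1] + x[3], x[1] - x[3]
    rot = mul_neg_j(d13)  # -j * (x[1] - x[3])
    return [s02 + s13, d02 + rot, s02 - s13, d02 - rot]


def fft16_from_dft4(x):
    """16-point FFT built from eight 4-point kernels plus one twiddle stage."""
    assert len(x) == 16
    # Stage 1: 4-point DFTs over the strided columns x[n2], x[n2+4], x[n2+8], x[n2+12].
    cols = [dft4(x[n2::4]) for n2 in range(4)]
    # Twiddle stage: multiply entry (n2, k1) by W_16^(n2 * k1).
    for n2 in range(4):
        for k1 in range(4):
            cols[n2][k1] *= cmath.exp(-2j * cmath.pi * n2 * k1 / 16)
    # Stage 2: 4-point DFTs across the columns; output index is k1 + 4 * k2.
    out = [0j] * 16
    for k1 in range(4):
        row = dft4([cols[n2][k1] for n2 in range(4)])
        for k2 in range(4):
            out[k1 + 4 * k2] = row[k2]
    return out


if __name__ == "__main__":
    signal = [complex(i, 16 - i) for i in range(16)]
    error = max(abs(a - b) for a, b in zip(fft16_from_dft4(signal), dft_naive(signal)))
    print("max deviation from the naive DFT:", error)
```

In this arrangement the only general complex multiplications are in the twiddle stage; the kernels themselves need none, which is why small kernels are attractive in hardware.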

Page 9: FFT Accelerator  Project

Multiprocessor FFT : Distributing Butterflies

[Figure: 16-point FFT butterfly network from inputs a[0] to a[15] to outputs y[0] to y[15]]

Distributing the butterflies across different processors would involve more IPC.

Page 10: FFT Accelerator  Project

Distributing Input Space

[Figure: 16-point FFT butterfly network from inputs a[0] to a[15] to outputs y[0] to y[15], with the input space partitioned among processors]

Distributing the input space across different processors would involve less IPC.
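One way to picture distributing the input space is the single-process simulation below. It assumes an interleaved (decimation-in-time) split into four blocks, which the slides do not state explicitly; the names `combine` and `fft_four_workers` are hypothetical. Each block is transformed on its own, so communication is only needed to hand out the blocks and to collect results for the combine steps.

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) DFT; stands in for whatever local FFT a processor runs."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def combine(even_fft, odd_fft):
    """Merge the FFTs of the even- and odd-indexed samples into one FFT twice as long."""
    half = len(even_fft)
    n = 2 * half
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd_fft[k]
        out[k] = even_fft[k] + t
        out[k + half] = even_fft[k] - t
    return out


def fft_four_workers(x, local_fft=dft_naive):
    """Simulate 4 processors: split the input, transform locally, combine twice.

    Assumption: len(x) is divisible by 4.
    """
    # Split the input space; each block would be sent to one processor.
    blocks = [x[r::4] for r in range(4)]
    # Independent local transforms: no communication during this phase.
    parts = [local_fft(block) for block in blocks]
    # First combine level: blocks (0, 2) form the even half, (1, 3) the odd half.
    even = combine(parts[0], parts[2])
    odd = combine(parts[1], parts[3])
    # Final combine, as it would run on a single processor.
    return combine(even, odd)


if __name__ == "__main__":
    signal = [complex(i % 7, i % 3) for i in range(32)]
    error = max(abs(a - b) for a, b in zip(fft_four_workers(signal), dft_naive(signal)))
    print("max deviation from the naive DFT:", error)
```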

Page 11: FFT Accelerator  Project

Distributing FFTs

Page 12: FFT Accelerator  Project

Bandwidth Measurement

Data sent between Abhogi and saveri at 2 pm (avg. 5.4 MBps)

Page 13: FFT Accelerator  Project

Bandwidth Measurement

Data sent between jaunpuri and saveri at 11 pm (avg. 5.6 MBps)

Page 14: FFT Accelerator  Project

Assumptions

• Let T_N denote the time taken to compute the FFT of input size N
• Let the network bandwidth be B (bytes/sec)
• Let the number of processors be p
• Let the time taken to combine two N-point FFTs be K_N

Page 15: FFT Accelerator  Project

4 processor model

[Figure: binary task tree for 4 processors. Levels, top to bottom:]

Input: N points
(N/2) points | (N/2) points
(N/4) pts | (N/4) pts | (N/4) pts | (N/4) pts
FFT(N/4) | FFT(N/4) | FFT(N/4) | FFT(N/4)
Combine | Combine
Combine

[The edges between levels are labelled "transfer"; the nodes are labelled Processor1 to Processor4.]

Page 16: FFT Accelerator  Project

Pipelined structure

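[Figure: pipeline timeline for processors P1 to P4. The input is distributed with Send/Recv steps, each processor computes FFT(N/4), and the partial results are returned with Send/Recv steps and merged in Combine steps.]

Stage durations labelled in the figure: N/(2B), N/(4B), N/(4B), N/(2B), T_{N/4}, K_{N/4}, K_{N/2}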

The execution time:

2(N/(2B) + N/(4B)) + T_{N/4} + K_{N/4} + K_{N/2}

= 3N/(2B) + T_{N/4} + K_{N/4} + K_{N/2}

Page 17: FFT Accelerator  Project

Generalizing this

• For p processors, the total execution time is:

  T_{N/p} + (1 – 1/p)(2N/B + K_N)
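The general expression can be evaluated numerically with a small helper such as the one below. This is a sketch under stated assumptions: `example_fft_time` and `example_combine_time` are placeholder cost functions with made-up constants, not measurements from the project, and `bytes_per_point` is an extra parameter added here so that N (in points) can be converted to bytes. Leaving it at 1 reproduces the slide's expression as written.

```python
import math


def model_time(n, p, bandwidth, fft_time, combine_time, bytes_per_point=1):
    """Evaluate T_{N/p} + (1 - 1/p) * (2N/B + K_N).

    fft_time(m) plays the role of T_m and combine_time(m) the role of K_m.
    bytes_per_point is an assumption of this sketch, not part of the slides.
    """
    transfer = 2 * n * bytes_per_point / bandwidth  # the 2N/B term
    return fft_time(n / p) + (1 - 1 / p) * (transfer + combine_time(n))


def example_fft_time(m, c=1e-8):
    """Hypothetical T_m, proportional to m * log2(m); c is a made-up constant."""
    return c * m * math.log2(m) if m > 1 else 0.0


def example_combine_time(m, k=1e-8):
    """Hypothetical K_m, linear in m; k is a made-up constant."""
    return k * m


if __name__ == "__main__":
    n = 65536
    bandwidth = 5.5e6  # bytes/sec, roughly the measured 5.4 to 5.6 MBps
    for p in (1, 2, 4, 8):
        t = model_time(n, p, bandwidth, example_fft_time, example_combine_time)
        print(f"p={p}: modelled time {t:.6f} s")
```

Reproducing the plotted curves would require substituting measured values of T and K for the placeholder cost functions.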

Page 18: FFT Accelerator  Project

Plot (with real values)

[Figure: time (s) against no. of processors (p = 1 to 10) for input size = 65536; the time axis runs from 0.0185 s to 0.0225 s]

Page 19: FFT Accelerator  Project

[Figure: time (s) against no. of processors (p = 1 to 10) for input sizes 65536 and 262144; the time axis runs from 0 to 0.18 s]

Page 20: FFT Accelerator  Project

[Figure: time (s) against no. of processors (p = 1 to 10) for input sizes 65536, 262144, 1048576, 4194304, 16777216, 67108864 and 268435456; the time axis runs from 0 to 450 s]

Page 21: FFT Accelerator  Project

Further Work

• Multiprocessor implementation
  – Implement the above model and validate it
• Hardware implementation
  – Pipelining
  – Best utilization of the FPGA resources


Page 23: FFT Accelerator  Project

Thank You

