
International Journal of Applied Research and Studies (iJARS)

ISSN: 2278-9480 Volume 3, Issue 5 (May - 2014)

www.ijars.in

Manuscript Id: iJARS/820 1

Research Article

Design, Simulation, Implementation, and Performance Analysis of a fixed-point 8 Point FFT Core for Real Time Application in Verilog HDL

Authors:

1 Bikash Poudel, 2 Manish Bhattrai*, 3 Sandesh Ghimire

Address for correspondence:

1 Asst. Lecturer, Institute of Engineering, Thapathali Campus

2 Assistant R&D Engineer, Powertech Nepal

3 Engineer, Nepal Electricity Authority

Abstract - The Fast Fourier Transform (FFT), an efficient and ubiquitous tool for computing the Discrete Fourier Transform (DFT), is widely used to transform a signal from the time domain to the frequency domain. Because the FFT requires far fewer computations than direct evaluation of the DFT, it is used extensively in speech recognition (now found in many application lines and products), telecommunication, signal processing, multimedia communication, and other areas. Designing and implementing floating-point (FP) FFT algorithms on FPGAs has long been an active research area and remains a challenging task. This paper proposes a new architecture for an FFT core that computes a radix-2 8-point FFT using fixed-point arithmetic in only eight clock cycles. The key feature of the design is that it aims to maintain good performance with the smallest possible footprint. The design was carried out in the Xilinx ISE 13.2 tool using Verilog HDL. The processor core was simulated with the Xilinx ISIM simulator for functional verification, and its FPGA implementation was successfully verified on the Spartan-3E Starter Kit. This paper also presents a brief analysis of the performance of the FFT core and of the FPGA resources it consumes. The objective of this work is to obtain an area- and time-efficient architecture that can be used as part of a voice processing system.

Keywords: DFT, FFT, FPGA, Xilinx ISE 13.2, Verilog-HDL

*Corresponding Author

Introduction

Audio signal processing is a well-developed field that is used extensively today in telecommunication, multimedia applications, speech recognition for voice-operated systems, and elsewhere. In signal processing one often prefers to work in the frequency domain because of the many advantages it offers. This motivates the Discrete Fourier Transform (DFT), which converts a signal from the discrete-time domain to the discrete-frequency domain. The arithmetic complexity of the DFT is a significant factor in the overall computational cost of a design: direct evaluation of an N-point DFT requires on the order of $N^2$ complex multiplications. Cooley and Tukey [1] developed the well-known radix-2 FFT algorithm to reduce this computational load. Depending on how a set of N inputs is divided into two sets of N/2 values, there are two types of radix-2 (Cooley-Tukey) FFT algorithm: decimation-in-time FFT (DIT-FFT) and decimation-in-frequency FFT (DIF-FFT). We have implemented the decimation-in-frequency FFT algorithm. To understand decimation in frequency, we start with the definition of the DFT:

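$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} = \sum_{n=0}^{N/2-1} \left[ x[n] + (-1)^k\, x\!\left[n + \tfrac{N}{2}\right] \right] W_N^{nk}, \qquad W_N = e^{-j2\pi/N} \tag{1}$$

where the second form splits the sum into its two halves and uses $W_N^{(N/2)k} = (-1)^k$.

For even k, i.e. k = 2m, $(-1)^k = 1$ and $W_N^{2mn} = W_{N/2}^{mn}$:

$$X[2m] = \sum_{n=0}^{N/2-1} \left[ x[n] + x\!\left[n + \tfrac{N}{2}\right] \right] W_{N/2}^{mn}, \qquad m = 0, 1, \ldots, \tfrac{N}{2} - 1 \tag{2}$$

For odd k, i.e. k = 2m+1, $(-1)^k = -1$:

$$X[2m+1] = \sum_{n=0}^{N/2-1} \left[ x[n] - x\!\left[n + \tfrac{N}{2}\right] \right] W_N^{n}\, W_{N/2}^{mn} \tag{3}$$

Using the symbols

$$g[n] = x[n] + x\!\left[n + \tfrac{N}{2}\right] \quad \text{and} \quad h[n] = \left( x[n] - x\!\left[n + \tfrac{N}{2}\right] \right) W_N^{n},$$

equations (2) and (3) can be written as

$$X[2m] = \sum_{n=0}^{N/2-1} g[n]\, W_{N/2}^{mn} \tag{4}$$

$$X[2m+1] = \sum_{n=0}^{N/2-1} h[n]\, W_{N/2}^{mn} \tag{5}$$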

Thus, we started by dividing the N inputs into two halves. Since the twiddle factors $W_N^{nk}$ applied to the first and second halves differ only by the factor $W_N^{(N/2)k} = (-1)^k$, the two halves are simply added when k is even, while for odd k their difference must additionally be multiplied by the twiddle factor $W_N^{n}$.


Thus we grouped the two halves with some modifications; the number of twiddle factors is now halved and we have reached an N/2-point DFT. Equation (4) can be viewed as the N/2-point DFT of $g[n] = x[n] + x[n + N/2]$, and equation (5) as the N/2-point DFT of $h[n] = (x[n] - x[n + N/2])\, W_N^{n}$. Thus, an N-point DFT can be computed by evaluating two N/2-point DFTs.
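Equations (4) and (5) translate directly into a recursive procedure. The Python sketch below is a textbook radix-2 DIF FFT written for illustration only; it is not the authors' Verilog design, and `dif_fft` and `dft` are hypothetical helper names. The direct DFT of equation (1) is included only to check the result.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of equation (1): X[k] = sum_n x[n] W_N^{nk}."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def dif_fft(x):
    """Recursive radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    half = n_pts // 2
    # g[n] = x[n] + x[n + N/2]: its N/2-point DFT gives the even outputs X[2m] (eq. 4)
    g = [x[n] + x[n + half] for n in range(half)]
    # h[n] = (x[n] - x[n + N/2]) W_N^n: its N/2-point DFT gives the odd outputs X[2m+1] (eq. 5)
    h = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_pts) for n in range(half)]
    out = [0j] * n_pts
    out[0::2] = dif_fft(g)
    out[1::2] = dif_fft(h)
    return out
```

For example, `dif_fft([100, 200, 300, 400, 500, 0, 0, 0])` agrees with `dft` of the same sequence to within floating-point rounding.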

Figure 1: Illustration of N-point FFT using two N/2-point FFT

This process can be continued: each N/2-point DFT can be computed from two N/4-point DFTs. For N = 8, figure 1 corresponds to the decimation of the 8-point DFT into two 4-point DFTs. Further decimating the 4-point DFTs into 2-point DFTs yields the butterfly structure shown in figure 5, which has three stages. In general, an N-point DFT decimated this way has $\log_2 N$ stages, and since the computation in each stage is of order N, this kind of decimation reduces the computational complexity from $O(N^2)$ to $O(N \log_2 N)$.
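As a quick check of these counts, the Python sketch below (`radix2_costs` is an illustrative helper, not part of the design) computes the number of stages and butterflies of a radix-2 FFT and the number of complex multiplications a direct DFT needs. For N = 8 it gives 3 stages and 12 butterflies, the twelve butterflies that a fully parallel network would implement with twelve adders, subtracters, and multipliers, against 64 multiplications for the direct DFT.

```python
def radix2_costs(n):
    """Return (stages, butterflies, direct-DFT complex multiplications) for an n-point transform."""
    if n < 2 or n & (n - 1):
        raise ValueError("radix-2 FFT needs a power-of-two length")
    stages = n.bit_length() - 1       # log2(n), exact for powers of two
    butterflies = (n // 2) * stages   # n/2 butterflies in every stage
    direct_mults = n * n              # one multiplication per (n, k) term of equation (1)
    return stages, butterflies, direct_mults
```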

Proposed Methodology

A. Functional Block Diagram of the Radix-2 8-point FFT Computer

The proposed system, whose functional diagram is shown in figure 2, divides the computation of the FFT into three stages: the Input Stage, the Compute Stage, and the Output Stage. In the Input Stage, eight samples are read from the Analog to Digital Converter (ADC) and stored in an 8 × 64-bit Input Buffer, which takes eight clock cycles. The Compute Stage computes the FFT of the eight input samples and generates eight frequency samples in eight clock cycles. It has three blocks: the x-Buffer, which holds the eight input samples from the Input Buffer; the FFT Core, which computes the FFT; and a temporary buffer, which holds the intermediate results. Finally, the Output Stage presents the output on the output ports.

Figure 2: Proposed architecture of the radix-2 8-point FFT.

B. Implementation method of Butterfly Network

The FFT is computed by implementing the butterfly network in a novel and efficient way. Whereas a direct implementation of the butterfly arrangement requires twelve adders, twelve subtracters, and twelve multipliers, the FFT core presented in this paper uses only four adders, four subtracters, and four multipliers, conserving resources without sacrificing the performance of the network. Had all twelve adders, subtracters, and multipliers been used, the whole design would be a single combinational circuit, and the output frequency samples would be generated as soon as new input became available, i.e., within one clock cycle. Because only four of each are used, the butterfly computation is sequenced by the FSM shown in figure 4, which takes eight clock cycles to complete, so the output samples are presented on the output ports only at the end of those eight clock cycles.

This is a three-stage pipelined architecture, with three independent and concurrent stages: the Input Stage, the Compute Stage, and the Output Stage, as shown in figure 5. The Input Stage takes eight clock cycles regardless of the architecture, since fetching eight input samples from the ADC always takes eight clock cycles. Therefore, to complete the FFT before the next set of input samples is available from the ADC, the Compute Stage has at most seven clock cycles to compute the FFT, and the Output Stage has one clock cycle to place the output samples on the output ports, without introducing extra cycles into the overall instruction cycle of the core. To meet this budget, the Compute Stage has been mathematically divided into three sub-stages, each requiring four adders (A1, A2, A3, and A4), four subtracters (S1, S2, S3, and S4), and four multipliers (M1, M2, M3, and M4), as shown in figure 5. For each of the three sub-stages of the Compute Stage (Compute Stage I, Compute Stage II, and Compute Stage III), the same four adders, four subtracters, and four multipliers are reused through the computational architecture shown in figure 7, with the help of 4-to-1 multiplexers, one of whose four input lines is unused.
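The cycle budget above can be checked with a minimal Python timing model. It assumes one ADC sample per clock cycle and that the compute and output phases of one frame run while the next frame is being read; `frame_windows` and `pipeline_is_hazard_free` are illustrative names, not part of the design.

```python
FRAME = 8           # samples per FFT; one ADC sample per clock cycle (from the text)
COMPUTE_BUDGET = 7  # cycles available to the Compute Stage (from the text)
OUTPUT_BUDGET = 1   # cycles available to the Output Stage (from the text)


def frame_windows(frame, compute=COMPUTE_BUDGET, output=OUTPUT_BUDGET):
    """[start, end) cycle windows of the input phase and the compute+output phase of a frame."""
    in_start = frame * FRAME
    in_end = in_start + FRAME
    return (in_start, in_end), (in_end, in_end + compute + output)


def pipeline_is_hazard_free(num_frames, compute=COMPUTE_BUDGET, output=OUTPUT_BUDGET):
    """True if every frame leaves the compute/output hardware before the next frame needs it."""
    for f in range(num_frames - 1):
        _, (_, busy_until) = frame_windows(f, compute, output)
        _, (next_start, _) = frame_windows(f + 1, compute, output)
        if busy_until > next_start:
            return False
    return True
```

With the stated budgets, frame 0 is read during cycles 0 to 7 and processed during cycles 8 to 15, exactly while frame 1 is being read; giving the Compute Stage an eighth cycle (`compute=8`) would make consecutive frames collide.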

Figure 3: Illustration of how an adder is reused in the various sub-stages of the Compute Stage using a 4-to-1 multiplexer

The key to reusing the adders, subtracters, and multipliers is that the Compute Stage has been divided into three sub-stages, as shown in figure 5, each of which requires four adders, four subtracters, and four multipliers. First, in Compute Stage I, the input samples in the x-Buffer are added or subtracted and multiplied according to the butterfly network; the four adders, subtracters, and multipliers generate intermediate results that are stored in the x-Buffer and used as the inputs for Compute Stage II. Next, in Compute Stage II, the same four adders, subtracters, and multipliers are reused to compute the intermediate results used as input for Compute Stage III, as shown in figure 5. Finally, in Compute Stage III, the same four adders, subtracters, and multipliers are used again to generate the final output frequency samples.
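The following Python model sketches this reuse scheme: three DIF stages, each executed by the same four butterfly units (one adder, subtracter, and multiplier each), with results written back in place into a shared list standing in for the x-Buffer. The assignment of index pairs to units A1 to A4 and all function names are assumptions; only A1's operands (x[0] with x[4], x[2], and then x[1]) come from the text. A standard property of in-place DIF is that the outputs end up in bit-reversed order, which is consistent with A1 producing X[0] from x[0] and x[1] in Compute Stage III.

```python
import cmath

N = 8  # points; an 8-point DIF FFT has three stages


def stage_pairs(stage):
    """(top, bottom) buffer indices for the four butterflies of a DIF stage.

    stage 0, 1, 2 stand for Compute Stage I, II, III; butterfly spans are 4, 2, 1.
    """
    span = N >> (stage + 1)
    return [(start + off, start + off + span)
            for start in range(0, N, 2 * span)
            for off in range(span)]


def twiddle(stage, top):
    """W_N^k for the difference branch; k = (position inside its group) * 2**stage."""
    span = N >> (stage + 1)
    k = (top % span) << stage
    return cmath.exp(-2j * cmath.pi * k / N)


def fft8_in_place(samples):
    buf = [complex(s) for s in samples]   # models the shared x-Buffer
    for stage in range(3):                # the same four units serve every stage
        results = []
        for a, b in stage_pairs(stage):   # one (adder, subtracter, multiplier) set per pair
            results.append((a, buf[a] + buf[b], b, (buf[a] - buf[b]) * twiddle(stage, a)))
        for a, total, b, diff in results: # write-back step, in place
            buf[a], buf[b] = total, diff
    return buf                            # X[k] sits at index bit_reverse3(k)


def bit_reverse3(i):
    """Reverse the three index bits, e.g. 1 -> 4 and 3 -> 6."""
    return int(f"{i:03b}"[::-1], 2)
```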

The reuse of the four adders, four subtracters, and four multipliers is illustrated in figure 3. The same set of components is reused by placing multiplexers on the input lines of each component to select different inputs in different Compute Stages. Suppose the core is in COMPUTE STAGE I of the FSM shown in figure 4, which corresponds to Compute Stage I of figure 5. Adder A1 has to add the input samples x[0] and x[4], as dictated by the butterfly network of figure 5. This is done by sending 2'b00 from the controller FSM to the multiplexer connected to the second port of Complex Adder A1, which makes the multiplexer feed x[4] to the adder, as shown in figure 3. In the WRITE BACK I stage, the sum computed by Complex Adder A1 is stored in x[0] of the x-Buffer, which previously held the first sample from the ADC. In COMPUTE STAGE II of the FSM, which corresponds to Compute Stage II of figure 5, the same Complex Adder A1 has to add the intermediate samples x[0] and x[2]; the controller FSM sends the selection value 2'b01 so that the multiplexer selects x[2] for the second input of the Complex Adder. The sum is written back to the x-Buffer in WRITEBACK STAGE II. Finally, in COMPUTE STAGE III of the FSM in figure 4, which corresponds to Compute Stage III of figure 5, the same Complex Adder A1 sums the intermediate samples x[0] and x[1], generating the output frequency sample X[0]. The first input to the Complex Adder is again x[0], while the second input, x[1], is selected by the multiplexer when the controller FSM sends 2'b10 on the selection line. The output sample X[0] is presented to the output port in WRITEBACK STAGE III. Another important point about this design is that the same x-Buffer that initially holds the input samples from the ADC is also used to hold the intermediate results of the sub-stages of the Compute Stage, as shown in figure 5.
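The multiplexer selection for Complex Adder A1 can be modeled behaviorally in a few lines of Python. Only the select codes 2'b00, 2'b01, and 2'b10 come from the text; `mux4`, `A1_SELECT`, and `a1_operands` are illustrative names, and `None` stands for the unused fourth input.

```python
# Select codes sent by the controller FSM to the mux on Complex Adder A1's second port
# (values from the text; the dictionary and function names are illustrative).
A1_SELECT = {
    "COMPUTE STAGE I": 0b00,    # selects x[4]
    "COMPUTE STAGE II": 0b01,   # selects x[2]
    "COMPUTE STAGE III": 0b10,  # selects x[1]
}


def mux4(sel, in0, in1, in2, in3):
    """Behavioral 4-to-1 multiplexer: returns the input chosen by the 2-bit select."""
    return (in0, in1, in2, in3)[sel & 0b11]


def a1_operands(x_buffer, stage):
    """Operands of Complex Adder A1: x[0] on the first port, the mux output on the second."""
    second = mux4(A1_SELECT[stage], x_buffer[4], x_buffer[2], x_buffer[1], None)  # in3 unused
    return x_buffer[0], second
```

In the real core the x-Buffer is overwritten in each write-back step, so the same index x[0] holds a different value in each Compute Stage; the model only shows which buffer slot reaches each adder port.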

The complete architecture of the core is shown in figure 7, which shows how the multiplexers are placed at the input ports of the adders, subtracters, and multipliers to select different sets of inputs in the different sub-stages of the Compute Stage.

Figure 4: FSM that dictates how the FFT Core of figure 7 operates.


Figure 5: Segregation of the Butterfly Network into three stages: Input, Compute, and Output Stage.

The input samples from the ADC are in 14-bit two's complement form, and the twiddle factors are represented as 10-bit two's complement fixed-point numbers. Since a twiddle factor is a complex number, two 10-bit fixed-point numbers represent its real and imaginary parts, respectively. Multiplying a 14-bit value by a 10-bit value produces a result of at most 24 bits, so at most 24 bits of storage are needed for each of the real and imaginary parts of a sample during twiddle factor multiplication. In the actual implementation, however, 32 bits are used for each of the real and imaginary parts of every sample, so that future revisions of the core can use twiddle factors of up to 16 bits (instead of the 10 bits used in this design) for a more accurate fixed-point representation of their floating-point values. All of the adders, subtracters, […]
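The bit-width argument above can be checked with a short Python sketch. The 14-bit and 10-bit widths come from the text; the choice of 8 fractional bits for the twiddle factors (`TWIDDLE_FRAC`, so that +1.0 is stored as 256) is an assumption, since the paper does not state the scaling, and all helper names are illustrative.

```python
import cmath

SAMPLE_BITS = 14   # ADC samples: 14-bit two's complement (from the text)
TWIDDLE_BITS = 10  # each twiddle part: 10-bit two's complement (from the text)
TWIDDLE_FRAC = 8   # ASSUMPTION: 8 fractional bits, so +1.0 is stored as 256


def signed_range(bits):
    """Smallest and largest value of a bits-wide two's-complement number."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def bits_needed(value):
    """Smallest two's-complement width that can hold value."""
    bits = 1
    while not signed_range(bits)[0] <= value <= signed_range(bits)[1]:
        bits += 1
    return bits


def worst_case_product_bits(a_bits, b_bits):
    """Width needed for the product of an a_bits-wide and a b_bits-wide signed value."""
    return max(bits_needed(a * b)
               for a in signed_range(a_bits) for b in signed_range(b_bits))


def quantize_twiddle(k, n=8):
    """Round W_n^k to integers with TWIDDLE_FRAC fractional bits (hypothetical scaling)."""
    w = cmath.exp(-2j * cmath.pi * k / n)
    scale = 1 << TWIDDLE_FRAC
    re, im = round(w.real * scale), round(w.imag * scale)
    lo, hi = signed_range(TWIDDLE_BITS)
    assert lo <= re <= hi and lo <= im <= hi, "twiddle does not fit in TWIDDLE_BITS"
    return re, im
```

The worst case is (-8192) × (-512) = 2^22, which needs 24 bits, matching the figure in the text; with 16-bit twiddle factors the same check gives 30 bits, inside the 32-bit containers. Note that forming the real part ac - bd of a full complex product can need one more bit than a single real product.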


[…] COMPUTE STAGE III, and the results are finally sent from the Temp-Buffer to the output port in the OUTPUT STAGE. Since each state of the FSM shown in figure 4 takes one clock cycle, the core takes eight clock cycles to compute the FFT and present the output frequency samples on the output ports.
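Figure 4 is not reproduced in this transcript, so the following Python sketch of the controller FSM is built only from the state names mentioned in the text. The first state, `LOAD_X_BUFFER`, is a hypothetical placeholder for the eighth state, which the text does not name; the model simply confirms that eight one-cycle states give eight clock cycles per FFT.

```python
from enum import Enum


class State(Enum):
    LOAD_X_BUFFER = 0      # HYPOTHETICAL: the text does not name the eighth state
    COMPUTE_STAGE_I = 1
    WRITE_BACK_I = 2
    COMPUTE_STAGE_II = 3
    WRITE_BACK_II = 4
    COMPUTE_STAGE_III = 5
    WRITE_BACK_III = 6
    OUTPUT_STAGE = 7       # results sent from the Temp-Buffer to the output ports


def next_state(state):
    """One transition per clock edge; OUTPUT_STAGE wraps around to the next frame."""
    return State((state.value + 1) % len(State))


def cycles_per_fft(start=State.LOAD_X_BUFFER):
    """Count clock cycles until the FSM returns to its starting state."""
    state, cycles = next_state(start), 1
    while state is not start:
        state, cycles = next_state(state), cycles + 1
    return cycles
```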

D. Verilog Design of the FFT Core in Xilinx ISE

The Verilog [2] module fft8point, whose block diagram is shown in figure 8, computes the radix-2 8-point FFT; the HDL code snippet declaring its ports is shown in figure 9. The name and description of each port signal are listed in table 1. The eight input samples enter the core through the input ports Px0 to Px7, each 14 bits wide. The core then evaluates the FFT of these eight time-domain samples, generating eight frequency samples. These are presented on the eight output ports X0 to X7, each 64 bits wide, where the upper 32 bits hold the real part and the lower 32 bits hold the imaginary part of the output frequency sample.
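A testbench or host program reading the X0 to X7 ports needs to split each 64-bit word into its two halves. The Python sketch below assumes both 32-bit fields are two's complement, consistent with the arithmetic described earlier; `pack_output`, `to_signed32`, and `unpack_output` are hypothetical helper names.

```python
MASK32 = (1 << 32) - 1


def pack_output(real, imag):
    """Pack a frequency sample: real part in bits 63..32, imaginary part in bits 31..0."""
    return ((real & MASK32) << 32) | (imag & MASK32)


def to_signed32(raw):
    """Interpret a 32-bit field as a two's-complement integer."""
    return raw - (1 << 32) if raw & (1 << 31) else raw


def unpack_output(word):
    """Split a 64-bit output word back into (real, imag)."""
    return to_signed32((word >> 32) & MASK32), to_signed32(word & MASK32)
```

For example, unpacking the word produced by `pack_output(-542, -725)` returns `(-542, -725)`.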

Figure 8: Verilog Module of the FFT Core


Figure 9: Verilog Code Snippet of the Core

Table 1: List of ports of the designed core with their function

S.N. | Port Name | Function | Width (in bits) | Direction
1 | Px0 – Px7 | Takes eight input samples from the Input Buffer | 14 each | Input
2 | X0 – X7 | Present output samples | 64 each (32-bit real part, 32-bit imaginary part) | Output
3 | Clk | Global clock | 1 | Input
4 | inputValid | Asserts that the input samples at the input ports are valid | 1 | Input
5 | Reset | Global reset signal for the core | 1 | Input


Result, Discussion and Summary

A. Result and Discussion:

The functional verification of the FFT core was done with the ISIM simulator of Xilinx ISE. A snapshot of the simulation result is shown in figure 10. The input sequence fed to the FFT core, the corresponding output samples, and a comparison with the expected output are shown in table 2. The execution time is 0.16 µs (8 cycles / 50 MHz); that is, the 8-point FFT is computed in only eight clock cycles.
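The timing figure follows from simple arithmetic, sketched here in Python. The clock frequency and cycle count come from the text; the FFTs-per-second figure additionally assumes, as the pipeline discussion implies, that a new frame is accepted every eight cycles.

```python
CLOCK_HZ = 50e6     # clock frequency used in the simulation (from the text)
CYCLES_PER_FFT = 8  # clock cycles per 8-point FFT (from the text)

exec_time_s = CYCLES_PER_FFT / CLOCK_HZ      # 8 / 50 MHz
ffts_per_second = CLOCK_HZ / CYCLES_PER_FFT  # assumes one new frame every 8 cycles

print(f"{exec_time_s * 1e6:.2f} us per FFT, {ffts_per_second / 1e6:.2f} million FFTs/s")
```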

Figure 10: Simulation waveform showing the input samples, output samples, and control signals.

The waveform shown in figure 10 illustrates that the designed core runs smoothly and produces the correct output. The exact results are in general non-integer, but because the fractional (floating-point) values are represented in fixed-point format, the core produces integer approximations of the exact results.


The magnitude error is at most 0.174% (table 2), so the FFT core calculates the 8-point FFT with good precision.

Table 2: Comparison between the actual output from MATLAB calculation and the computed output from the designed core

S.N. | Input | MATLAB Output | Observed Output | Error in magnitude (%)
1 | 100 + 0i | 1500 + 0i | 1500 + 0i | 0
2 | 200 + 0i | 300 + 200i | 300 + 200i | 0
3 | 300 + 0i | -541.4 - 724.3i | -542 - 725i | 0.102
4 | 400 + 0i | -258.6 - 124.3i | -259 - 124i | 0.174
5 | 500 + 0i | 300 - 0i | 300 - 0i | 0
6 | 0 + 0i | 300 - 200i | 300 - 200i | 0
7 | 0 + 0i | -258.6 + 124.3i | -258 + 125i | 0.102
8 | 0 + 0i | -541.4 + 724.3i | -541 + 724i | 0.174
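The paper does not define the error percentage in Table 2. Assuming it is the magnitude of the error vector relative to the magnitude of the exact value (an assumption, labeled in the code), the Python sketch below recomputes the exact DFT of the input column and the error for rows 3 and 4; the results, roughly 0.10% and 0.17%, are close to the 0.102% and 0.174% reported there. The sketch also confirms that the MATLAB column contains the eight DFT values X[0] to X[7], although the table does not list them in index order.

```python
import cmath

x = [100, 200, 300, 400, 500, 0, 0, 0]  # input column of Table 2


def dft(seq):
    """Direct DFT, equation (1)."""
    n_pts = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def magnitude_error_pct(observed, reference):
    """ASSUMED metric: |observed - reference| / |reference| * 100."""
    return abs(observed - reference) / abs(reference) * 100


X = dft(x)
err_row3 = magnitude_error_pct(-542 - 725j, X[1])  # row 3 matches X[1]
err_row4 = magnitude_error_pct(-259 - 124j, X[3])  # row 4 matches X[3]
print(f"row 3: {err_row3:.3f} %, row 4: {err_row4:.3f} %")
```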

B. Design Summary

The design summary, a report generated by Xilinx ISE [3], lets designers view information such as the targeted device, device utilization, and design goal. The FFT core was implemented on the Spartan-3E Starter Board. The RTL schematic of the FFT processor, a basic logical representation of the circuit in terms of logic primitives that is generated once the design is correct at the simulation and synthesis levels, is shown in figure 12. The design summary generated by Xilinx ISE is shown in figure 11.


Figure 11: Design Summary generated by Xilinx ISE

Figure 12: RTL Schematic of the Designed Core


Summary and Conclusion

This paper presents an 8-point FFT processor with a new architecture that aims for high performance with optimized resource consumption. The whole design is implemented in Verilog HDL with Xilinx ISE 13.2, and functional verification is done with the ISIM simulator. Our design shows favorable results in terms of both the physical resources and the throughput required for real-time applications such as audio processing. Beyond these performance results, other considerations such as ease of implementation, cost, and performance need to be evaluated to select the best approach for a given set of system requirements. The design has a very simple port interface, so it can easily be incorporated into any other system that requires FFT computation. Because of the fixed-point representation of floating-point values, the design produces a maximum error of 0.2% in the result, which is well within a reasonable tolerance. Another important point is that this core can be extended to compute higher-point FFTs with little modification. Further, the core can be a very useful tool for analyzing the frequency content of any discrete-time signal in real time.

References:

[1] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing, 2nd ed., Tenth Impression, Pearson Education, 2012, pp. 655–681.

[2] J. Bhasker, A Verilog HDL Primer, 3rd ed., Star Galaxy Publishing, 2005.

[3] Xilinx, ISE In-Depth Tutorial, UG695 (v12.1), April 2009.

[4] Young-jin Moon and Young-il Kim, "A Mixed-Radix 4-2 Butterfly with Simple Bit Reversing for Ordering the Output Sequences," ICACT 2006, vol. 4, pp. 1772–1774, February 2006.

[5] A. Sreir.sr, C. Ka-a-Terki, H. Mshrez, and S. Negus, "A Flexible High Performance Serial Radix-2 FFT Butterfly Arithmetic Unit," IEEE Transl. J. Magn. Japan, vol. 2, pp. 26–29, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].

[6] Xilinx, LogiCORE FFT Processor guide.

[7] Chung-Ping Hung, Sau-Gee Chen, and Kun-Lung Chen, "Design of an Efficient Variable-Length FFT Processor," ISCAS 2004, vol. 4, pp. 833–836.

