Post on 11-Nov-2014
This thesis proposes a design of the Fast Fourier Transform and implements it on the Nios II
Embedded Processor. This chapter covers the motivation, problem statement, project objectives,
scope of work, and project contributions, and finally the thesis organization. In the last few decades,
embedded systems have experienced accelerating growth both in computing power and in the scope
of their possible applications. Moreover, the design procedure for embedded systems has also
changed immensely. As application demands keep increasing over time, the complexity
of embedded systems grows accordingly. Combining software and hardware in a design to
improve system performance is an approach known as Co-Design.
1.1 CO-DESIGN
Hardware/software co-design is the main technique used in the thesis. It can be defined as the
cooperative design of hardware and software. Co-design methodology deals with the problem of
designing heterogeneous systems. One of the goals of co-design is to shorten the time-to-market
while reducing the design effort and costs of the designed products. Co-design is typically
applied to embedded systems, in which the processor is the central component. The
advantages of using processors are manifold, because software is more flexible and cheaper than
hardware. This flexibility of software allows late design changes and simplifies debugging.
Furthermore, the possibility of reusing software by porting it to other processors
reduces the time-to-market and the design effort. Finally, in most cases the use of processors is
very cheap compared to the development costs of ASICs, because processors are often produced
in high volume, leading to a significant price reduction. However, the designer falls back on
dedicated hardware when processors are not able to meet the required performance. This trade-off
between hardware and software illustrates the optimization aspect of the co-design problem. Co-
design is an interdisciplinary activity, bringing concepts and ideas from different disciplines
together, e.g. system-level modeling, hardware design and software design.
The design flow of the general co-design approach is depicted in Figure 1.
Step 1: The co-design process starts with specifying the system behavior at the system level.
Step 2: After this, a pure software system will be developed to verify all algorithms.
Step 3: Performance analysis will be performed to find out the system bottlenecks.
Step 4: In the hardware/software partitioning phase, a plan will be made to determine which parts
will be realized in hardware and which in software. Obviously, some
system bottlenecks will be moved into hardware to improve the performance.
Step 5: Based on the results of Step 4, the hardware and software parts will be designed respectively.
Step 6: Co-simulation. At this step, the completed hardware and software parts will be integrated
and performance analysis will be performed.
Step 7: If the performance meets the requirements, the design can stop; if the performance
does not meet the requirements, a new HW/SW partitioning and a new design iteration will follow.
Figure 1- the design flow of the general co-design approach
[Figure 1: the flow runs from system definition and function design, through hardware and software partitioning & co-design, into parallel hardware (design, fabrication) and software (design, code) branches, and ends with integration and test of the HW and SW parts, moving from a primarily virtual prototype to primarily software and hardware.]
1.2 FFT Algorithm
Fast Fourier Transform (FFT) algorithms are widely used in many areas of science and
engineering. Some of the most widely known FFT algorithms are the Radix-2 algorithm, the Radix-4
algorithm, the Split-Radix algorithm, the Fast Hartley Transform based algorithm, and the Quick
Fourier Transform. The Discrete Fourier Transform (DFT) is used to produce frequency analysis of
discrete non-periodic signals. The FFT is another method of achieving the same result, but with
less overhead involved in the calculations.
One of the most widely used techniques in science and engineering is the Fourier
Transform, together with the algorithms based on it. In signal processing, it is primarily used to convert
an input signal from the time domain into the frequency domain and vice versa. In the digital world,
signals are sampled in the time domain, so the Discrete Fourier Transform (DFT) is used.
The DFT is applied to a discrete input signal and yields the frequency characteristics of the
signal as output. Performing the inverse DFT, which has a mathematical form very similar to the
DFT, on the frequency-domain result gives back the signal in the time domain. This means that
converting the signal into the frequency domain exposes the various frequency components
of the input signal, which can then be used to remove certain unwanted frequency components. This
concept is applied in image and audio compression and in filtering communication signals, to
name a few. The Discrete Fourier Transform is a very computationally intensive process based
on summing a finite series of products of input signal values and trigonometric functions; its
time complexity is O(n²). To increase performance, several algorithms were
proposed which can be implemented in hardware or software. This set of algorithms is known
as the Fast Fourier Transform (FFT). The first major FFT algorithm was proposed by Cooley and
Tukey, and many FFT algorithms with a time complexity of O(n log n) followed. Among them
are the Radix-2, Radix-4, and Split-Radix algorithms. In this thesis, we discuss
ways of parallelizing these algorithms to reduce the communication overhead.
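As a concrete illustration of the radix-2 decimation-in-time idea, the following C++ sketch (C++ being the implementation language used later in this thesis) computes an in-place FFT for any power-of-two length, including the 16-point case this project targets. It is a generic double-precision software sketch, not the Nios II implementation described in later chapters.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place radix-2 decimation-in-time FFT; a.size() must be a power of two.
void fft(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    // Bit-reversal permutation: reorder the input so the butterflies can run in place.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // log2(n) butterfly stages, each merging pairs of half-length DFTs.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::complex<double> wlen =
            std::polar(1.0, -2.0 * pi / static_cast<double>(len));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = a[i + k + len / 2] * w;  // twiddle multiply
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

Each of the log₂(n) stages performs n/2 butterflies, which is where the O(n log n) operation count discussed above comes from.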
1.3 Motivation
The Fast Fourier Transform is a critical tool in digital signal processing, where its value in
analyzing the spectral content of signals has led to its use in a wide variety of applications.
The most prevalent of these applications is in the field of communications, where ever-increasing
demands on signal processing capability have given rise to the importance of the Fourier
transform to the field.
However, the Fourier transform is a part of many systems in a wide variety of industrial and
research fields. Its uses range from signal processing for the analysis of physical phenomena to
analysis of data in mathematical and financial systems. The majority of systems requiring
Fourier transforms are real-time systems, which necessitate high-speed processing of data. Given
the complexity of computing the Discrete Fourier Transform, implementing a high-speed Fast
Fourier Transform has required dedicated hardware processors. The majority of high-performance
Fourier transforms have required full-custom integrated circuits, typically in the form of an
application-specific integrated circuit (ASIC). Although much work has been put into raising
performance while reducing hardware requirements and cost, the cost of full-custom hardware
still limits the availability of Fourier transform hardware to low-volume production.
Nevertheless, the development of programmable logic has produced devices that are increasingly
capable of hosting large-scale hardware. High-density field-programmable gate arrays (FPGAs)
already available on the market can boast upwards of 180,000 logic elements, nine megabits of
memory, and on-board processors. Using an FPGA to implement hardware eliminates the long and
costly process of creating a full-custom integrated circuit, along with the time and cost of its
testing and verification, saving design cost, testing cost, and time from design to a functional
device. These features make the FPGA especially attractive for creating embedded processors for
research and development. However, the design of any embedded processor must consider two
important factors, efficiency and flexibility, to reach an ideal design.
1.4 Problem Statement
Efficiency and flexibility are two of the most important driving factors in embedded system
design. Efficient implementations are required to meet the tight cost, timing, and power
constraints present in embedded systems. Flexibility, albeit tough to quantify, is equally
important; it allows system designs to be easily modified or enhanced in response to bugs,
evolution of standards, market shifts, or user requirements, during the design cycle and even
after production. Various implementation alternatives for a given function, ranging from custom-
designed hardware to software running on embedded processors, provide a system designer with
differing degrees of efficiency and flexibility. Unfortunately, it is often the case that these are
conflicting design goals. While efficiency is obtained through custom hardwired
implementations, flexibility is best provided through programmable implementations.
Hardware/software partitioning, separating a system’s functionality into embedded software
(running on programmable processors) and custom hardware (implemented as coprocessors or
peripheral units), is one approach to achieving a good balance between flexibility and efficiency.
1.5 Project Objectives
The aims of this project are as follows:
Design and implementation of the Fast Fourier Transform (FFT) algorithm in an embedded system:
a) Utilizing Nios II embedded processor.
b) Implementation of FFT algorithm using NIOS II Processor without Custom Instruction.
c) Implementation of FFT algorithm using NIOS II Processor with Custom Instruction.
d) Comparison of the two FFT designs, with and without the custom instruction, in terms
of speed and area.
1.6 Scope of Work
Taking into account the resources and time available, this project is narrowed down to the
following scope of work.
1. This project considers only a 16-point floating-point FFT. The Decimation-In-Time (DIT)
algorithm is chosen.
2. The algorithm is implemented in C++ language.
3. The Floating-Point Custom Instruction is targeted for the Nios II platform and implemented
on the ALTERA Cyclone II DE2 board.
4. The NIOS II IDE graphical user interface (GUI) has been used for interfacing with the
FPGA hardware to provide inputs and display outputs.
5. A Universal Serial Bus (USB) connection is used for transmitting and receiving data between
the FPGA board and the host computer.
6. This embedded system is applied to spectral analysis as an application.
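Spectral analysis, the application named in item 6, amounts to computing the magnitude of each frequency bin of the transform and locating dominant components. The sketch below is purely illustrative and is not the project's code: it uses a naive O(n²) DFT for brevity (the project itself uses a 16-point FFT), and the function name `dominant_bin` is a hypothetical helper introduced here.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Illustrative spectral analysis: compute the DFT magnitude of each bin
// below the Nyquist limit and return the index of the strongest one.
std::size_t dominant_bin(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::size_t best = 0;
    double best_mag = -1.0;
    for (std::size_t k = 0; k < n / 2; ++k) {        // only bins below Nyquist
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t)
            sum += x[t] * std::polar(1.0, -2.0 * pi * static_cast<double>(k * t) / n);
        const double mag = std::abs(sum);            // bin magnitude
        if (mag > best_mag) { best_mag = mag; best = k; }
    }
    return best;
}
```

For a 16-sample sinusoid at three cycles per frame, the dominant bin is bin 3, which corresponds to reading the peak off the FFT output in the project's spectral-analysis application.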
1.7 Project Contributions
The most important contributions of this project are:
1. Integration framework of ALTERA development kit platform.
2. Utilizing the Nios II Floating-Point Custom Instruction in the design to increase performance
and accelerate execution.
3. Creation of a simple protocol for interaction and communication between hardware and
software via the computer serial port.
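The thesis does not spell out the wire format of this protocol, so the following C++ sketch is a hypothetical framing scheme, shown only to illustrate the idea of contribution 3: a start byte marks the frame, a length byte sizes the payload, and an additive checksum guards against corruption on the serial link. All constants and function names here are assumptions, not the project's actual format.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical frame layout: [0xA5 start][length][payload bytes][8-bit sum].
std::vector<std::uint8_t> encode_frame(const std::vector<std::uint8_t>& payload) {
    std::vector<std::uint8_t> frame;
    frame.push_back(0xA5);                                   // start-of-frame marker
    frame.push_back(static_cast<std::uint8_t>(payload.size()));
    std::uint8_t sum = 0;
    for (std::uint8_t b : payload) { frame.push_back(b); sum += b; }
    frame.push_back(sum);                                    // additive checksum
    return frame;
}

std::vector<std::uint8_t> decode_frame(const std::vector<std::uint8_t>& frame) {
    if (frame.size() < 3 || frame[0] != 0xA5)
        throw std::runtime_error("bad frame header");
    const std::size_t len = frame[1];
    if (frame.size() != len + 3)
        throw std::runtime_error("bad frame length");
    std::vector<std::uint8_t> payload(frame.begin() + 2, frame.begin() + 2 + len);
    std::uint8_t sum = 0;
    for (std::uint8_t b : payload) sum += b;
    if (sum != frame.back())
        throw std::runtime_error("checksum mismatch");
    return payload;
}
```

A round trip (`decode_frame(encode_frame(p)) == p`) is the basic correctness property any such host-to-board framing must satisfy.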
1.8 Thesis Outline
The thesis is organized into six chapters along with references and an appendix. The first
chapter (this chapter) presents the background of the work, the problem statement, the research
objectives, the scope of work, and the contributions of this project.
Chapter 1 presents a basic overview of the need for an embedded system, co-design, and the FFT
algorithm; it also contains the motivation and proposed work for the project. The last section
presents the outline of the thesis.
Chapter 2 presents a literature survey on the concept of the soft IP core and its use in embedded
systems. It also surveys different IP cores, such as the Nios II, along with the FFT algorithm.
Chapter 3 presents an overview of the methodology, flow chart, algorithm, and system
design using the NIOS II processor.
Chapter 4 gives an introduction to the Fast Fourier Transform. A derivation of the FFT is
presented, concentrating on the radix-2 algorithm, and a complete system description follows:
the embedded system is first explained, followed by the methodology, the Nios II Floating-Point
Custom Instruction, and finally the implementation of the FFT algorithm in the C++ language.
Chapter 5 shows the system results and the Nios II results. All results are appraised and
compared, and a conclusion summarizes the work in this thesis.
Chapter 6 proposes future work, essentially suggesting ways to improve and extend the
current design.
Embedded systems are hardware and software components working together to perform a
specific function. To design an embedded system, the designer should choose a processor core
based on the requirements and performance constraints of the particular application. Each core
has different performance characteristics and features that are suitable for specific
applications. A survey of different cores is given in the following chapter.
2.1 IP core
In information technology, design reuse is the inclusion of previously designed components
in software and hardware. This term is more frequently used in hardware development.
Developers can reuse a component in both similar and completely different applications, for
example a component used as part of a central processing unit for a PC could be reused in a
handheld device or a set-top box. Thus an IP core is defined as a pre-defined, pre-verified
complex functional block that is integrated into the logic of particular design. In electronic
design an IP core is a reusable unit of logic, cell, or chip layout design that is the intellectual
property of one party and is used in making a FPGA or ASIC for a product.
An IP (intellectual property) core is a block of logic or data that is used in making a field-
programmable gate array (FPGA) or application-specific integrated circuit (ASIC) for a
product. As essential elements of design reuse, IP cores are part of the growing electronic design
automation (EDA) industry trend towards repeated use of previously designed components.
Ideally, an IP core should be entirely portable, that is, able to be easily inserted into any vendor
technology or design methodology. Universal Asynchronous Receiver/Transmitters (UARTs),
central processing units (CPUs), Ethernet controllers, and PCI interfaces are all examples of IP
cores. One of the most important product development decisions facing SOC designers today is
choosing an intellectual property (IP) core. It can impact product performance and quality, as
well as time-to-market and profitability. But SOC designers face many challenges when
choosing a core. Determining which core is most appropriate for a given SOC requires careful
consideration. Decisions must be made about the type of core (soft vs. hard), the quality of the
deliverables, and the reliability and commitment of the IP provider. Continuing improvements in
silicon manufacturing technology have made vast amounts of silicon real estate available to
today’s design engineers. Unfortunately, the ability of engineering teams to design circuits has
not kept pace. This imbalance has spawned the IP core industry. IP cores allow design teams to
rapidly create large system-on-a-chip designs (SOCs) by integrating pre-made blocks that do not
require any design work or verification. A number of difficult challenges accompany this new
design style. Depending on the core, they can be minimized or exacerbated. First of all, IP cores
may be delivered to customers in one of two forms: soft or hard. In both cases, the customer
receives a functionally verified design. A soft core, also known as a synthesizable core, is
synthesized by the customer and implemented in its SOC. A hard core, on the other hand, is fully
implemented and ready for manufacturing. (Technically, a design is not implemented until it is
manufactured. In this context, however, implemented means laid-out and ready for
manufacturing.) The SOC team need only drop the hard core into the chip as a single monolithic
piece. Soft and hard cores have different problems and benefits, which are addressed below. An
IP core jump-starts a key part of the SOC design task. The design team gets a verified design,
which enables them to complete their chip in less time with fewer engineering and EDA
resources. However, integrating a core into a chip requires many steps. How easily this is
accomplished, if at all, depends on the deliverables provided. This section details some of the
collateral deliverables that enable easy core integration into all stages of the SOC design process.
Finally, there is the IP vendor to consider. The IP industry is still young and there have been a
number of poor products and even some failures, and they have not been confined to start-ups.
Consequently, a customer must evaluate not only the IP core, but also the IP provider.
2.1.1 Types of IP core
Cores can be classified in three categories: hard, firm and soft.
Soft cores
A synthesizable behavioral description of a complete microprocessor in a hardware
description language (HDL) such as VHDL or Verilog is called a soft core. An HDL is
analogous to a high-level language such as C in the field of computer programming. IP
cores delivered to chip makers as RTL permit them to modify the designs (at the
functional level).
Hard cores
A hard core is generally defined as a lower-level, physical description provided in any of
a variety of physical layout file formats. These layouts must obey the target foundry's
process design rules; hence, hard cores delivered for one foundry's process cannot be
easily ported to a different process or foundry. Such cores, whether analog or digital, are
called ’hard cores’ because the core’s application function cannot be meaningfully
modified by the customer.
Soft vs. Hard Cores
Let’s examine the pros and cons:
Because soft cores are not implemented, they are inherently more flexible in function and
implementation than hard cores. On the other hand, hard core developers can afford to
spend more time optimizing their implementations because they will be used in many
designs. Thus, there is a perception that hard cores offer higher performance.
In fact, high-end, full-custom hard cores designed for the most advanced processes do
offer more performance than soft cores. By using latches, dynamic logic, 3-state signals,
custom memories, and so on, the full-custom design team can achieve much better results
than a fully static synthesized design. For an SOC that requires performance that pushes
the limits of current process and design technology, a full-custom hard core is better able
to meet these needs. However, if the performance target is within the range of a soft core,
then the potential performance advantage of a hard core is immaterial. The SOC design
team can meet its performance goals with a soft core while taking advantage of its
inherent flexibility. (As process technology improves, the maximum frequency limits of
soft cores will also improve, making them an option for even more SOC designs.)
Even at slower clock frequencies, a hard core may offer an advantage in terms of silicon
area. But this is not always true. Often, a hard core is simply hardened using an ASIC-style
methodology, which offers no advantage in area or speed. In other cases, a full-custom
core is not re-optimized for each process generation, thus diminishing its frequency and
area advantages.
Technology Independence & Portability
One of the advantages of a soft core is that it is technology independent. That is to say,
the high level Verilog or VHDL does not require the use of a specific process technology
or standard cell library. This means that the same IP core can be used for multiple
designs, or for future generations of the current design. (Some soft core IP providers use
design styles that make their cores technology-dependent, but the advantages of this
approach are unclear.) A hard core, on the other hand, is very technology-specific. In fact,
if a foundry changes its process parameters or library factors, a hard core may not work
correctly with the process tweaks. This introduces risk, since the IP provider will need to
re-verify the hard core when process parameters are changed. Hard cores can be ported to
new process technologies, but the effort to re-optimize full-custom cores is significant
and costly. It may take two years or more for some advanced microprocessor cores.
Because of this, hard cores are often optically scaled for new processes. While simple and
fast, this technique diminishes many of the advantages of the full-custom optimizations
done by the design team for the original process.
Furthermore, optical scaling introduces additional risk, since it only guarantees that the
new design meets design rules. It does not guarantee correct timing or function. Since the
optical scaling is a short-cut design style, it can be very difficult to fully re-verify an
optically scaled IP core. In reality, soft cores are likely designed with one technology and
library in mind. The design itself is independent of this choice of technology but is
optimized for that one technology and library. Similar technologies will be near-optimal,
but some significantly different technologies (for instance, ones with very slow RAMs)
may not see equivalent results. However, this effect is secondary. Soft cores will
generally be better optimized than optically scaled hard cores.
Speed/Area/Power Optimization
Hard cores are optimized once, when they are implemented by the IP provider. Because
the core is optimized only once, the IP provider can afford to spend significant resources.
Thus, a hard core will typically run faster than a comparable soft core for that one
technology in which it is implemented. But, even in that single technology, it is only
optimized for one set of goals. If the goal is low area at reasonable performance, the
highly tuned performance-optimized hard core may be too large for the application. Soft
cores, on the other hand, can be “application optimized”: Timing, area and power targets
can be adjusted to fit the specific embedded SOC design. For instance, if an SOC uses a
200-MHz clock, then a soft IP core that was designed to run at 250 MHz can be targeted
to run at exactly 200 MHz instead. This allows for smaller area and lower power while
still meeting the design constraints. This application optimization also extends down to
low-level IO timing. The IO constraints of a soft core can be adjusted to exactly fit the
environment the core will be used in. If a hard core has a late output signal, there is little
the SOC designer can do to improve that timing. If an SOC’s speed, area and power
targets are exactly what the hard core was targeted for, then that hard core will be
competitive. For the great majority of designs, however, a soft core will be better
optimized for that particular SOC.
Soft cores offer another advantage over hard cores: compile-time customizations. These
are design options chosen prior to implementation. Cache memory size is a common
compile-time customization. A soft-core processor can be configured for exactly the
cache size needed by the specific embedded application. A hard core, on the other hand,
cannot be customized in this way.
Another customization employed in many soft cores is instruction specialization, or
optional support for certain instructions. For example, support for external coprocessors
may be made optional.
IP cores are released by the IP core provider either as soft cores, firm cores, or hard
cores, depending on the level of changes that the SoC designer (also called the IP core
user) can make to them and the level of transparency they come with when delivered to
the final SoC integrator. A soft core consists of a synthesizable HDL description that can
be synthesized into different semiconductor processes and design libraries. A firm core
contains more structural information, usually a gate-level netlist that is ready for
placement and routing. A hard core includes layout and technology-dependent timing
information and is ready to be dropped into a system, but no changes are allowed to it.
Hard cores usually have a smaller cost, as they are a final plug-and-play design
implemented in a specific technology library and no changes are allowed in them. A
tremendous variety of IP cores of all types and functionalities is available to the SoC
designer. Therefore, designers are given the great advantage of selecting from a rich pool
of well-designed and carefully verified cores and integrating them, in a plug-and-play
fashion, into the system they are designing.
Core category   Changes   Cost     Description
Soft core       Many      High     HDL
Firm core       Some      Medium   Netlist
Hard core       No        Low      Layout
Table 1 - Categories of IP cores
Evaluating the IP Provider
There are many companies that offer IP cores. Many are small, start-up design houses,
and some are large, well-established companies using IP cores as a new method of
delivering their designs to customers. Unfortunately, the size of a company is not an
indicator of IP core quality. The SOC designer should verify the commitment a company
has made to IP core products.
For example, an IP provider that is not completely committed to IP cores has offerings
that may only be previous designs repackaged as IP cores. A company that is serious
about building high quality cores will design them from scratch with reuse in mind. This
section details some hallmarks of designs made to be reusable. First of all, watch out for
soft cores that are the source code for a full-custom hard core. Since these designs were
not originally made to be synthesizable, they will be poor products when compared to
those designed to be synthesizable. When making a hard core, optimizations can be made
based on the known implementation style. However, in a soft core, the implementation is
not yet done, so these shortcuts should not be taken since they may result in non-
functional or sub-optimal implementations. Another thing to look for in a soft core is
registered interface signals. By registering IOs, internal logic can be timing-independent
from anything the SOC team hooks up. Furthermore, it enables easy timing predictability
and gives very good timing constraints to the SOC designer. All of these things make
SOC design easier. A soft core that was designed from the beginning to be reusable will
typically have more configuration choices and more flexibility in implementation. It will
also likely be delivered with multiple design environments in mind. A design made
without reuse in mind will be less flexible in function and implementation. An IP core
with poor deliverables can also be difficult to integrate into an SOC flow. Therefore, it is
very important to evaluate the IP core deliverables to make sure the correct EDA tools
are supported and all steps of the SOC flow can be addressed properly. The choice of the
IP provider is perhaps as important as the choice of the IP core itself. An IP provider that
is making a significant commitment to IP cores is an absolute necessity. Furthermore, the
SOC team needs to know that the IP provider will be there in the future to support the
product as well as to introduce the new products. There are many challenges facing
today’s SOC designer. Using a high-quality IP core from a reputable company should
make those challenges easier, not more difficult.
Firm Core
A firm core is a structural description of a component provided in HDL. Firm cores,
sometimes called semi-hard cores, also carry placement data, like hard cores, but remain
configurable to various applications, providing some retargetability and limited
optimization.
2.1.2 Sources of IP cores
There are two sources: commercial and open source. IP cores available under a license
are commercial cores. Some of the well-known commercial cores are Altera’s Nios II,
Xilinx’s MicroBlaze and PicoBlaze, and Tensilica’s Xtensa. The source code of an
open-source soft-core processor is freely available under the GNU (L)GPL license and
can be downloaded over the internet. Examples are LEON by Gaisler Research and
OpenRISC 1000 from opencores.org.
2.1.3 Comparison of soft IP cores
As part of an extensive library of cores, Altera developed Nios, a processor specifically
designed for programmable logic and system-on-a-programmable-chip integration. The
Nios processor is a pipelined, single-issue RISC processor in which most instructions run
in a single clock cycle. Two versions are available, one with a native 16-bit word size
and one with a native 32-bit word size. A development kit is available which includes a
C/C++ compiler, debugger, and assembler, as well as other development utilities. It also
supports operating systems such as µC/OS, Linux, Nucleus, etc. [6].
Xtensa is Tensilica's best known processor IP architecture. Tensilica's Xtensa architecture
offers a key differentiating feature of a user-customizable instruction set. Using the
supplied customization tools, customers can extend the Xtensa base instruction-set by
adding new user-defined instructions. After the final processor configuration is made and
submitted, Tensilica's processor generator service builds the configured Xtensa IP core,
processor design kit, and software development kit. The software kit is built on the
Eclipse-based integrated development environment, and uses a GNU derived tool-chain.
An instruction set simulator enables customers to begin application development before
actual hardware is available.
In response to the development of soft-core processors, Xilinx introduced the
MicroBlaze processor. The MicroBlaze processor is a 32-bit RISC processor that
supports both 16-bit and 32-bit busses and supports Block RAM and/or external memory.
All peripherals including the memory controller, the UART, and the interrupt controller
run off a standard OPB bus. Additional processor performance can be achieved by
exploiting Virtex-II architecture features such as the embedded multiplier and ALU.
Xilinx also provides GNU-based tools, including a C-compiler, a debugger, and an
assembler, as well as all of the standard libraries.
The fully synthesizable code of the OpenRISC 1000 processor is freely available and
was designed with an emphasis on scalability and platform independence. The
architecture consists of a 32-bit RISC integer unit with a configurable number of
general-purpose registers, configurable cache and TLB sizes, dynamic power
management support, and space for user-provided instructions. A complete GNU-based
development environment is available and includes a C compiler, assembler, linker,
debugger, and simulator.
The LEON3 processor core is a synthesizable VHDL model of a 32-bit processor
compliant with the SPARC V8 architecture. The core is highly configurable and
particularly suitable for system-on-a-chip designs. The core is interfaced using the
AMBA 2.0 AHB bus and supports the IP core plug & play method provided in the
Gaisler Research IP library. The processor can be efficiently implemented on FPGA [4].
3.1 CO-DESIGN METHODOLOGY
The activity of co-design is interchangeable with that of personalization when a service user and
a provider set about creating the desired service for an individual. If an overarching objective is
to make services more personalized, then a huge amount can be learnt about the design of the
activity of personalization by working with a small number of customers.
Through such work, service designers gain insights on how people see and communicate their
needs, how they perceive the role of the provider and the requirements of the support they need.
The practical tools for personalization can be designed with those who will need to use them.
Co-design is a very public and visible process. As uncomfortable as it can feel, transparency
through greater collaboration is key to both managing expectations early and getting honest and
accurate (and therefore useful) outputs. The scale of this openness needs to be managed carefully
as scrutiny by too many can mean that political assuagement or appeasement overrides the
careful crafting of a solution emergent through the process of designing.
Co-design has challenged many professional designers because the idea of allowing anybody to
have a go is seen as a threat to quality as well as a denial of skill and talent. One view is that
there is some truth in this, but it’s often the case that those expressing such a concern are basing
their view on a conventional understanding of what design does, and an unclear picture of an
emerging role for design and designers.
The lone designer can solve simple problems and give form to solutions, but complex challenges
demand collaborative platforms and projects. It’s also worth remembering that tangible and
elegant solutions still need to be designed and this is the unique contribution of trained designers.
A belief is that professional designers are valuable in new ways and not to the detriment of what
designers have always done well. However the activity of designing responses to complex
challenges is too important to leave only to designers.
3.2 Flow Chart
Figure 2 - Work Flow
3.3 Algorithm
1. Study the FFT algorithms (DIT and DIF).
2. Create the embedded system in the Altera Quartus II 10.1 system-on-a-programmable-chip (SOPC) Builder.
3. Implement the FFT algorithms in C++ in the Nios II IDE.
4. Create the serial connection between the hardware and the software.
5. Download the system without the custom instruction to the FPGA (EP2C35F672C6).
6. Download the system with the custom instruction to the FPGA (EP2C35F672C6).
7. Compare the results of both systems.
3.4 System using the Nios II processor
In this project the embedded system is generated using the Nios II soft-core processor. This
chapter describes that generation.
3.4.1 Introduction
The Nios II processor is a general-purpose RISC processor core with the following
features:
Full 32-bit instruction set, data path, and address space
32 general-purpose registers
Optional shadow register sets
32 interrupt sources
External interrupt controller interface for more interrupt sources
Single-instruction 32 × 32 multiply and divide producing a 32-bit result
Dedicated instructions for computing 64-bit and 128-bit products of multiplication
Floating-point instructions for single-precision floating-point operations
Single-instruction barrel shifter.
Access to a variety of on-chip peripherals, and interfaces to off-chip memories and
peripherals.
Hardware-assisted debug module enabling processor start, stop, step, and trace under
control of the Nios II software development tools.
Optional memory management unit (MMU) to support operating systems that
require MMUs.
Optional memory protection unit (MPU)
Software development environment based on the GNU C/C++ tool chain and the
NIOS II Software Build Tools (SBT) for Eclipse
Integration with Altera's SignalTap® II Embedded Logic Analyzer, enabling real-
time analysis of instructions and data along with other signals in the FPGA design.
Instruction set architecture (ISA) compatible across all NIOS II processor systems.
Performance up to 250 DMIPS
A Nios II processor system is equivalent to a microcontroller or “computer on a chip”
that includes a processor and a combination of peripherals and memory on a single chip.
A Nios II processor system consists of a Nios II processor core, a set of on-chip
peripherals, on-chip memory, and interfaces to off-chip memory, all implemented on a
single Altera device. Like a microcontroller family, all Nios II processor systems use a
consistent instruction set and programming model.
The Nios II processor is a configurable soft IP core, as opposed to a fixed, off-the-shelf
microcontroller. We can add or remove features on a system-by-system basis to meet
performance or price goals. Soft means the processor core is not fixed in silicon and can
be targeted to any Altera FPGA family. Altera provides ready-made Nios II system
designs that can be used as is. If these designs meet your system requirements, there is no
need to configure the design further. In addition, we can use the Nios II instruction set
simulator to begin writing and debugging Nios II applications before the final hardware
configuration is determined. Nios II is offered in three configurations: Nios II/f (fast),
Nios II/s (standard), and Nios II/e (economy).
Nios II/f
The Nios II/f core is designed for maximum performance at the expense of core size.
Features of Nios II/f include:
Separate instruction and data caches (512 B to 64 KB)
Optional MMU or MPU
Access to up to 2 GB of external address space
Optional tightly coupled memory for instructions and data
Six-stage pipeline to achieve maximum DMIPS/MHz
Single-cycle hardware multiply and barrel shifter
Optional hardware divide
Dynamic branch prediction
Up to 256 custom instructions and unlimited hardware accelerators
JTAG debug module
Optional JTAG debug module enhancements, including hardware breakpoints, data
triggers, and real-time trace
Nios II/s
Nios II/s core is designed to maintain a balance between performance and cost. Features
of Nios II/s include:
Instruction cache
Up to 2 GB of external address space
Optional tightly coupled memory for instructions
Five-stage pipeline
Static branch prediction
Hardware multiply, divide, and shift options
Up to 256 custom instructions
JTAG debug module
Optional JTAG debug module enhancements, including hardware breakpoints, data
triggers, and real-time trace
Nios II/e
The Nios II/e core is designed for the smallest possible FPGA logic utilization. This is
especially efficient for low-cost Cyclone II FPGA applications. Features of Nios II/e
include:
Up to 2 GB of external address space
JTAG debug module
Complete systems in fewer than 700 LEs
Optional debug enhancements
Up to 256 custom instructions
3.4.2 NIOS II Architecture
The Nios II architecture is a RISC soft core architecture which is implemented entirely in
the programmable logic and memory blocks of Altera FPGAs. The soft-core nature of the
Nios II processor lets the system designer specify and generate a custom Nios II core,
tailored for his or her specific application requirements. System designers can extend the
Nios II's basic functionality by adding a predefined memory management unit, or
defining custom instructions and custom peripherals.
The NIOS II architecture describes an instruction set architecture (ISA). The ISA in turn
necessitates a set of functional units that implement the instructions. A NIOS II processor
core is a hardware design that implements the Nios II instruction set and supports the
functional units described in this document. The processor core does not include
peripherals or the connection logic to the outside world. It includes only the circuits
required to implement the NIOS II architecture.
Figure 3 - Nios II Core Block Diagram
The NIOS II architecture defines the following functional units:
Register file
Arithmetic logic unit (ALU)
Interface to custom instruction logic
Exception controller
Internal or external interrupt controller
Instruction bus
Data bus
Memory management unit (MMU)
Memory protection unit (MPU)
Instruction and data cache memories
Tightly-coupled memory interfaces for instructions and data
JTAG debug module
3.4.3 Processor Implementation
instruction set. However, this does not mean that every unit must be implemented in hardware.
The Nios II architecture describes an instruction set, not a particular hardware
implementation. A functional unit can be implemented in hardware, emulated in
software, or omitted entirely. A Nios II implementation is a set of design choices
embodied by a particular Nios II processor core. Each implementation achieves specific
objectives, such as smaller core size or higher performance. This flexibility allows the
Nios II architecture to adapt to different target applications.
Implementation variables generally fit one of three trade-off patterns: more or less of a
feature; inclusion or exclusion of a feature; hardware implementation or software
emulation of a feature. An example of each trade-off follows:
More or less of a feature—for example, to fine-tune performance, you can increase
or decrease the amount of instruction cache memory. A larger cache increases
execution speed of large programs, while a smaller cache conserves on-chip memory
resources.
Inclusion or exclusion of a feature—For example, to reduce cost, you can choose to
omit the JTAG debug module. This decision conserves on-chip logic and memory
resources, but it eliminates the ability to use a software debugger to debug
applications.
Hardware implementation or software emulation—For example, in control
applications that rarely perform complex arithmetic, you can choose for the division
instruction to be emulated in software. Removing the divide hardware conserves on-
chip resources but increases the execution time of division operations.
3.4.4 Register File
The NIOS II architecture supports a flat register file, consisting of thirty-two 32-bit
general-purpose integer registers, and up to thirty-two 32-bit control registers. The
architecture supports supervisor and user modes that allow system code to protect the
control registers from errant applications. The NIOS II processor can optionally have one
or more shadow register sets. A shadow register set is a complete set of NIOS II general-
purpose registers. When shadow register sets are implemented, the CRS field of the status
register indicates which register set is currently in use. An instruction access to a general-
purpose register uses whichever register set is active. Typical use of shadow register sets
is to accelerate context switching. When shadow register sets are implemented, the NIOS
II processor has two special instructions, rdprs and wrprs, for moving data between
register sets. Shadow register sets are typically manipulated by an operating system
kernel, and are transparent to application code. A Nios II processor can have up to 63
shadow register sets.
3.4.5 Arithmetic Logic Unit
The Nios II ALU operates on data stored in general-purpose registers. ALU operations
take one or two inputs from registers, and store a result back in a register, as shown in
Table 2.
In the hardware implementation, a custom instruction maps software operations such as
addition, subtraction, multiplication, and division directly onto hardware attached to the
ALU of the Nios II processor. Including such hardware reduces the clock cycles and time
required to execute the algorithm.
Category — Details
Arithmetic — The ALU supports addition, subtraction, multiplication, and division on signed and unsigned operands.
Relational — The ALU supports the equal, not-equal, greater-than-or-equal, and less-than relational operators (==, !=, >=, <) on signed and unsigned operands.
Logical — The ALU supports the AND, OR, NOR, and XOR logical operations.
Shift and Rotate — The ALU supports shift and rotate operations, and can shift/rotate data by 0 to 31 bit positions per instruction. The ALU supports arithmetic shift right, logical shift right/left, and rotate right/left.
Table 2 - Operations Supported by the Nios II ALU
3.4.6 Exception and Interrupt Controllers
The NIOS II processor includes hardware for handling exceptions, including hardware
interrupts. It also includes an optional external interrupt controller (EIC) interface. The
EIC interface enables you to speed up interrupt handling in a complex system by adding a
custom interrupt controller.
3.5 Memory and I/O Organization
This section explains hardware implementation details of the NIOS II memory and I/O
organization. The discussion covers both general concepts true of all NIOS II processor systems,
as well as features that might change from system to system. The flexible nature of the Nios II
memory and I/O organization is the most notable difference between Nios II processor
systems and traditional microcontrollers. Because Nios II processor systems are configurable, the
memories and peripherals vary from system to system. As a result, the memory and I/O
organization varies from system to system.
A Nios II core uses one or more of the following to provide memory and I/O access:
Instruction master port—An Avalon® Memory-Mapped (Avalon-MM) master port that
connects to instruction memory via system interconnect fabric
Instruction cache—Fast cache memory internal to the Nios II core
Data master port—An Avalon-MM master port that connects to data memory and
peripherals via system interconnect fabric
Data cache—Fast cache memory internal to the Nios II core
Tightly-coupled instruction or data memory port—Interface to fast on-chip memory
outside the Nios II core.
3.5.1 Instruction and Data Buses
The Nios II architecture supports separate instruction and data buses, classifying it as
a Harvard architecture. Both the instruction and data buses are implemented as Avalon-
MM master ports that adhere to the Avalon-MM interface specification. The data master
port connects to both memory and peripheral components, while the instruction master
port connects only to memory components.
3.5.2 Memory and Peripheral Access
The NIOS II architecture provides memory-mapped I/O access. Both data memory and
peripherals are mapped into the address space of the data master port. The Nios II
architecture uses little-endian byte ordering. Words and half words are stored in memory
with the more-significant bytes at higher addresses. The Nios II architecture does not
specify anything about the existence of memory and peripherals; the quantity, type, and
connection of memory and peripherals are system-dependent. Typically, Nios II
processor systems contain a mix of fast on-chip memory and slower off-chip memory.
Peripherals typically reside on-chip, although interfaces to off-chip peripherals also exist.
3.5.3 Instruction Master Port
The NIOS II instruction bus is implemented as a 32-bit Avalon-MM master port. The
instruction master port performs a single function: it fetches instructions to be executed by
the processor. The instruction master port does not perform any write operations. The
instruction master port is a pipelined Avalon-MM master port. Support for pipelined
Avalon-MM transfers minimizes the impact of synchronous memory with pipeline latency
and increases the overall fMAX of the system. The Nios II processor can prefetch sequential
instructions and perform branch prediction to keep the instruction pipe as active as possible.
The instruction master port always retrieves 32 bits of data. The instruction master port
relies on dynamic bus-sizing logic contained in the system interconnect fabric. By virtue of
dynamic bus sizing, every instruction fetch returns a full instruction word, regardless of the
width of the target memory.
3.5.4 Data Master Port
The NIOS II data bus is implemented as a 32-bit Avalon-MM master port. The data master
port performs two functions:
Read data from memory or a peripheral when the processor executes a load Instruction.
Write data to memory or a peripheral when the processor executes a store Instruction.
3.5.5 Cache Memory
The NIOS II architecture supports cache memories on both the instruction master port
(instruction cache) and the data master port (data cache). Cache memory resides on-chip as
an integral part of the Nios II processor core. The cache memories can improve the average
memory access time for Nios II processor systems that use slow off-chip memory such as
SDRAM for program and data storage. The instruction and data caches are always enabled
at run-time, but methods are provided for software to bypass the data cache so
that peripheral accesses do not return cached data. Cache management and cache coherency
are handled by software. The Nios II instruction set provides instructions for cache
management.
3.5.6 Tightly-Coupled Memory
Tightly-coupled memory provides guaranteed low-latency memory access for
performance-critical applications. Physically, a tightly-coupled memory port is a separate
master port on the NIOS II processor core, similar to the instruction or data master port.
Compared to cache memory, tightly-coupled memory provides the following benefits:
Performance similar to cache memory.
Software can guarantee that performance-critical code or data is located in
tightly-coupled memory.
No real-time caching overhead, such as loading, invalidating, or flushing memory.
3.5.7 Address Map
The address map for memories and peripherals in a Nios II processor system is design
dependent. You specify the address map in Qsys and SOPC Builder. There are three
addresses that are part of the processor and deserve special mention:
Reset address
Exception address
Break handler address
Programmers access memories and peripherals by using macros and drivers. Therefore, the
flexible address map does not affect application developers.
3.5.8 Memory Management Unit
The optional NIOS II MMU provides the following features and functionality:
Virtual to physical address mapping
Memory protection
32-bit virtual and physical addresses, mapping a 4-GB virtual address space into
as much as 4 GB of physical memory
4-KB page and frame size
Low 512 MB of physical address space available for direct access
Hardware translation lookaside buffers (TLBs), accelerating address translation
Separate TLBs for instruction and data accesses
Read, write, and execute permissions controlled per page
Default caching behavior controlled per page
TLBs acting as n-way set-associative caches for software page tables
TLB sizes and associativities configurable in the Nios II Processor parameter editor
Format of page tables (or equivalent data structures) determined by system software
Replacement policy for TLB entries determined by system software
Write policy for TLB entries determined by system software
3.5.9 Memory Protection Unit
The optional NIOS II MPU provides the following features and functionality:
Memory protection
Up to 32 instruction regions and 32 data regions
Variable instruction and data region sizes
Read and write access permissions for data regions
Execute access permissions for instruction regions
Overlapping regions
3.6 JTAG Debug Module
The Nios II architecture supports a JTAG debug module that provides on-chip emulation features
to control the processor remotely from a host PC. PC-based software debugging tools
communicate with the JTAG debug module and provide facilities, such as the following features:
Downloading programs to memory
Starting and stopping execution
Analyzing registers and memory
Collecting real-time execution trace data
3.7 Embedded system generation using NIOS II processor
The Nios II development flow consists of three types of development: hardware design steps,
software design steps, and system design steps. System design steps involve both hardware
and software, and might require input from both sides.
Figure 4 - Flow chart of system generation using the Nios II processor
3.8 Defining and Generating the System in SOPC Builder
After analyzing the system hardware requirements, SOPC Builder is used to specify the Nios II
processor core(s), memory, and other components your system requires. SOPC Builder
automatically generates the interconnect logic to integrate the components in the hardware
system. Components can be selected from a list of standard processor cores and peripherals
provided with the Nios II EDS. Custom hardware can also be added to accelerate system
performance: custom instruction logic can be added to the Nios II core to accelerate CPU
performance, or a custom component can be added to offload tasks from the CPU. The
following step covers
adding standard processor and component cores.
The primary outputs of SOPC Builder are the following file types:
SOPC Builder Design File (.sopc) — it contains the hardware contents of the SOPC
Builder system.
SOPC Information File (.sopcinfo) — it contains a human-readable description of the
contents of the .sopc file. The Nios II EDS uses the .sopcinfo file to compile software for
the target hardware.
Hardware description language (HDL) files — the hardware design files that describe
the SOPC Builder system. The Quartus II software uses the HDL files to compile the
overall FPGA design into an SRAM Object File (.sof).
3.8.1 Integrating the SOPC Builder System into the Quartus II Project
After generating the Nios II system using SOPC Builder, it is integrated into the Quartus
II project. Using the Quartus II software, all tasks required to create the final Cyclone II
FPGA hardware design are performed: assign pin locations for I/O signals, specify
timing requirements, and apply other design constraints.
Finally, compile the Quartus II project to produce a .sof to configure the CYCLONE II
FPGA. Download the .sof to the CYCLONE II FPGA on the target board
(EP2C35F672C6) using an Altera download cable, such as the USB-Blaster™. After
configuration, the FPGA behaves as specified by the hardware design, which in this case
is a Nios II processor system. The Nios II processor and the interfaces needed to connect
to other chips on the DE2 board are implemented in the Cyclone II FPGA chip. These
components are interconnected by means of the interconnection network called the
Avalon Switch Fabric. Memory blocks in the Cyclone II device can be used to provide an
on-chip memory for the Nios II processor. They can be connected to the processor either
directly or through the Avalon network. The SRAM and SDRAM memory chips on the
DE2 board are accessed through the appropriate interfaces.
Input/output interfaces are instantiated to provide connection to the I/O devices used in
the system. A special JTAG UART interface is used to connect to the circuitry that
provides a Universal Serial Bus (USB) link to the host computer to which the DE2 board
is connected. This circuitry and the associated software is called the USB-Blaster.
Another module, called the JTAG Debug module, is provided to allow the host computer
to control the Nios II processor. It makes it possible to perform operations such as
downloading programs into memory, starting and stopping execution, setting program
breakpoints, and collecting real-time execution trace data. Since all parts of the Nios II
system implemented on the FPGA chip are defined by using a hardware description
language, a knowledgeable user could write such code to implement any part of the
system.
4.1 Fast Fourier Transform
In this chapter, several methods for computing the Discrete Fourier Transform (DFT) efficiently
are presented. In view of the importance of the DFT in various digital signal processing
applications, such as linear filtering, correlation analysis, and spectrum analysis, its efficient
computation is a topic that has received considerable attention by many mathematicians,
engineers, and applied scientists. Basically, the computational problem for the DFT is to
compute the sequence {X(k)} of N complex-valued numbers given another sequence of data
{x(n)} of length N, according to the formula

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{nk},  k = 0, 1, …, N−1,  where W_N = e^{−j2π/N}

In general, the data sequence x(n) is also assumed to be complex-valued. Similarly, the Inverse
Discrete Fourier Transform (IDFT) becomes

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) W_N^{−nk},  n = 0, 1, …, N−1

We can observe that for each value of k, direct computation of X(k) involves N complex
multiplications (4N real multiplications) and N−1 complex additions (4N−2 real additions).
Consequently, computing all N values of the DFT requires N² complex multiplications and
N²−N complex additions. Direct computation of the DFT is inefficient primarily because it
does not exploit the symmetry and periodicity properties of the phase factor W_N. In
particular, these two properties are:

Symmetry property: W_N^{k+N/2} = −W_N^k
Periodicity property: W_N^{k+N} = W_N^k

The computationally efficient algorithms described in this section, known collectively as fast
Fourier transform (FFT) algorithms, exploit these two basic properties of the phase factor.
4.2 FFT Algorithms
The FFT exists in two functionally equivalent forms known as decimation in time (DIT) and
decimation in frequency (DIF). Both decompose the DFT through simple computational units
(butterflies), reducing the computational complexity of the DFT from O(N²) to O(N log N).
The various algorithms that result are collectively known as Radix-r fast Fourier transforms.
The most popular radix choices are r = 2 and r = 4, and a commonly used advancement upon
the FFT is the use of a mixed radix.
4.3 Radix-2 FFT Algorithms
The Radix-2 algorithm takes the DFT and applies a common-factor reduction, equating the
original N-point DFT to the combination of two N/2-point sequences:

X(k) = Σ_{m=0}^{N/2−1} x(2m) W_{N/2}^{mk} + W_N^k Σ_{m=0}^{N/2−1} x(2m+1) W_{N/2}^{mk}

This results in processing that follows the signal flow graph shown in Figure 5.
Figure 5 - Radix-2 for an N-point FFT
There are two methods of the radix-2 algorithm:
Decimation in time FFT algorithm (DIT)
Decimation in frequency FFT algorithm (DIF)
4.3.1 Decimation-In-Time FFT
Let us now consider the computation of the N = 2^v point DFT by the divide-and-conquer
approach. We split the N-point data sequence into two N/2-point data sequences f1(n) and
f2(n), corresponding to the even-numbered and odd-numbered samples of x(n), respectively.
Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting
FFT algorithm is called the decimation-in-time FFT algorithm. The equation can be
expressed as:

X(k) = F1(k) + W_N^k F2(k),  k = 0, 1, …, N−1

where F1(k) and F2(k) are the N/2-point DFTs of f1(n) and f2(n).
We observe that the direct computation of F1(k) requires (N/2)² complex multiplications.
The same applies to the computation of F2(k). In addition, N/2 complex multiplications
are required to compute W_N^k F2(k). Hence the computation of X(k) requires
2(N/2)² + N/2 = N²/2 + N/2 complex multiplications. This first step results in a reduction
of the number of multiplications from N² to N²/2 + N/2, which is about a factor of 2 for
N large. Having split the sequence once, F1(k) and F2(k) can in turn be obtained from
N/4-point DFTs by the same relations. The decimation of the data sequence can be repeated
again and again until the resulting sequences are reduced to one-point sequences. For
N = 2^v this decimation can be performed v = log₂N times. Thus the total number of
complex multiplications is reduced to (N/2)log₂N, and the number of complex additions
is N log₂N. As an illustration, the figure depicts the computation of an N = 16 point DFT.
We observe that the computation is performed in four stages, beginning with the
computation of eight 2-point DFTs, then four 4-point DFTs, then two 8-point DFTs, and
finally one 16-point DFT.
Figure 6 - Butterfly structure
4.3.2 Decimation in frequency
Another important radix-2 FFT algorithm, called the decimation-in-frequency algorithm, is
also obtained by the divide-and-conquer approach. To derive the algorithm, we begin by
splitting the DFT formula into two summations, one of which involves the sum over the first
N/2 data points and the other the sum over the last N/2 data points. Thus we obtain

X(k) = Σ_{n=0}^{N/2−1} x(n) W_N^{nk} + Σ_{n=N/2}^{N−1} x(n) W_N^{nk}
     = Σ_{n=0}^{N/2−1} [x(n) + (−1)^k x(n + N/2)] W_N^{nk}
Now let us split (decimate) X(k) into even-numbered and odd-numbered samples. Thus we
obtain

X(2k) = Σ_{n=0}^{N/2−1} [x(n) + x(n + N/2)] W_{N/2}^{nk}
X(2k+1) = Σ_{n=0}^{N/2−1} [x(n) − x(n + N/2)] W_N^n W_{N/2}^{nk},  k = 0, 1, …, N/2−1

The computation procedure above can be repeated through decimation of the N/2-point
DFTs X(2k) and X(2k+1). The entire process involves v = log₂N stages of decimation,
where each stage involves N/2 butterflies. Thus the total number of complex multiplications
is again reduced to (N/2)log₂N, and the number of complex additions is N log₂N. As an
illustration, the figure depicts the computation of an N = 16 point DFT, performed in four
stages as in the decimation-in-time case.
Figure 7 - Decimation in frequency
4.4 Algorithms Implementation
In this project, we implemented both methods, Decimation-In-Time (DIT) and Decimation-In-
Frequency (DIF), to determine which is better in efficiency, speed, performance, and delay.
The following researchers are known to have applied the same methods:
Weidong Li, Jonas Carlsson, Jonas Claeson, and Lars Wanhammar (Electronics Systems,
Department of Electrical Engineering, Linköping University) employed a fast Fourier transform
algorithm in a Globally Asynchronous Locally Synchronous (GALS) design based on the
decimation-in-frequency radix-2 algorithm. They showed in their simulation that DIF has high
performance and efficiency. Mohd Nazrin (UTM 2004) applied a fast Fourier transform
algorithm in FPGA technology. The design was based on the decimation-in-time radix-2
algorithm. Pursuant to his simulation and results, he concluded that "DIT has many advantages
such as high efficiency, speed, performance and low delay." Obviously, both methods give the
same results, but we are looking at performance, speed, hardware cost, and efficiency. Through
this thesis we will learn the advantages and disadvantages of both the DIT and the DIF.
4.4.1 FPGA Implementation of FFT Algorithm
In the hardware implementation, a custom instruction is included in the Nios II processor.
While designing a system that includes a Nios II embedded processor, we can
accelerate time-critical software algorithms by adding custom instructions to the Nios II
processor instruction set. Custom instructions allow you to reduce a complex sequence of
standard instructions to a single instruction implemented in hardware. You can use this
feature for a variety of applications, for example, to optimize software inner loops for
digital signal processing (DSP), packet header processing, and computation-intensive
applications. In SOPC Builder, the Nios II parameter editor provides a GUI to add
custom instructions to the Nios II processor.
Figure 8 - Custom instruction logic connects to the ALU in an SOPC Builder system
In SOPC Builder, the custom instruction logic connects directly to the Nios II ALU.
4.4.3 Design steps for the hardware implementation
After the software implementation, the same procedure is followed with some changes to the
system.
1. For the hardware implementation, the system is generated in SOPC Builder with the
custom instruction included as hardware.
2. The algorithm is implemented on the FPGA using the Nios II IDE.
3. The program calculates the processing time and throughput for each version, to
demonstrate the improved efficiency of the custom instruction compared to the software
implementation.
4.5 HARDWARE IMPLEMENTATION
In the hardware implementation, the custom instruction is added to the Nios II processor. The
SOPC Builder GUI supports the inclusion of custom instructions.
Figure 9 - Custom instruction block in the SOPC Builder GUI
Figure 10 - Inclusion of the custom instruction in the Nios II processor
Figure 9 and Figure 10 show the addition of the custom instruction in the Nios II processor.
The custom instruction is added as floating-point hardware in the processor while keeping all
other peripherals the same, which increases the hardware usage. The custom instruction maps
the memory location from the SRAM interface in SOPC Builder.
Figure 11 - RTL view of the custom instruction
4.5.1 Implementation of Custom Instruction Hardware and Software
Multicycle custom instructions complete in either a fixed or variable number of clock
cycles. For a custom instruction that completes in a fixed number of clock cycles, you
specify the required number of clock cycles at system generation. For a custom
instruction that requires a variable number of clock cycles, you instantiate the start and
done ports. These ports participate in a handshaking scheme that determines when the
custom instruction execution is complete. The execution of such an instruction is shown in
the following custom-instruction hardware port timing diagram.
Figure 12 - Multicycle Custom Instruction Timing Diagram
The processor asserts the active high start port on the first clock cycle of the custom
Instruction execution. At this time, the dataa and datab ports have valid values and
remain valid throughout the duration of the custom instruction execution. The start
signal is asserted for a single clock cycle.
For a fixed-length multicycle custom instruction, after the instruction starts, the
processor waits the specified number of clock cycles, and then reads the value on the
result signal. For an n-cycle operation, the custom logic block must present valid data on
the nth rising edge after the custom instruction begins execution. For a variable-length
multicycle custom instruction, the processor waits until the active-high done signal is
asserted. The processor reads the result port on the same clock edge on which done is
asserted. The custom logic block must present data on the result port on the same clock
cycle on which it asserts the done signal.
The Nios II system clock feeds the custom logic block’s clk port, and the Nios II
system’s master reset feeds the active high reset port. The reset port is asserted only when
the whole Nios II system is reset. The custom logic block must treat the active high
clk_en port as a conventional clock qualifier signal, ignoring clk while clk_en is
deasserted. The Nios II custom instruction software interface is simple and abstracts the
details of the custom instruction from the software developer. For each custom
instruction, the Nios II Embedded Design Suite (EDS) generates a macro in the system
header file, system.h. You can use the macro directly in your C or C++ application code,
and you do not need to program assembly code to access custom instructions. Software
can also invoke custom instructions in NIOS II assembly language. The custom
instruction is added as floating point hardware in the NIOS II processor.
4.6 Results of FFT algorithm using Custom Instruction
The system is generated in SOPC Builder; in this system, the custom instruction is added as
floating point hardware.
Figure 13- System with Custom Instruction in SOPC Builder
Figure 14- System Generation for Custom Instruction
Figure 13 and Figure 14 show the system contents and the generation of the system in SOPC Builder.
After the system is generated in SOPC Builder, pin assignment and compilation are done in Quartus II.
Figure 15- NIOS II system with custom instruction, Block Diagram File view
In Quartus II, the pin assignment is done by importing the pin file of the CYCLONE II
(EP2C35F672C6) FPGA. In Figure 15, the address lines from the SRAM memory are assigned to the
custom instruction.
Figure 16-RTL view of NIOS II System
After successful compilation of the system, hardware generation for the CYCLONE II FPGA is done
and a time-limited programming file is generated, as shown in Figure 17.
Figure 17- Hardware generation of NIOS II System in CYCLONE II
After hardware generation of the system on the CYCLONE II FPGA, the algorithm is implemented
in the NIOS II IDE. The same steps as stated in chapter 5 are followed, with the addition of the
custom instruction files to the NIOS II IDE project.
After generating the hello_world.c file, the custom instruction files floating_point.c,
floating_point.h, floating_point_CI.c and floating_point_SW.c are added to the project. The
project is then built with the Build Project command. While building the project, a macro is
generated in the system.h header file, which connects the C code of the algorithm to the custom
instruction hardware.
After a successful build, the algorithm is implemented on the CYCLONE II
(EP2C35F672C6) FPGA with the Run as NIOS II Hardware command. After execution of this
command, the result is shown in the NIOS II console window.
Figure 18- FFT result without Custom Instruction
Figure 18 shows the output of the FFT algorithm implemented on the FPGA without the custom
instruction, together with the clock cycles and time required for its execution.
Figure 19- FFT result with Custom Instruction
Figure 19 shows the output of the FFT algorithm implemented on the FPGA with the custom
instruction, together with the clock cycles and time required for its execution.
Table no. 3 shows the clock cycles required for execution with and without the custom
instruction.
Type                         Clock Cycles    Time Required (s)
Without custom instruction   7212992         0.14426
With custom instruction      5937794         0.1186
Table no. 3- Clock cycles required for execution of the algorithm
5.1 RESULT
The systems generated in SOPC Builder with and without the custom instruction are compiled in the
Quartus II environment. The FFT algorithm is compared in terms of speed, i.e. the number of clock
cycles required, and area, i.e. the number of logic elements (LEs) used on the FPGA.
Table 4 CPU Clock cycles and time required
The performance analysis shows that the clock cycles and time required to execute the FFT
algorithm in software, i.e. in the system without the custom instruction, are greater than those
required to execute it in hardware, i.e. in the system with the custom instruction.
In the hardware implementation, the custom instruction maps software operations such as addition,
subtraction, multiplication and division directly to hardware attached to the ALU of the NIOS II
processor. Including this hardware reduces the clock cycles and time required for execution of
the algorithm.
The system generated using SOPC Builder is compiled in the Quartus II software. The hardware
required to generate the system depends on the LEs used in the CYCLONE II (EP2C35F672C6)
FPGA. The change in hardware is shown in Table 5.
Type                         Clock Cycles    Time Required (s)
Without custom instruction   7212992         0.14426
With custom instruction      5937794         0.1186
Items                                 Total Count   Without Custom Instruction   With Custom Instruction
Total Logic Elements                  32216         (10%)                        (14%)
Total Combinational Functions         32216         (9%)                         (13%)
Dedicated Logic Registers             32216         (6%)                         (9%)
Total Pins                            475           (9%)                         (15%)
Total Memory Bits                     473840        (10%)                        (10%)
Embedded Multiplier 9-bit Elements    70            (6%)                         (16%)
Table 5- Comparison of compilation reports
The above table compares the software and hardware systems, i.e. the systems without and with the
custom instruction. It shows that including the custom instruction increases the hardware used,
which in turn gives a better result in terms of the clock cycles and time required to execute the
algorithm.
5.2 CONCLUSION
The FFT algorithm, considered as a case study, is implemented using the hardware/software
co-design methodology, which yields an optimized design of the algorithm. The algorithm is
implemented on a CYCLONE II FPGA based around the NIOS II processor, and the speed of the
algorithm is increased by 18.46%. The conclusions of the entire experiment and the project are
presented in this chapter, together with recommendations for enhancing the precision and
performance of the FFT embedded system; the recommendations cover speed and logic cell
requirements.
5.3 Concluding Remarks
This thesis demonstrates the design of an embedded system and the FPGA implementation of the
Fast Fourier Transform algorithm. The algorithm used was a radix-2 decimation-in-time FFT
operating on 32-point floating point data. The FFT embedded system included the floating point
custom instruction as an alternative means of performing the floating point arithmetic
operations. The floating point custom instruction gave the system better performance and speed
in floating point operations, as the results confirm. Moreover, this thesis introduces a
technique for supplying data of any kind to the FPGA development board from a host PC through a
dedicated GUI built with Dev-C++; this method makes the connection simpler, more useful, and more
practical. Finally, the experiments thus far have demonstrated promising results, indicating
that floating point custom instructions can yield large improvements in performance, energy, and
timing, while significantly reducing design turnaround time.
FUTURE SCOPE
There are countless ways in which the designed FFT embedded system can be improved; for
example, higher N-point FFT computation can be introduced, and decimation-in-frequency or
higher-radix algorithm architectures can be used to make the design more robust.
As the selected processor is a soft-core processor, the hardware can be changed to suit the
application. Optimization of the area required for the FFT algorithm can be obtained by designing
optimization approaches for the various blocks of the algorithm.
Here, the FFT algorithm is accelerated using a custom instruction with the NIOS II processor; in
future work, acceleration could instead be performed with the C2H compiler tool.
The project could also be extended into the field of communication, where the ever-increasing
demand for signal processing capability has made the Fourier transform important; the Discrete
Fourier Transform (DFT) is central to digital signal processing applications such as linear
filtering, correlation analysis, and spectrum analysis. Advantages offered by the system are that
the Cyclone II board operates at up to 250 MHz when an external device is connected, and that the
FPGA (EP2C35F672C6) provides about 33,216 logic elements, which enables much more highly
integrated designs.