FAULT TOLERANCE IN FPGA SYSTEMS
Karisma Ramesh
451126715
CSE661
December 12, 2014
FAULT TOLERANCE IN FPGA SYSTEMS | Karisma Ramesh
Page 1 of 22
Abstract
Field Programmable Gate Arrays (FPGAs) are of high importance in many
different applications and represent a substantial investment for the computer
engineering community. They are highly reconfigurable, and thus highly complex.
FPGAs must work reliably, and so methods of detection, diagnosis, and fault
tolerance have been developed to make sure that FPGAs work when they are
needed. The abundance of configurable logic elements and routing resources in
recent FPGAs provides a cost-effective means of tolerating permanent faults in the
system: once a permanent fault occurs, the FPGA is reconfigured by replacing the
faulty part with previously unused resources in the same hardware. Reliability and
process variability are serious issues for FPGAs in the future. Fortunately, FPGAs
have the ability to reconfigure in the field and at runtime, thus providing
opportunities to overcome some of these issues. This paper provides a
comprehensive survey of fault tolerance schemes specifically for FPGAs, with the
goal of laying a strong foundation for future research in this field. All methods and
schemes are qualitatively compared, and some particularly promising approaches
are highlighted. Methods for fault tolerance that cover single- and multi-FPGA
systems, as well as systems experiencing singular faults or multiple faults at once,
are described. In this paper, two column-based precompiled configuration
techniques are also presented for tolerating permanent faults in FPGA-based
systems. By compiling alternative configuration versions in the design phase,
these approaches ensure fast reconfiguration, and thus a tremendous increase in
system availability. In addition, intentional similarities are created among
different configuration versions so that the storage overhead due to precompiled
configurations is reduced by orders of magnitude through differential coding and
run-length coding.
Introduction
As process technology scaling continues, manufacturing large defect-free
integrated circuits will become increasingly difficult. There is also the added problem
of degradations over time after a device has been successfully deployed in the field.
These reliability issues are particularly acute with FPGAs. Similar to memories, FPGAs
have a high density of transistors and interconnect wires. A recent study suggested
that future FPGAs at and beyond the 45 nm technology node will have such low yield
that defect tolerance schemes will be unavoidable in large FPGAs. FPGAs present
interesting opportunities for fault tolerance due to their ability to be reconfigured. A
fault tolerant system consists of two main components; these are fault detection and
fault repair. Design of fault-tolerant systems (FTS) is gaining popularity and
importance, because high reliability and high availability are becoming a standard
requirement. Typical application areas are transportation, communications, on-line
process control, and multiple-access transaction processing. Fault tolerance (FT)
can be achieved by using spare resources to replace the faulty ones and to take
over their function. However, this leads not only to increased area overhead and
degraded system performance, but also to increased energy consumption, heat
dissipation, weight, and volume. Due to the current progress in semiconductor
technology, these drawbacks are becoming less important and are largely
outweighed by the desirable functional properties of the resulting systems. There
are many approaches to FTS design, which can be classified according to several
criteria. Among the most important criteria is the level at which FT techniques are
applied. Here we can distinguish three main streams: system-level FT, building-
block (intermediate) level FT, and component-level FT. As designers mostly cannot
change the internal structure of the components used, the first two approaches are
used more frequently and are therefore developed to a higher degree of
sophistication.
FPGA Architecture:
In general, FPGAs consist of regular arrays of programmable logic blocks (PLBs)
connected to each other by a programmable routing matrix. An FPGA configuration
defines the functionality of an FPGA, specifying which logic blocks are used and which
wire segments are used to connect them, as well as what functionality each block
provides. This structure is illustrated in Figure 1.
Figure 1: Architecture of the programmable logic core in FPGAs.
In this model, the programmable logic core of an FPGA consists of an array of three
basic elements: Configurable Logic Blocks (CLBs), Connection Boxes (CBs), and
Switch Boxes (SBs). In SRAM-based FPGAs, a CLB contains several SRAM lookup tables
(LUTs) to store user-defined logic functions. A CLB also contains flip-flops,
multiplexers, and dedicated circuitry to optimize the performance of user
applications. CLBs are connected through horizontal and vertical wiring channels
between two neighboring rows or columns. To enhance the connectivity, there are
various kinds of wires with different lengths for connecting CLBs that are different
blocks apart. For example, in Xilinx Virtex-series FPGAs as in Figure 2, single lines
connect adjacent CLBs, while hex lines connect CLBs that are three or six blocks apart.
There are two types of routing devices, CBs and SBs, to direct the signal flows among
CLBs and wiring channels. CBs serve as a local bridge between CLBs and the adjacent
wiring channels. SBs are switch matrices that connect horizontal and vertical wiring
channels. In SRAM-based FPGAs, the state of connections in these routing devices is
controlled by SRAM cells, which are configured according to the desired functionality.
Figure 2: Configuration frame architecture in Xilinx Virtex-series FPGAs.
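To make the LUT description above concrete, a k-input LUT can be modeled behaviorally as a 2^k-entry truth table indexed by the input bits; the configuration bitstream fills the table. The following Python sketch is an illustration of the principle only, not a vendor-specific model:

```python
class LUT:
    """Behavioral model of a k-input lookup table: the configuration
    fills the truth table; evaluation is a simple table index."""
    def __init__(self, k, truth_bits):
        assert len(truth_bits) == 2 ** k
        self.k = k
        self.table = list(truth_bits)

    def eval(self, *inputs):
        # pack the 0/1 input values into an index into the truth table
        idx = 0
        for i, bit in enumerate(inputs):
            idx |= bit << i
        return self.table[idx]

# configure a 2-input LUT as XOR: outputs for inputs 00, 01, 10, 11
xor_lut = LUT(2, [0, 1, 1, 0])
```

Reprogramming the same hardware to a different function is just a matter of loading a different truth table, which is what makes configuration-level repair possible.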
Fault Tolerance
A fault can be defined as a physical occurrence within an FPGA that causes it
to malfunction, such as a broken wire caused during manufacture by a dust particle.
Faults usually occur at the beginning and end of a chip’s life cycle. Fabrication faults,
or defects, are usually caused by contaminants or other flaws in the manufacturing
process and are detected during manufacture testing. Late life faults are usually due
to failure of device resources. However, faults can and do appear at any point during
the life cycle of an FPGA. Faults may also be transient (impermanent). Transient faults
are more difficult to detect because they do not constantly manifest themselves. As
seen in Figure 3, the failure rate after the chip has left the factory and before the end
of its life is typically constant and is due to environmental stresses. The time axis may
be compressed, possibly significantly, if the device is used in a harsh environment.
Figure 3: System failure rate during the life cycle of an FPGA.
FPGA fault tolerance methods can be divided into two groups, based on the
level of abstraction. The first group attempts to deal with faults at the level of the
FPGA hardware. These methods, which will be termed device-level (DL), generally
attempt to construct a fault-free array from a larger array containing faulty resources.
When faults are discovered, permanent routing or logic changes are made within the
FPGA, selecting redundant hardware resources to replace the faulty ones. Because
alterations must be made at the hardware level, DL methods are generally only used
for yield enhancement.
The second group of methods takes a higher-level approach, tolerating
faults at the level of the FPGA configuration. These configuration-level (CL) methods
treat the FPGA as a set of abstract resources, often represented as a graph structure,
without considering the actual physical structure of the device. When a circuit is
placed-and-routed, fault-free resources are selected from the set of available
resources. The status of resources as faulty vs. fault-free is considered each time a
circuit is placed-and-routed, so CL methods are able to tolerate new faults in the field.
Obviously CL methods are not transparent to the tools, so tolerance of new faults
requires additional configuration time. Both classes of fault tolerance have strengths
and weaknesses, and not all approaches fit neatly into one of these categories.
Deciding which method is best for a given purpose depends on several metrics,
including speed, power consumption, size, and cost. Generally, DL methods are used
by manufacturers, while CL methods are used by end-users.
Application of Fault Tolerance
All of the fault detection and repair methods surveyed have individual strengths and
weaknesses, and the most appropriate method depends on the application. In
some cases reliability is critical for safety or mission success. Fast detection is crucial
here so that erroneous data or state is not acted upon.
Another class of reliable system is one where a physical repair is not practical. Fault
coverage and repair flexibility are important here. In the light of variability and
reliability concerns associated with future VLSI process nodes, it may become
economical to use fault tolerance in general purpose, high volume applications. In this
case, it will be important that the detection and repair method has the lowest possible
overhead on timing performance and area.
METHODS OF FAULT DETECTION
The first step of a fault-tolerant scheme is fault detection. Fault detection has two purposes;
firstly, it alerts the supervising process that action needs to be taken for the system to remain
operational and, secondly, it identifies which components of the device are defective so that
a solution can be determined. These two functions may be covered simultaneously, or detection
may be a multi-stage process comprising different strategies. Fault detection methods can be
categorised into three broad types:
1. Redundant/concurrent error detection uses additional logic as a means of
detecting when a logic function is not generating the correct output.
2. Off-line test methods cover any testing which is carried out when the FPGA is not
performing its operational function.
3. Roving test methods perform a progressive scan of the FPGA structure by swapping
blocks of functionality with a block carrying out a test function.
Functional redundancy and Concurrent Error Detection (CED)
Redundancy is widely used as a method of fault detection in FPGAs, particularly in the
form of Triple Modular Redundancy (TMR). The main driver for error detection of
this kind is the need to detect and correct errors due to SEUs and SETs. However,
these methods will detect any error which occurs whilst the system is operating.
These detection methods work by invoking more logic than the minimum necessary
to implement the logic function. When an error occurs, the multiple parts of the
circuit processing a particular calculation disagree, and this disagreement is flagged
by some form of error detection function.
The simplest form is modular redundancy. A functional block is replicated, usually
two or three times and the outputs compared. Any difference indicates that a fault is
present. Concurrent Error Detection (CED) allows a more space efficient design than
modular redundancy. Data flows and stores are widened using error coding
algorithms such as parity. Data validation circuitry at the output to functional blocks
can then detect faults which arise. Redundancy provides a very fast means of error
detection, as a fault is uncovered as soon as a discrepancy occurs. In addition, this
form of error detection has a small impact on timing performance; just the latency of
voting or parity logic, or similar. The chief drawback of error detection using
redundancy is the area overhead needed to replicate functionality, which can be over
three times in the case of TMR. Furthermore, it provides limited resolution for
identification of the faulty component. The fault can only be pinned down to a
particular functional block or, in the case of TMR, an instance of a functional block.
Fault resolution can be increased to a certain extent by breaking functional areas
down and adding additional error detection logic. Modular redundancy detects all
faults which become manifest at the output of a functional block, including transient
errors. In CED, coverage comes at the expense of additional area. These methods
provide no coverage of dormant faults, which may be relevant if an FPGA is going to
be reconfigured in the field, either for fault repair or to alter the functionality.
Redundancy does not have to be restricted to the circuit-area dimension. It is also
possible to detect errors in a trade-off with latency/data throughput. One published
scheme carries out each operation twice. In the second operation, operands
are encoded in such a way that they exercise the logic in a different way. The output
is then passed through a suitable decoder and compared to the original. Although
most of the work on redundancy has been aimed at detecting and correcting SEUs,
there have been some notable publications which apply the techniques to fault
detection. Parity checking is used as part of a fault-tolerant scheme which is
structured so that detection is applied to small regular networks, rather than being
bespoke to the function that is implemented. Redundant and data-checking detection
systems are generally designed into an FPGA configuration, as they fit around the
specific data and control functions that are implemented.
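The modular-redundancy principle described above can be sketched behaviorally. The following Python fragment (a software illustration, not HDL) models a bitwise TMR majority voter: any single corrupted replica is masked by the vote, and any disagreement raises the detection flag:

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three replica outputs.
    Returns (voted_value, error_flag): the voted value masks a fault
    in any single replica, and the flag is the detection signal."""
    majority = (a & b) | (a & c) | (b & c)
    error = not (a == b == c)
    return majority, error

# one replica produces a corrupted word: the vote masks it and flags it
value, error = tmr_vote(0b1011, 0b1011, 0b1111)
```

This is exactly the fast-detection, coarse-diagnosis trade-off noted above: the voter knows a replica disagreed, but not which internal resource failed.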
Off-line fault detection / Built-In Self-Test
Off-line fault detection is another widely-used technique, usually as a means of
quickly identifying manufacturing defects in FPGAs. Any scheme which does this
without any external test equipment is known as Built-In Self-Test (BIST), and is a
suitable candidate for fault detection in the field. BIST schemes for FPGAs work by
having one or more test configurations which are loaded separately to the operating
configuration. Within the test configuration is a test pattern generator, an output
response analyser and, between them, the logic and interconnect to be tested,
arranged in paths-under-test. To be fully comprehensive, a BIST system will have to
test not only the logic and interconnect, but also the configuration network. Many
recent FPGAs feature a self-configuration port which can make this possible without
the need for a large number of different test configurations, which can take a long
time to load. Specialised features such as carry chains, multipliers and PLLs also need
to be considered. Compared to traditional built-in and external test methods for ASICs,
FPGAs have the advantage of a regular structure which does not need a new test
programme to be developed for each design. Also, the ability to reconfigure an FPGA
reduces or removes the need for dedicated test structures to be built into the silicon.
However, with the ability to reconfigure comes a vast number of permutations in the
way the logic can be expressed, making optimization of test patterns important. The
advantage of BIST as a fault detection method is that it has no impact on the FPGA
during normal operation. The only overhead is the requirement to store test
configurations which are typically small. BIST also allows complete coverage of the
FPGA fabric, including features which may be hard to test with an on-line test system,
such as PLLs and the clock network. The major drawback of BIST is that it can only
detect faults during a dedicated test mode when the FPGA is not otherwise
operational. This can be either during system start-up, as part of a maintenance
schedule or in response to an error detected by some other means. Published BIST
methods have competed for coverage, test duration and memory overhead. Many
focus on testing just one subset of FPGA structures, e.g. interconnect, suggesting a
multi-phased approach for testing the whole chip. Testing of LUTs is a mature field
and recent developments have considered timing performance as well as stuck-at
faults. Many publications have focussed on testing interconnect in response to the
large amount of configuration logic and silicon area it consumes. One published BIST
system for interconnect reduces test time through a large degree of self-configuration.
Elements of BIST can be found in roving test systems, where only a
small part of the FPGA is taken off-line for testing at any point in time.
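The test loop described above (pattern generator, path-under-test, response analyser) can be sketched behaviorally. In this Python illustration, the paths and the golden reference function are hypothetical stand-ins for configured FPGA logic:

```python
def bist_scan(path_under_test, width, golden):
    """Drive a path-under-test with every input pattern (the test
    pattern generator) and compare each response against a golden
    reference (the output response analyser). Returns the list of
    patterns that exposed a fault."""
    failures = []
    for pattern in range(2 ** width):
        if path_under_test(pattern) != golden(pattern):
            failures.append(pattern)
    return failures

golden = lambda x: x ^ (x >> 1)        # intended function of the path
faulty = lambda x: (x ^ (x >> 1)) | 1  # same path with a stuck-at-1 LSB
```

Exhaustive patterns are feasible only for short paths; real BIST configurations partition the fabric so each path-under-test stays small.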
Roving fault detection
Roving detection exploits run-time reconfiguration to carry out BIST techniques on-
line, in the field, with a minimum of area overhead. In roving detection, the FPGA is
split into equal-sized regions. One of these is configured to perform self-test, while
the remaining areas carry out the design function of the FPGA. Over time, the test
region is swapped with functional regions one at a time so that the entire array can
be tested while the FPGA remains functional. Roving test has a lower area overhead
than redundancy methods; the overhead comprises one self-test region and a
controller to manage the reconfiguration process. The method also gives excellent
fault coverage and granularity, comparable to BIST methods. The speed of detection,
while better than off-line methods that cannot test during normal operation, is not as
good as redundancy techniques. The detection speed depends on the period of a
complete roving cycle; the best reported implementations of roving test have
maximum detection latency on the order of a second. Roving test impacts
performance in two ways. Firstly, as the test region is moved through the FPGA,
connections between adjacent functional areas are stretched. This results in longer
signal delays and may force a reduction in the system clock speed, reported to be in
the range of 2.5-15%. Secondly, implementations in current FPGAs require the
functional blocks to be halted as they are switched.
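The roving schedule itself can be sketched as rotating a single test region across equal-sized tiles, testing each in turn. In this Python sketch the per-region self-test is abstracted as a callable, and the hardware swap is only noted in a comment:

```python
def roving_cycle(num_regions, test_region):
    """One full roving pass: each region in turn is taken off-line and
    self-tested while the others keep running. Returns the list of
    regions found faulty over the whole cycle."""
    faulty = []
    for r in range(num_regions):
        # in hardware, this step swaps region r with the self-test area
        if not test_region(r):
            faulty.append(r)
    return faulty

# region 3 is modeled as defective; the cycle locates it
result = roving_cycle(8, lambda r: r != 3)
```

The detection latency discussed above is simply the wall-clock time of one full pass of this loop in hardware.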
Methods of Fault Tolerance
Single Fault Tolerance:
These are mostly transient faults caused by extraordinary circumstances in the
environment. Examples of these are charged particles striking the FPGA while it is in
space, or radioactive materials emitting energy which lodges inside the FPGA's
vulnerable systems. Because SEU faults are so common, a plethora of methods has
been devised to mitigate them. Described here are some of the more noteworthy
efforts in this area.
Researchers in Milan used Triple Modular Redundancy and partial dynamic
reconfiguration to mitigate SEU faults in FPGAs. Their TMR was the simple
passive model, and they showed that this successfully mitigates soft error
effects. They also evaluated different levels of granularity. This led to an
examination of the tradeoffs between cost, performance, and recovery time.
A group at the Federal University of Rio Grande in Brazil created a fault
tolerant system for use in space, which as stated has many environmental
hazards that must be dealt with. Their system was implemented on a Xilinx
Virtex FPGA, but could be extended to other brands. They determined that
transient faults can cause a permanent effect because of the permanent nature
of SRAM-based FPGA cells. Their method used Duplication with Comparison
and Concurrent Error Detection based on time redundancy. They chose this
way because it worked well without the power and pin overhead caused by a
traditional TMR system. In addition, this reduces the necessary hardware by
replacing some hardware redundancy with time redundancy.
At the University of Florence a group devised a fault tolerant architecture that
avoids the effects of Single Event Upset in avionics applications. They showed
that the error rate of a device increases rapidly as the altitude of the FPGA
increases. They also found that SRAM FPGAs are used much more in avionics
applications than other types of FPGA devices. Further they found that the
configuration memory has the worst upset rate out of the pieces that make up
the critical components of the Integrated Control Panel for military aircraft
cockpits. In order to avoid these catastrophic single event upsets they used a
form of information redundancy that fixed the problems while also saving on
the cost and usage of FPGA resources. They found that for this particular
application and under these circumstances that a combination of a SEC-DED
code and a parity bit provided the best solution. The SEC-DED code protects
registers from upset, and the parity bit shields signals against transient faults.
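The SEC-DED idea can be illustrated with an extended Hamming(8,4) code in Python: three parity bits locate a single flipped bit, and an overall parity bit distinguishes single from double errors. This is the generic textbook construction, not necessarily the specific code used in the avionics work described above:

```python
def hamming_secded_encode(d):
    """Encode 4 data bits [d1,d2,d3,d4] into an extended Hamming(8,4)
    codeword: positions 1-7 are Hamming(7,4), bit 8 is overall parity."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4              # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4              # covers positions 4,5,6,7
    code = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in code:
        p0 ^= bit                  # overall parity (the SEC-DED extension)
    return code + [p0]

def hamming_secded_decode(code):
    """Return (data, status): 'ok', 'corrected' (single error fixed),
    or 'double-error' (detected but uncorrectable)."""
    c = list(code[:7])
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # position of a single flipped bit
    overall = code[7]
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                 # odd overall parity: single error
        if syndrome != 0:
            c[syndrome - 1] ^= 1       # flip the located bit back
        status = "corrected"
    else:                              # even parity but nonzero syndrome
        status = "double-error"
    return [c[2], c[4], c[5], c[6]], status
```

A parity bit alone (as used for the signals above) only detects; it is the extra Hamming check bits that buy register correction.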
Multiple Fault Tolerance:
While single faults account for many of the problems that FPGA systems encounter, there are
some environments or applications where multiple faults can happen simultaneously.
Some of these situations are due to the fact that feature size continues to decrease. This by
itself can cause many faults in a manufactured device.
One such method utilizes unused resources present in the FPGA to place repair
circuits for in-field FPGA repair. These unused portions are usually caused by
manufacturing defects. Normal place and route algorithms ignore them because they
do not provide the necessary complete functionality. Despite this researchers at
Brigham Young University identified times and areas where the unused resources can
be taken into account, and even utilized to create fault tolerance. Their method
identified repair circuits that could be used when necessary before all possible
permanent faults occurred. This method fixes faults in programmable logic, fixed
functions such as DSP and BRAM, and routing resources. These alternate repair
circuits, whether cached locally or made available remotely, can provide a high
degree of fault tolerance in the target device. One disadvantage, though, is that
an incredibly large amount of computation must be done beforehand in order to
create these alternate repair configurations. In their paper they show how this all
works on a Xilinx Virtex-4 FPGA, and outline three possible placement algorithms for
these repair circuits.
Years ago researchers published an important paper on efficiently supporting fault
tolerance in FPGAs. Their method used a set of tiles to provide the desired level of
fault tolerance. They investigated the physical resources, interface, and net-list that
each tile would require. Further they found that having multiple configurations for
each tile would give increased reliability under various operating circumstances.
They showed that partially reconfiguring an FPGA to an alternate configuration in
response to a fault mitigated many different kinds of faults. This led to many other
great discoveries in this area that other researchers have since been able to capitalize
on. This kind of fault tolerance was hierarchical, with both tiles and atomic fault-
tolerant blocks.
A group at the University of Florida found that single-event upset hardened FPGAs
have reconfiguration limitations and small capacities, and as such are only suitable
for a limited range of applications. To combat this they devised a reconfigurable fault
tolerance framework. This framework enabled their system to dynamically self-adapt
to whatever the current upset rate happened to be. Further their method has two
modes, one for high reliability and one for high performance. Their system has the
ability to decide which mode to operate in, depending on the current upset rate.
Because many hazardous environments, such as space, have a high degree of
variability, their system can make smart decisions based on the current
circumstances. This self-adaptation allows their system to operate in many different
environments, as well as operate most efficiently in a given environment. When there
are many faults entering the system from external sources it can run in a highly
reliable fashion. Once the cause of these faults settles, the system can then switch to
high-performance mode and get a lot of work done while it is able.
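The mode-selection policy described above can be sketched as a simple threshold rule on the measured upset rate. The threshold value and mode names here are illustrative assumptions, not figures from the cited work:

```python
def select_mode(upsets_per_hour, threshold=10.0):
    """Reconfigurable fault-tolerance policy sketch: above the upset-rate
    threshold, run triplicated for high reliability; below it, run the
    replica regions as independent workers for high performance."""
    if upsets_per_hour >= threshold:
        return "high-reliability"   # e.g. TMR across three regions
    return "high-performance"       # regions compute independently
```

A real controller would also hysterese around the threshold to avoid thrashing between modes as the measured rate fluctuates.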
Hardware Level Fault Tolerance:
Hardware-level tolerance performs a correction such that the FPGA remains
unchanged for the purposes of the configuration. The device retains its original
number and arrangement of useable logic clusters and interconnects. The regular
structure of FPGAs makes them suitable candidates for hardware level repair, using
methods similar to those used for defect tolerance in memory devices. In the event of
a fault, a part of the circuit can be mapped to another part with no change in function.
Hardware level repair has the advantage of being transparent to the configuration.
This makes repair a simple process, as the repair controller does not need any
knowledge of the placement and routing of the design. Another benefit is that the
timing performance of the repaired FPGA can be guaranteed, as any faulty element
will be replaced by a pre-determined alternative. Hardware level fault tolerance has
a drawback in that it can tolerate just a low number of faults for a given overhead and
there are likely to be certain patterns of faults which cannot be tolerated. The most
common basis for hardware level repair is column/row shifting. Multiplexers are
introduced at the ends of lines of cells, allowing a whole row or column to be bypassed
by shifting across to a spare line of cells at the end of the array. If the FPGA is bus-
based, the shifted cells can connect to the same lines of interconnect. For segmented
interconnect, bypass sections need to be added to skip the faulty row / column. This
method can be found today in some commercial devices. Adding more bypass
connections and multiplexers allows greater flexibility for tolerating multiple faults
and makes more efficient use of spare resources.
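Column shifting can be sketched as a remapping of logical columns onto physical columns that skips faulty ones, consuming the spares at the edge of the array. A minimal Python sketch of that mapping:

```python
def column_map(num_logical, faulty_cols, num_physical):
    """Map each logical column to a physical column, bypassing faulty
    ones via the shift mechanism. Returns None when the fault pattern
    exhausts the spare columns."""
    mapping, phys = [], 0
    for _ in range(num_logical):
        while phys in faulty_cols:
            phys += 1            # bypass multiplexers skip this column
        if phys >= num_physical:
            return None          # more faults than spares can cover
        mapping.append(phys)
        phys += 1
    return mapping

# 8 logical columns on 10 physical columns, with physical column 2 faulty
m = column_map(8, {2}, 10)
```

Because every logical column simply shifts toward the spares, the repair is transparent to the configuration, which is exactly the property claimed for hardware-level methods above.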
Configuration Level fault tolerance:
Configuration level repair exploits two key features in FPGAs; reconfiguration and the
availability of unused resources. Configuration level repair strategies can be divided
into three subclasses:
Alternative configurations:
A straightforward way of achieving fault tolerance is to pre-compile alternative
configurations. The FPGA is split into tiles, each with its own set of configurations
which share a common functionality. Fault tolerance is achieved by replacing a
configuration tile with an alternative in which the faulty resource is not used. This
method requires little run-time computation as the placement and routing is already
done. This strategy performs relatively poorly in terms of area efficiency and fault-pattern
coverage. It is dependent on there being a configuration available in which any given
resource is set aside as a spare. If only a small amount of spare resource is available
then a large number of configurations are needed to cover all possible faults.
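Selecting a precompiled alternative can be sketched as a lookup over per-tile configurations, each annotated with the set of resources it uses. Tile and resource names in this Python sketch are hypothetical:

```python
def pick_configuration(configs, faulty_resources):
    """Return the first precompiled configuration whose used-resource
    set avoids every known-faulty resource, or None when the fault
    pattern is not covered by any alternative."""
    for name, used in configs.items():
        if not (used & faulty_resources):
            return name
    return None

# two precompiled alternatives for one tile, each sparing a different CLB
tile_configs = {
    "cfg_a": {"clb0", "clb1", "clb2"},   # clb3 left as spare
    "cfg_b": {"clb0", "clb1", "clb3"},   # clb2 left as spare
}
```

The `None` case is the weakness noted above: full coverage requires, for every resource, some stored configuration that leaves it unused, which is what drives the configuration count up.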
Incremental mapping, placement and routing:
A simple method of tolerating faults in logic clusters exists if the cluster can be
reconfigured to work around the fault. Cluster reconfiguration is simple to evaluate and does not have
a significant effect on design timing, but there may be cases where reconfiguration is
not possible, for example if an output register is faulty. If there are spare clusters, then
these can be used to replace faulty ones. Cluster reconfiguration is carried out in
preference and faulty clusters can be reused by a different function if the fault will not
be manifest. It is also important to consider faults in interconnect. Incremental
mapping, placement and routing provide a high degree of flexibility for dealing with
random fault patterns, especially when cluster reconfiguration and pebble shifting
are used together. However, this comes at the cost of increased computational effort
for the repair which must be carried out in the field. Spare clusters and interconnect
are allocated at design-time in order to reduce the complexity of a repair in the field
and to provide guaranteed timing performance. This technique, however, reduces
flexibility and overhead efficiency. In [18] a more coarse-grained approach is taken.
Clusters are arranged into networks such that most faults can be repaired by
reconfiguring the network and leaving the wider placement unchanged.
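When a spare cluster must be substituted, one simple design-time heuristic is to pick the spare nearest the faulty cluster, keeping the incremental reroute short. A Python sketch with illustrative grid coordinates (Manhattan distance as the cost proxy):

```python
def nearest_spare(faulty, spares):
    """Pick the spare cluster closest to the faulty one, minimizing the
    wire-length disturbance of the incremental reroute."""
    fx, fy = faulty
    return min(spares, key=lambda s: abs(s[0] - fx) + abs(s[1] - fy))

# spares pre-allocated at the corners of an 8x8 array
spares = [(0, 7), (7, 7), (7, 0)]
repl = nearest_spare((6, 1), spares)
```

Choosing a nearby spare is what lets pre-allocated schemes bound the timing degradation of a repair, at the cost of the flexibility noted above.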
Evolutionary algorithms:
Reconfiguration makes FPGAs well suited to evolutionary algorithms. Configurations
compete for correctness and those which are faulty are mutated. Although this
approach allows a large degree of flexibility with the number and distribution of
faults that can be tolerated, the area and computational overhead required is very
large. There is also no guarantee of how long a solution will take to evolve or what its
timing performance will be.
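The evolutionary repair loop can be sketched as repeated mutation of a candidate configuration, keeping a mutant only when it scores at least as well against the desired behavior. This Python toy evolves a single 2-input LUT truth table; it illustrates the principle, not a full evolvable-hardware system:

```python
import random

def evolve_lut(target, generations=300, seed=0):
    """Hill-climbing 'evolution' of a 4-entry LUT truth table.
    Fitness = number of input patterns whose output matches target."""
    rng = random.Random(seed)
    cand = [rng.randint(0, 1) for _ in range(4)]
    fitness = sum(c == t for c, t in zip(cand, target))
    for _ in range(generations):
        if fitness == 4:
            break                      # fully correct configuration found
        i = rng.randrange(4)           # mutate one configuration bit
        trial = cand[:]
        trial[i] ^= 1
        f = sum(c == t for c, t in zip(trial, target))
        if f >= fitness:               # keep non-worse mutants
            cand, fitness = trial, f
    return cand, fitness
```

Even in this tiny search space the number of evaluations is unbounded in principle, which mirrors the convergence-time caveat above.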
System level Fault Tolerance:
System-level fault tolerance works at a higher level. When a design is highly modular, a fault
can be tolerated by the use of a spare functional block or by providing degraded performance.
Such methods are not considered in more detail here, as they are not limited in application to FPGAs.
Conclusion:
FPGAs are a very important computing resource for many different fields in the world today.
Their reconfigurability allows for incredible flexibility and reuse. But this benefit comes with
a cost. FPGAs are used in applications which demand a high amount of reliability and
availability. This is why many people over the years have researched the effects of faults
within FPGA systems, and different methods of detecting and diagnosing said faults. Once
the faults have been identified there are myriad ways to provide for fault tolerance.
Clearly, no single FT methodology is significantly better than the others. The best general
solution to FPGA FT is probably a combination of both DL and CL fault tolerance
methodologies. The most likely future advancement in fault tolerance will be in the area of
self-adaptation in the presence of faults. This will allow FPGAs to be fault tolerant no matter
the environment. As for detection and diagnosis the focus is always on improving speed,
coverage and overhead.
Appendix: Prototype for Fault Correction in
FPGA based systems.
Problem Statement: Fault Correction in an 8-Bit Kogge-Stone Adder
Solution: This is done by correcting the fault in each of the individual units of the
Kogge-stone adder.
Let us start by defining the Kogge-Stone adder. The Kogge-Stone adder, which is
classified as a parallel prefix adder, has a critical path on the order of O(log n)
(where n is the width of the adder in bits). The regularity of its structure makes it
suitable for VLSI designs as well as FPGA implementations.
The Kogge-Stone adder is a parallel prefix form of carry look-ahead adder. It generates the carry
signals in O(log n) time, and is widely considered one of the fastest adder designs. It is a
common design for high-performance adders in industry.
An example of a 4-bit Kogge–Stone adder is shown below. Each vertical stage produces a "propagate" and a "generate" bit, as shown. The culminating generate bits (the carries) are produced in the last stage (vertically), and these bits are XOR'd with the initial propagate after the input (the red boxes) to produce the sum bits. For example, the first (least-significant) sum bit is calculated by XORing the propagate in the farthest-right red box (a "1") with the carry-in (a "0"), producing a "1". The second bit is calculated by XORing the propagate in the second box from the right (a "0") with C0 (a "0"), producing a "0".
[Figure: Example of a 4-bit Kogge–Stone adder with zero carry-in.]
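The stage-by-stage behavior described above can be checked with a small software model. The following Python sketch is my own addition (the function name `kogge_stone_add` and the list-based encoding are assumptions, not part of the original design); it computes per-bit generate/propagate signals, runs the prefix stages at distances 1, 2, 4, ..., and forms the sums exactly as described:

```python
def kogge_stone_add(a, b, n=4):
    """Behavioral model of an n-bit Kogge-Stone adder with zero carry-in."""
    # Preprocessing: per-bit generate (a AND b) and propagate (a XOR b).
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]
    d = 1
    while d < n:
        # Prefix stage: combine each position with the one d places to its right.
        # G is rebuilt from the previous stage's G and P before P is rebuilt,
        # so both updates see the old values, as in the hardware.
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n)]
        d *= 2
    # After the last stage, G[i] is the carry out of bit i (carry-in = 0).
    carries = [0] + G[:-1]                      # carry into each bit position
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    cout = G[n - 1]
    return (cout << n) | s
```

Exhaustively comparing this model against ordinary integer addition for all 4-bit operand pairs confirms the sum-bit rule (propagate XOR incoming carry) given in the text.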
The Kogge-Stone adder is classified as a parallel prefix adder since the generate and propagate signals are precomputed. In a tree-based adder, carries are generated in a tree, and fast computation is obtained at the expense of increased area and power. The main advantage of this design is that the carry tree reduces the logic depth of the adder by essentially generating the carries in parallel. The parallel-prefix adder is more favorable in terms of speed due to the O(log2 n) delay through the carry path, compared to O(n) for the RCA. The Kogge-Stone adder is widely used in high-performance 32-bit, 64-bit, and 128-bit adders as it greatly reduces the critical path compared to the ripple carry adder.
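The speed claim can be made concrete by counting logic levels on the carry path. This back-of-the-envelope Python sketch is my own (it counts only prefix stages versus full-adder stages and ignores wire delay and fan-out):

```python
import math

def prefix_depth(n):
    """Carry-path logic levels in a Kogge-Stone adder: ceil(log2 n) prefix stages."""
    return math.ceil(math.log2(n))

def ripple_depth(n):
    """Carry-path logic levels in a ripple-carry adder: one stage per bit."""
    return n

for n in (8, 32, 64, 128):
    print(f"n={n}: Kogge-Stone depth {prefix_depth(n)}, ripple depth {ripple_depth(n)}")
```

For the 8-bit adder of this appendix, the carry path is 3 prefix stages deep versus 8 ripple stages, and the gap widens quickly at 32, 64, and 128 bits.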
The fundamental carry operator (fco), denoted ∘, works on the generate and propagate pairs as defined by,
(gL, pL) ∘ (gR, pR) = (gL + pL·gR, pL·pR)
where gL, pL are the left input generate and propagate pair and gR, pR are the right input generate and propagate pair to the cell. For example, in a 4-bit carry lookahead adder, the carry combination equation can be expressed as,
c4 = (g4, p4) ∘ [ (g3, p3) ∘ [ (g2, p2) ∘ (g1, p1) ] ]   (2.6)
   = (g4, p4) ∘ [ (g3, p3) ∘ (g2 + p2·g1, p2·p1) ]
   ...
   = g4 + p4·g3 + p4·p3·g2 + p4·p3·p2·g1
Since the fco obeys the associativity property, the expression can be reordered to yield parallel computations in a tree-based structure [7],
c4 = [ (g4, p4) ∘ (g3, p3) ] ∘ [ (g2, p2) ∘ (g1, p1) ]   (2.7)
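The step from (2.6) to (2.7) rests entirely on the associativity of the carry operator, and for 1-bit signals it can be verified exhaustively. The following Python model is my own sketch (the name `fco` simply mirrors the abbreviation used above):

```python
from itertools import product

def fco(left, right):
    """Fundamental carry operator: (gL, pL) o (gR, pR) = (gL + pL*gR, pL*pR)."""
    gl, pl = left
    gr, pr = right
    return (gl | (pl & gr), pl & pr)

# Check, for every assignment of the g/p bits, that the sequential grouping
# (2.6), the tree grouping (2.7), and the flattened sum-of-products form of c4
# all agree.
for g1, p1, g2, p2, g3, p3, g4, p4 in product((0, 1), repeat=8):
    seq = fco((g4, p4), fco((g3, p3), fco((g2, p2), (g1, p1))))    # (2.6)
    tree = fco(fco((g4, p4), (g3, p3)), fco((g2, p2), (g1, p1)))   # (2.7)
    flat = g4 | (p4 & g3) | (p4 & p3 & g2) | (p4 & p3 & p2 & g1)
    assert seq == tree and seq[0] == flat
```

All 256 cases pass, which is exactly the property that lets the carry tree evaluate the groupings in parallel rather than in sequence.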
VHDL CODE:
VHDL Code for GeneratePropagate block correction in 8-bit Kogge-Stone Fault Correcting Adder
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity GPblock is
port( a,b: in std_logic;
g,p: out std_logic);
end GPblock;
architecture Behavioral of GPblock is
begin
g <= a and b;
p <= a xor b;
end Behavioral;
VHDL Code for mux correction in 8-bit Kogge-Stone Fault Correcting
Adder
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity mux is
port (x,y: in std_logic_vector(1 downto 0);
z : out std_logic_vector(1 downto 0);
sel: in std_logic);
end mux;
architecture Behavioral of mux is
constant delay : time := 100 ns;
begin
mux_proc : process(x,y,sel)
variable temp : std_logic_vector(1 downto 0);
begin
case sel is
when '0'=> temp:=x;
when '1'=> temp:=y;
when others => temp :="XX";
end case;
z <= temp after delay;
end process mux_proc;
end Behavioral;
VHDL Code for sum correction in 8-bit Kogge-Stone Fault Correcting
Adder
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity sum is
port( p,c: in std_logic;
s : out std_logic);
end sum;
architecture Behavioral of sum is
begin
s <= (p xor c);
end Behavioral;
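The three VHDL units above (GP block, mux, sum) suggest a simple correction scheme: run duplicate copies of a unit and use the mux to switch to a spare when the primary is flagged faulty. The appendix does not give the top-level wiring, so the following Python sketch is my own illustration of that idea; the fault flag itself is assumed to come from a separate detection step (e.g. comparison against a checker copy), and all function names are hypothetical:

```python
def gp_block(a, b):
    """Generate/propagate unit, mirroring the GPblock VHDL entity."""
    return (a & b, a ^ b)

def faulty_gp_block(a, b):
    """A GP unit with an injected stuck-at-1 fault on the generate output."""
    return (1, a ^ b)

def mux(x, y, sel):
    """2-to-1 selector, mirroring the mux VHDL entity."""
    return y if sel else x

def corrected_gp(a, b, primary, spare, fault_flag):
    """Select the spare unit's (g, p) pair when the primary is flagged faulty."""
    return mux(primary(a, b), spare(a, b), fault_flag)

# With the fault flag raised, the spare's output masks the injected fault:
# for a=0, b=1 the faulty primary would emit g=1, but the corrected output
# is the fault-free pair (0, 1).
assert corrected_gp(0, 1, faulty_gp_block, gp_block, fault_flag=1) == (0, 1)
```

Applying the same duplicate-and-select pattern to each GP, carry-operator, and sum cell yields the per-unit fault correction the problem statement asks for.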
Delay calculations for the 8-bit Kogge-Stone adder
References:
1. Jason A. Cheatham, John M. Emmert, and Stan Baumgart, 2006. A Survey of Fault Tolerant Methodologies for FPGAs.
2. Jano Gebelein, Heiko Engel, and Udo Kebschull, 2010. FPGA Fault Tolerance in Radiation Susceptible Environments.
3. Khaled Elshafey and Jan Hlavička, 2002. Fault-Tolerant FPGA-Based Systems.
4. Daniel Fisher and Addison Floyd. Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs.
5. Wei-Je Huang and Edward J. McCluskey. Column-Based Precompiled Configuration Techniques for FPGA Fault Tolerance.
6. Edward Stott, Pete Sedcole, and Peter Y. K. Cheung, 2007. Fault Tolerant Methods for Reliability in FPGAs.
FPGA basics: https://www.google.com/search?biw=1536&bih=758&q=fpga+basics&revid=1721831489&sa=X&ei=LEw0VPK7LIuxyAT6loLICg&ved=0CG8Q1QIoBQ
7. Technical blogs: http://www.pe-ip.com/ by Marc Perron; http://billauer.co.il/blog/category/fpga/ by Eli Billauer; http://fpgablog.com/
8. http://www.dtic.mil/dtic/tr/fulltext/u2/a462520.pdf
9. http://www.ijetch.org/papers/424-C049.pdf
10. http://www.academia.edu/5178678/High_Speed_Fault_Injection_Tool_Implemented_With_Verilog_HDL_on_FPGA_for_Testing_Fault_Tolerance_Designs
Others:
Wikipedia
A textbook on the hardware description languages Verilog and VHDL by A. K. Pedroni.