Muravin Project


Design of Two Different 128-bit Adders

Project Report

By Vladislav Muravin

Concordia ID: 5505763

COEN6501: Digital Design & Synthesis

Offered by Professor Asim Al-Khalili

Concordia University

December 2004



Table of Contents

1 Introduction
1.1 Report Organization
1.2 Common Adder Structures
1.2.1 1-bit Full Adder
1.2.2 N-bit Ripple Carry Adder
1.2.3 Carry Skip Adder
1.2.4 Carry Select Adder
1.2.5 Carry Look Ahead Adder
1.2.6 Prefix Adders
1.2.7 Sklansky Prefix Adder
1.2.8 Kogge-Stone Prefix Adder
2 Design Flow & Implementation
2.1 Micro Architecture
2.1.1 Top Entity
2.1.2 Sub-Block Partitioning
2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)
2.1.2.2 Carry Generation Block
2.1.2.2.1 Carry Generation Block Sklansky Prefix Adder (cg_gen_sklansky)
2.1.2.2.2 Carry Generation Block Kogge-Stone Prefix Adder (cg_gen_kogge_stone)
2.1.2.3 Sum Bits Generation Block (sb_gen)
2.2 RTL Coding
2.3 Verification Plan
2.4 Synthesis, Place and Route
3 Results
3.1 Simulation Results
3.1.1 Initial Test Cases
3.1.2 General Test Case
3.2 Synthesis Results
3.2.1 Multiplexing I/O
3.2.1.1 Multiplexed Inputs
3.2.1.2 Multiplexed Outputs
3.2.1.3 Multiplexed Inputs and Outputs
3.2.2 Changing Target Device
4 Design Enhancement - Pipelining
5 Summary and Conclusions
6 References


Table of Figures

Figure 1: 1-bit Full Adder
Figure 2: N-bit Carry Propagate Adder
Figure 3: Carry Skip Concept
Figure 4: Carry Select Concept
Figure 5: Sklansky Prefix Tree
Figure 6: Kogge-Stone Prefix Tree
Figure 7: Design Flow
Figure 8: Top Level View
Figure 9: full_adder Sub-Block Partitioning
Figure 10: "Carry Generate" and "Carry Propagate" Block Implementation
Figure 11: Sum Bits Generation Block Implementation
Figure 12: Test Bench & Verification Plan
Figure 13: Initial Test Case Simulation Results
Figure 14: General Test Case - Full Zoom
Figure 15: General Test Case - Example 1
Figure 16: General Test Case - Example 2
Figure 17: Forward Registers Balancing (Pipelining)
Figure 18: Backward Registers Balancing (Pipelining)

Table 1: Signal Description
Table 2: Synthesis Results (No Placement and Routing): XC2V500-FG456-4 Device
Table 3: Synthesis Results: XC2V1000 FF896-4 Device
Table 4: Placement and Routing Results FF896-4 Device
Table 5: Placement and Routing Results of Pipelined Sklansky Adder
Table 6: Placement and Routing Results of Pipelined Kogge-Stone Adder



1 Introduction

The objective of this project is to design two different 128-bit adders by going through the full design cycle, from initial concept to structural RTL coding, simulation and synthesis for the Xilinx Virtex-2 FPGA family, device XC2V500.

1.1 Report Organization

The report is organized into six sections. Section 1 introduces common principles of adder designs and structures, briefly describing the Carry Skip, Carry Select and Carry Look-Ahead principles, with further elaboration on parallel-prefix adders, two of which, the Sklansky prefix adder and the Kogge-Stone prefix adder, are implemented in this project. Section 2 describes the design flow and the micro architecture of the design, including the verification plan. Section 3 presents the simulation and synthesis results, followed by section 4, which describes the pipelining enhancement. Finally, sections 5 and 6 close the report with the conclusions and references, respectively.

1.2 Common Adder Structures

1.2.1 1-bit Full Adder

A 1-bit Full Adder is shown in Figure 1. The equations describing the outputs are:

$$S = A \oplus B \oplus C_{in}$$

$$C_{out} = A \cdot B + (A \oplus B) \cdot C_{in}$$

[Block diagram: full adder with inputs A, B, Cin and outputs S, Cout]

Figure 1: 1-bit Full Adder
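The full-adder equations can be checked exhaustively against ordinary integer addition. The following is a minimal Python sketch, not part of the original VHDL design; the function name `full_adder` is an illustrative assumption.

```python
def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (s, c_out) for one-bit inputs a, b and c_in."""
    s = a ^ b ^ c_in                      # S = A xor B xor Cin
    c_out = (a & b) | ((a ^ b) & c_in)    # Cout = A.B + (A xor B).Cin
    return s, c_out


# Exhaustive check of all eight input combinations against integer addition.
for a in (0, 1):
    for b in (0, 1):
        for c_in in (0, 1):
            s, c_out = full_adder(a, b, c_in)
            assert 2 * c_out + s == a + b + c_in
```

The check passes silently when the two equations agree with `a + b + c_in` for every input combination.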

1.2.2 N-bit Ripple Carry Adder

An iterative approach to building an N-bit adder leads to cascading N 1-bit full adders. This concept is illustrated in Figure 2. As N increases, the critical path, which is the carry path (the $C_{out}$ path), grows linearly with N.

[Block diagram: chain of full adders with inputs $A_{n-1}, B_{n-1}, \dots, A_i, B_i, \dots, A_0, B_0$, sum outputs $S_{n-1}, \dots, S_i, \dots, S_0$, intermediate carries $C_i$, $C_0$ and output carry $C_{out}$]

Figure 2: N-bit Carry Propagate Adder
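The cascading of 1-bit full adders can be sketched as a loop in which each stage waits for the carry of the previous one. This is an illustrative Python model under the assumption of unsigned operands; `ripple_carry_add` is a hypothetical name, not an entity of the design.

```python
def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Add two n-bit unsigned numbers bit by bit; return (sum, carry_out)."""
    carry = c_in
    result = 0
    for i in range(n):                     # stage i needs the carry of stage i-1
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | ((ai ^ bi) & carry)
    return result, carry


# Worst case for the carry path: the carry ripples through all 128 stages.
assert ripple_carry_add((1 << 128) - 1, 1, 128) == (0, 1)
```

The loop runs n sequential iterations, which mirrors the linear growth of the carry path.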

1.2.3 Carry Skip Adder

Let $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, where $p$ denotes "propagate" and $g$ denotes "generate".

The basic carry-skip (or carry-bypass) adder divides an N-bit adder into $N/M$ blocks, where each block contains M bits. This is shown in Figure 3. Within each block, a simple M-bit full adder structure is realized (linear time Carry Skip Adder), where the "propagate" and "generate" signals for the respective input bits are used to form the output sum bits and the output carries. The multiplexer at the end of a block allows the input carry to bypass the block when all of the "propagate" signals in that block are asserted. After the carry generate delay of the first block, the bypassing of carries in subsequent blocks determines the carry-propagate delay. If any of the "propagate" signals in a block is unasserted, the carry out of that block does not depend on the input carries from the previous blocks or on their multiplexers. The critical path delay is

$$t_{PD} = t_{setup} + M \cdot t_{FA} + \left(\frac{N}{M} - 1\right) t_{MUX} + (K - 1) \cdot t_{FA} + t_{SUM}$$

Section 1.2.4 explains how better performance can be achieved by varying the block size.

[Block diagram: M-bit carry propagation blocks on the operand slices, each followed by carry select logic, chained from Cin to Cout]

Figure 3: Carry Skip Concept

1.2.4 Carry Select Adder

Despite the larger amount of hardware it needs, this type of adder has a very interesting design concept. The linear Carry Select Adder (CSA) is divided into $N/M$ blocks, where each block contains M bits, just as in the Carry Skip Adder (CSkA). In each block, the hardware is replicated in order to calculate the sum and carry-out bits for both possible carry-ins. Figure 4 illustrates this concept. The multiplexer at the end chooses between the carry-outs based on the carry-in from the previous stage. In this implementation, the critical path delay comprises the carry-generate delay of the first block, followed by the mux delays of the successive blocks. This results in a linear time Carry Select Adder.

Variable-sized blocks can yield higher performance [5]. For a carry-select adder, the block sizes can increase so that the delay is minimized by having all the inputs arrive at each multiplexer at the same time. For example, if the multiplexer delay is similar to the delay of a full adder, then the minimal carry delay can be achieved by adding 1 bit in the first block, 2 in the second, and so on. Linearly increasing block sizes result in a number of block stages proportional to the square root of N for the carry propagate delay, and hence a square-root time CSA. A similar approach can yield a square-root time CSkA.

[Block diagram: M-bit adder blocks on the operand slices $A_{M-1} \dots A_0$, $B_{M-1} \dots B_0$, $A_{2M-1} \dots A_M$, $B_{2M-1} \dots B_M$, up to $A_{N-1} \dots A_{N-M}$, $B_{N-1} \dots B_{N-M}$, with duplicated adders in the upper blocks, chained from Cin to Cout]

Figure 4: Carry Select Concept
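To make the square-root argument concrete, the sketch below lists linearly increasing block sizes (1, 2, 3, ...) for a given operand width. It is an illustration only, assuming that the last block is simply truncated to the remaining bits; `linear_block_sizes` is a hypothetical helper.

```python
def linear_block_sizes(n: int) -> list[int]:
    """Block sizes 1, 2, 3, ... covering n bits; the last block is truncated."""
    sizes = []
    size = 1
    remaining = n
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size += 1
    return sizes


sizes = linear_block_sizes(128)
print(len(sizes), sizes[-1])   # prints "16 8"
```

For 128 bits this gives 16 block stages, which is on the order of the square root of 2N, whereas equal 4-bit blocks would need 32 stages.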

1.2.5 Carry Look Ahead Adder

The Ripple Carry Adder implementation imposes sequential generation of the carries, making the output carry of each stage dependent on the input carry to that stage. The Carry Look Ahead implementation computes each carry-out directly from the operand bits and the initial carry, so it does not depend on the previous carries.

Let $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, where $p$ denotes "propagate" and $g$ denotes "generate".

Then $s_i = p_i \oplus c_i$ and $c_{i+1} = g_i + p_i c_i$.

Expanding the above equations for an N-bit adder gives:

$$c_1 = g_0 + p_0 c_0$$

$$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$$

$$c_n = g_{n-1} + p_{n-1} g_{n-2} + \dots + p_{n-1} p_{n-2} \cdots p_1 g_0 + p_{n-1} p_{n-2} \cdots p_0 c_0$$

It can be easily seen that, since a carry does not depend on the previous carries, the delay is reduced, as the carry logic can be implemented as a sum of products. Consequently, an increase in speed can be achieved. Unfortunately, because CMOS delay increases non-linearly as the fan-in grows, the Carry Look Ahead implementation is used in a modular way, cascading several 4-bit CLAs.
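The expanded carry equations can be evaluated directly as a sum of products, without reusing any previously computed carry. The Python sketch below is illustrative; `lookahead_carries` is a hypothetical name and the model ignores gate fan-in limits.

```python
def lookahead_carries(a: int, b: int, n: int, c0: int = 0) -> list[int]:
    """Return [c_0, c_1, ..., c_n], each computed as a flat sum of products."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    carries = [c0]
    for i in range(1, n + 1):
        c = 0
        prod = 1                      # running product p_{i-1} ... p_{k+1}
        for k in range(i - 1, -1, -1):
            c |= prod & g[k]          # term p_{i-1} ... p_{k+1} g_k
            prod &= p[k]
        c |= prod & c0                # term p_{i-1} ... p_0 c_0
        carries.append(c)
    return carries


# 4-bit example: 1011 + 0110 = 1 0001
print(lookahead_carries(0b1011, 0b0110, 4))   # prints [0, 0, 1, 1, 1]
```

The number of product terms for $c_i$ grows with $i$, which is the fan-in growth that motivates cascading 4-bit CLAs.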

1.2.6 Prefix Adders

In very simple words, a parallel prefix algorithm takes n inputs $x_{n-1}, x_{n-2}, \dots, x_0$ and produces in parallel n outputs $x_{n-1} \circ x_{n-2} \circ \dots \circ x_0, \dots, x_1 \circ x_0, x_0$, where $\circ$ is an associative operator. The analogy between carry computation and the prefix algorithm is that the carry computation at a certain stage $i$ depends on all inputs of the stages $i-1$ to $0$.

Let $a_{n-1}, a_{n-2}, \dots, a_0$ and $b_{n-1}, b_{n-2}, \dots, b_0$ be the n-bit binary numbers to be added. Let $c_0$ designate the input carry and $c_n$ designate the output carry. For each bit, "propagate" ($p_i$) and "generate" ($g_i$) signals are defined, as described in the previous section. Furthermore, for parallelizing the computation of a carry, two additional terms are defined: Group Carry Generate ($G_{i:j}$) and Group Carry Propagate ($P_{i:j}$).

For each group of bits, the Group Carry Generate signal $G_{i:j}$ means that a carry is generated somewhere between stages $i$ and $j$ and is propagated from that location to stage $i$. This implies $c_{i+1} = 1$ and, in particular, if $j = 0$ (with $c_0 = 0$), then $G_{i:0} = c_{i+1}$.

For each group of bits, the Group Carry Propagate signal $P_{i:j}$ means that the carry is propagated from stage $j$ to stage $i$, i.e. $c_{i+1} = c_j$.

So the formal definition of $G_{i:j}$ and $P_{i:j}$ is expressed using the following relationship:

$$(G_{i:j}, P_{i:j}) = (g_i, p_i) \quad \text{if } i = j$$

$$(G_{i:j}, P_{i:j}) = (G_{i:k}, P_{i:k}) \circ (G_{k-1:j}, P_{k-1:j}) \quad \text{if } i > j$$

where $i \ge k > j$ and the "$\circ$" operator, introduced by Brent and Kung [1], is defined as $(G', P') \circ (G'', P'') = (G' + P' G'', P' P'')$.

Finally, once the final carries $G_{i:0}$ for all $i < n$ have been computed, the sum bits are calculated as:

$$s_i = \begin{cases} p_i \oplus G_{i-1:0}, & n > i \ge 1 \\ p_i, & i = 0 \end{cases}$$

The traditional carry-ripple adder (CRA) can be regarded as a serial prefix adder under the above definitions.
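The group generate and propagate definitions, and the remark about the serial prefix adder, can be sketched in a few lines of Python. This is a behavioural illustration assuming a zero input carry; `combine` and `serial_prefix_add` are hypothetical names.

```python
from itertools import product


def combine(hi: tuple[int, int], lo: tuple[int, int]) -> tuple[int, int]:
    """Prefix operator: (G, P) of a group from its upper and lower parts."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def serial_prefix_add(a: int, b: int, n: int) -> tuple[int, int]:
    """Serial prefix (ripple-like) adder; returns (sum, carry_out)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    group = []                              # group[i] = (G_{i:0}, P_{i:0})
    for i in range(n):
        group.append(gp[i] if i == 0 else combine(gp[i], group[i - 1]))
    s = 0
    for i in range(n):
        carry_in = group[i - 1][0] if i >= 1 else 0
        s |= (gp[i][1] ^ carry_in) << i     # s_i = p_i xor G_{i-1:0}
    return s, group[n - 1][0]


# The operator is associative, which is what allows tree-shaped evaluation.
pairs = list(product((0, 1), repeat=2))
assert all(combine(combine(x, y), z) == combine(x, combine(y, z))
           for x in pairs for y in pairs for z in pairs)
```

Because the operator is associative, the same groups can be combined in any bracketing, which is the freedom that the Sklansky and Kogge-Stone trees exploit.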



1.2.7 Sklansky Prefix Adder

The Sklansky prefix tree is shown in Figure 5 for a 16-bit adder. Its structure is the simplest among the prefix adders. It is used for conditional-sum addition [2]. The fan-out of such an adder grows exponentially from input to output along the critical path and reaches $n/2$. This leads to a large delay as the adder operand width increases. Recursive division of the blocks can construct a full adder using such a tree for the implementation. The number of "o" cells required for the implementation is $\frac{n}{2} \log_2 n$ and the delay is $\log_2 n$, where $n$ is the adder's width. The detailed implementation of the "o" cell is described in 2.1.2.2.

[Prefix tree diagram over bit positions 15 down to 0]

Figure 5: Sklansky Prefix Tree

1.2.8 Kogge-Stone Prefix Adder

The Kogge-Stone structure has a more optimal implementation than the Sklansky structure, as its fan-out is greatly reduced to 2 at the expense of a larger number of "o" (circle) cells. It is obtained by copying the structure of the most significant bit position [3]. Figure 6 shows this prefix tree for 16-bit operands.

Just as in 1.2.7, recursive division of the blocks can construct a full adder using such a tree for the implementation. The number of "o" cells required for the implementation is $n \log_2 n - n + 1$ and the delay is $\log_2 n$, where $n$ is the adder's width. It is expected that the Kogge-Stone adder should consume more resources than the Sklansky adder. For $n = 128$, the delay is 7 levels.

[Prefix tree diagram over bit positions 15 down to 0]

Figure 6: Kogge-Stone Prefix Tree

2 Design Flow & Implementation

Figure 7 illustrates the design flow for the implementation of the prefix adders.

[Flow diagram: Design Specification, Macro Architecture, VHDL RTL Coding at Structural Level (Emacs VHDL mode), Simulation (ModelSim 6.0 SE), Synthesis and Place and Route (Xilinx ISE 6.3 SP3), Compare Results; supported by the Verification Plan, Test Case Specification, Test Bench with PRBS Generator, Analyze Results and Reports]

Figure 7: Design Flow

2.1 Micro Architecture

2.1.1 Top Entity

Figure 8 illustrates the top-level view. The top entities are named full_adder_sklansky and full_adder_kogge_stone, respectively, and have the ports listed in Table 1.

[Block diagram: full_adder_sklansky or full_adder_kogge_stone with inputs operand1 (128 bits), operand2 (128 bits), sys_clk and reset_n, and outputs result (128 bits) and carry_out]

Figure 8: Top Level View

| Signal Name | Width [bits] | Direction | Comments |
|---|---|---|---|
| operand1 | 128 | input | Number #1 to be added |
| operand2 | 128 | input | Number #2 to be added |
| sys_clk | 1 | input | System clock |
| reset_n | 1 | input | System reset (active low) |
| result | 128 | output | Result of an addition |
| carry_out | 1 | output | Output carry resulting from an addition |

Table 1: Signal Description

2.1.2 Sub-Block Partitioning

The top-level block is further partitioned into three sub-blocks, as shown in Figure 9. The choices of block partitioning are numerous. The design is partitioned into three sub-blocks because, with this partitioning, the two adder designs differ in only one sub-block, the Carry Generation Block (cg_gen). Consequently, two different sub-blocks are designed: cg_gen_sklansky and cg_gen_kogge_stone.

[Block diagram: pg_gen ("Carry Propagate" & "Carry Generate" Block) takes operand1[127..0] and operand2[127..0] and produces p[127..0] and g[127..0](0); cg_gen_sklansky or cg_gen_kogge_stone (2-D Carry Generation Block) produces g[127..0](M-1); sb_gen (Sum Bits Generation Block) produces s[127..0] and carry_out]

Figure 9: full_adder sub-block partitioning

The subsequent sections elaborate on each one of the sub-blocks.



2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)

This sub-block calculates "carry propagate" $p[i](0)$ and "carry generate" $g[i](0)$, which are calculated bitwise from operand1 and operand2, as defined in 1.2.5, namely:

$$p[i](0) = \text{operand1}[i] \oplus \text{operand2}[i]$$

$$g[i](0) = \text{operand1}[i] \cdot \text{operand2}[i]$$

The implementation is shown in Figure 10. This block consumes 128 2-input AND gates and 128 2-input XOR gates.

[Gate-level diagram: for each bit i from 0 to 127, operand1[i] and operand2[i] drive an AND gate producing g[i] and an XOR gate producing p[i]]

Figure 10: "Carry Generate" and "Carry Propagate" Block Implementation

2.1.2.2 Carry Generation Block

The signals $p[i](0)$ and $g[i](0)$ generated in the Precondition Block (pg_gen) are used within the Carry Generation Block to calculate the $g[i](M-1)$ signals, which can be represented as a two-dimensional carry generate structure. The following sections describe the implementation of the Carry Generation Block for each of the chosen designs.

2.1.2.2.1 Carry Generation Block Sklansky Prefix Adder (cg_gen_sklansky)

Following the Sklansky prefix tree (presented in 1.2.7), the following observations are made (assuming a two-dimensional structure of j rows by i columns):

- In column $i$, cells occupy the nodes whose row coordinates $j$ correspond to a "1" in the binary representation of $i$, i.e. they follow directly from the binary encoding of the index $i$. A coordinate corresponding to a "0" in the binary representation of $i$ simply propagates $p[i](j)$ and $g[i](j)$.
- All "o" (circle) cells are of GP type, except for those situated on the bottom border, $\log_2 i < j$, which are of G type.

The output of a G cell is defined as follows:

$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$$

The outputs of a GP cell are defined as follows:

$$p[i](j) = p[i](j-1) \cdot p[i - (i \bmod 2^{j-1}) - 1](j-1)$$

$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$$

Following the prefix algorithm description, with $n = 128$ the implementation consumes 448 "o" cells, namely 448 2-input OR gates and the same number of 2-input AND gates. The delay is 7 levels and the fan-out is 64.
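The cell placement rule and the partner index $i - (i \bmod 2^{j-1}) - 1$ can be exercised with a small behavioural model. The Python sketch below is not the structural VHDL of cg_gen_sklansky; it assumes a zero input carry, treats every cell as a GP cell, and uses the hypothetical name `sklansky_carries`.

```python
def sklansky_carries(a: int, b: int, n: int) -> tuple[list[int], int]:
    """Return (g, cells): g[i] is the carry out of bit i, cells the "o" cell count."""
    levels = n.bit_length() - 1            # log2(n), n assumed to be a power of two
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    cells = 0
    for j in range(1, levels + 1):
        g_prev, p_prev = g[:], p[:]
        for i in range(n):
            if (i >> (j - 1)) & 1:         # bit j-1 of i is "1": a cell is placed
                k = i - (i % (1 << (j - 1))) - 1
                g[i] = g_prev[i] | (p_prev[i] & g_prev[k])
                p[i] = p_prev[i] & p_prev[k]
                cells += 1
            # otherwise p and g are simply passed to the next row
    return g, cells


g, cells = sklansky_carries((1 << 128) - 1, 1, 128)
print(cells, g[127])   # prints "448 1"
```

The cell count of 448 matches $\frac{n}{2} \log_2 n$ for $n = 128$, and `g[127]` corresponds to carry_out.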

2.1.2.2.2 Carry Generation Block Kogge-Stone Prefix Adder (cg_gen_kogge_stone)

Following the Kogge-Stone prefix tree (presented in 1.2.8) and assuming a two-dimensional structure of j rows by i columns, the nodes in the upper-left are populated with "o" (circle) cells, while the rest of the two-dimensional array is empty, i.e. the "o" (circle) cells are placed in the nodes whose coordinates satisfy the following relationship:

$$1 \le j \le M - 1 \quad \text{and} \quad 2^{j-1} \le i \le N - 1$$

The outputs of the placed cells are:

$$p[i](j) = p[i](j-1) \cdot p[i - 2^{j-1}](j-1)$$

$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - 2^{j-1}](j-1)$$

Following the prefix algorithm description, with $n = 128$ the implementation consumes 769 "o" cells, hence occupying 769 2-input OR gates and the same number of 2-input AND gates.
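The Kogge-Stone cell equations can be checked with a behavioural model in which every column $i \ge 2^{j-1}$ of row $j$ combines with column $i - 2^{j-1}$. The Python sketch below is an illustration rather than the structural VHDL of cg_gen_kogge_stone; it assumes a zero input carry, and `kogge_stone_carries` is a hypothetical name.

```python
def kogge_stone_carries(a: int, b: int, n: int) -> tuple[list[int], int]:
    """Return (g, cells): g[i] is the carry out of bit i, cells the "o" cell count."""
    levels = n.bit_length() - 1            # log2(n), n assumed to be a power of two
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    cells = 0
    for j in range(1, levels + 1):
        d = 1 << (j - 1)                   # distance to the partner column
        g_prev, p_prev = g[:], p[:]
        for i in range(d, n):              # cells only where i - 2^(j-1) >= 0
            g[i] = g_prev[i] | (p_prev[i] & g_prev[i - d])
            p[i] = p_prev[i] & p_prev[i - d]
            cells += 1
    return g, cells


g, cells = kogge_stone_carries((1 << 128) - 1, 1, 128)
print(cells, g[127])   # prints "769 1"
```

The count of 769 agrees with $n \log_2 n - n + 1$ for $n = 128$; each cell drives at most two cells in the next row, which is where the low fan-out comes from.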



2.1.2.3 Sum Bits Generation Block (sb_gen)

The sum bits are produced in the Sum Bits Generation Block by XORing the "carry propagate" signals $p[i](0)$, generated in the Precondition Block, with the "carry generate" bits $g[i-1](M-1)$; bit 0 is XORed with carry_in. Figure 11 illustrates the implementation, which consumes 128 2-input XOR gates.

[Gate-level diagram: s[i] = p[i] XOR g[i-1](M-1) for i = 1 to 127, and s[0] = p[0] XOR carry_in]

Figure 11: Sum Bits Generation Block Implementation
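The sum-bit stage reduces to one XOR per bit once the final carries are known. The Python sketch below illustrates this on a 4-bit example; `sum_bits` is a hypothetical name, and the carry vector is assumed to have been computed for the same carry_in that is passed to the function.

```python
def sum_bits(p0: list[int], carries: list[int], carry_in: int = 0) -> tuple[list[int], int]:
    """p0[i] is p[i](0); carries[i] is g[i](M-1), the carry out of bit i."""
    n = len(p0)
    s = [p0[0] ^ carry_in]                 # s[0] = p[0] xor carry_in
    for i in range(1, n):
        s.append(p0[i] ^ carries[i - 1])   # s[i] = p[i] xor g[i-1](M-1)
    return s, carries[n - 1]               # carry_out is the last carry


# 4-bit example, 0111 + 0001 = 1000 (lists are LSB first)
p0 = [0, 1, 1, 0]          # 0111 xor 0001
carries = [1, 1, 1, 0]     # carry out of bits 0..3
s, carry_out = sum_bits(p0, carries)
print(s, carry_out)        # prints "[0, 0, 0, 1] 0"
```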

2.2 RTL Coding

RTL coding is done in VHDL at the structural level. The basic cells are a 2-input AND gate, a 2-input OR gate, a 2-input XOR gate and a D-type positive-edge-triggered flip-flop. The text editor used is emacs version 20.7 with vhdl mode, since it has many templates for aligning VHDL code so that it is easy to read. Each file has a header at the top explaining the entity name and its logical function.

2.3 Verification Plan

In general, describing the same design functionality (especially for a large and complex design) in a high-level language, such as C/C++, or using verification tools, such as Verisity Specman, is the way to verify the design in many scenarios with many possible input combinations.

For the verification of the two full adders, the following is proposed (Figure 12).

A test bench, written in behavioral Verilog, instantiates both designs. Two 128-bit numbers are generated using a dedicated LFSR (Linear Feedback Shift Register) [4], which generates a pseudo-random bit stream.

Each clock cycle, the values of the two 128-bit numbers change in a pseudo-random way. These values are summed using a '+' operation within the test bench and are also applied as inputs to both adders. The resulting output sum and carry of each adder are compared with the result generated by the '+' addition within the test bench.

A successful test case (test passed) is defined as a match between the result of the test bench and the result of each adder.

[Block diagram: within test_bench, 128-bit PRBS Generator 1 and 128-bit PRBS Generator 2 drive operand1[127:0] and operand2[127:0] of both full_adder_sklansky and full_adder_kogge_stone; the result[127:0] and carry_out of each adder are compared with operand1+operand2, producing match_sklansky and match_kogge_stone, which are written to the test_bench results file]

Figure 12: Test Bench & Verification Plan
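The verification idea (pseudo-random operands, a behavioural '+' reference and a per-cycle match flag) can be sketched in Python. This is not the Verilog test bench of the project: the 16-bit LFSR with taps 16, 14, 13, 11 is an assumed stand-in for the 128-bit PRBS generators, the seeds are arbitrary, and `adder_model` is a hypothetical bitwise model standing in for either adder.

```python
WIDTH = 128


def lfsr16_step(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def prbs_word(state: int) -> tuple[int, int]:
    """Collect WIDTH output bits into one operand; return (word, next_state)."""
    word = 0
    for _ in range(WIDTH):
        word = (word << 1) | (state & 1)
        state = lfsr16_step(state)
    return word, state


def adder_model(a: int, b: int) -> tuple[int, int]:
    """Stand-in for a device under test: bitwise ripple-carry addition."""
    carry, result = 0, 0
    for i in range(WIDTH):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | ((ai ^ bi) & carry)
    return result, carry


def run_test(cycles: int, seed1: int = 0xACE1, seed2: int = 0x1D2C) -> int:
    """Return the number of cycles in which the adder disagrees with '+'."""
    s1, s2 = seed1, seed2
    mismatches = 0
    for _ in range(cycles):
        op1, s1 = prbs_word(s1)
        op2, s2 = prbs_word(s2)
        reference = op1 + op2              # behavioural '+' in the test bench
        result, carry_out = adder_model(op1, op2)
        if ((carry_out << WIDTH) | result) != reference:
            mismatches += 1
    return mismatches
```

Calling `run_test(1000)` returns the mismatch count, so a passing run corresponds to a return value of 0. The seeds must be non-zero, since an all-zero LFSR state never changes.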

2.4 Synthesis, Place and Route

Synthesis, placement and routing of the design are done using Xilinx ISE 6.3i software with the latest service pack, SP3. The constraints are set for the best timing by selecting the optimization criterion "speed" with the maximum effort. More details on the results, as well as the problems encountered, are given in section 3.2.

3 Results



3.1 Simulation Results

3.1.1 Initial Test Cases

The initial test cases are defined as the sums of the following 128-bit numbers.

The very first case verifies the sum of the following numbers:

- 64 zeros followed by 64 ones.
- 64 ones followed by 64 zeros.

The next case is:

- 32 repetitions of 0xA.
- 32 repetitions of 0x5.

In this fashion, possible bit swapping or incorrect index generation is tested. Figure 13 illustrates the simulation results for the initial test cases.

operand1 and operand2 are the two 128-bit numbers to be added. result and carry_out are the outputs of each of the adders, marked by the appropriate divider (Sklansky Adder and Kogge-Stone Adder, respectively).

Figure 13: Initial Test Case Simulation Results
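The expected outputs for the two initial test cases can be worked out with plain integer arithmetic. The Python sketch below is a reference calculation only; `expected_sum` is a hypothetical helper and does not model either adder.

```python
MASK128 = (1 << 128) - 1


def expected_sum(op1: int, op2: int) -> tuple[int, int]:
    """Reference 128-bit result and carry_out for two operands."""
    total = op1 + op2
    return total & MASK128, total >> 128


cases = [
    ((1 << 64) - 1, ((1 << 64) - 1) << 64),   # 64 zeros + 64 ones, 64 ones + 64 zeros
    (int("A" * 32, 16), int("5" * 32, 16)),   # 32 x 0xA, 32 x 0x5
]
for op1, op2 in cases:
    result, carry_out = expected_sum(op1, op2)
    print(f"{result:032X} carry_out={carry_out}")
```

Both cases print a result of 32 hexadecimal F digits with carry_out=0, so any swapped or misindexed bit shows up as a zero in the result.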

3.1.2 General Test Case

In the general test case, the data is generated in a pseudo-random way, as described in section 2.3. Three snapshots of the simulation results are given in the following figures.

Figure 14 illustrates the entire simulation. The lowest divider separates the test bench signals. operand1_prbs and operand2_prbs are the 128-bit PRBS data, which is applied to the adders. operand1 and operand2 are the input numbers; result and carry_out are the outputs of the adder circuits, marked by the corresponding divider (Sklansky Adder and Kogge-Stone Adder, respectively). Two more very important test bench signals are result_match_sklansky and result_match_kogge_stone, which are updated each clock cycle depending on whether there is a match between the test bench result and the respective result of the Sklansky adder and the Kogge-Stone adder.

Figure 15 and Figure 16 give two "zoom-in" examples of the same simulation.



Figure 14: General Test Case - Full Zoom



Figure 15: General Test Case - Example 1

Figure 16: General Test Case - Example 2



3.2 Synthesis Results

Both designs were successfully synthesized for the Virtex-2 device XC2V500. The synthesis results are summarized in Table 2. The Kogge-Stone adder consumes more resources than the Sklansky adder, as expected.

Results Explanation (Table 2): The inputs and outputs of the design were sampled in order to achieve a more accurate delay estimation, assuming that the inputs and the outputs of the design are registered. Furthermore, in the placement and routing stage, a specific option that forces the flip-flops to be packed within the I/O buffers is selected, so that the logic delay represents a true estimation of each adder's processing delay in this FPGA implementation.

However, because the maximum number of available user I/O pins for this device is 264 (package FG456), while the ports in Table 1 total 128 + 128 + 128 + 3 = 387 bits, further placement and routing of the design and, hence, a true estimation of its logic delay are not possible. Consequently, there are two alternatives. One alternative is multiplexing the I/Os in order to fit the design into the XC2V500 device. The other alternative is to select a larger device, the XC2V1000. Both alternatives are described in the following subsections.

Table 2: Synthesis Results (No placement and routing): XC2V500-FG456-4 device

| Design | LUTs Usage | 1-bit Registers Usage | Total Slices Usage | Maximum Frequency |
|---|---|---|---|---|
| Sklansky Adder | 829 (13%) | 385 (6%) | 453 (14%) | 85.6 MHz |
| Kogge-Stone Adder | 1449 (23%) | 385 (6%) | 751 (24%) | 100.5 MHz |

3.2.1 Multiplexing I/O

This alternative requires completely redesigning the interface and changing the overall architecture of the design. Loading the numbers or outputting the result in a multiplexed way each has advantages and disadvantages, which are summarized below. In addition, handshaking signals, which designate the start of loading and the completion of the addition, are required.

3.2.1.1 Multiplexed Inputs

In this case, the design latency (overall processing time) increases, since the whole input numbers cannot be acquired at once. However, two major advantages could be achieved. First, the logic required for the addition could be reduced, since the logic performing the addition cannot process more bits than are present on the interface in the same cycle. Consequently, the addition could be performed in a multiplexed fashion, especially if the input numbers are loaded least significant part first. Second, the overall clock speed of the design is expected to increase, as the complexity and the number of combinational logic levels decrease.
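The multiplexed addition described above can be sketched as a narrow adder reused over several cycles, with the carry held between cycles. The Python model below is an illustration under assumed parameters (a 32-bit interface, least significant word first); `word_serial_add` is a hypothetical name and no such block exists in the implemented design.

```python
def word_serial_add(a: int, b: int, total_bits: int = 128,
                    word_bits: int = 32) -> tuple[int, int]:
    """Add two operands one word per cycle, least significant word first."""
    mask = (1 << word_bits) - 1
    carry = 0                               # carry register between cycles
    result = 0
    for cycle in range(total_bits // word_bits):
        shift = cycle * word_bits
        wa = (a >> shift) & mask            # word present on the interface
        wb = (b >> shift) & mask
        partial = wa + wb + carry           # narrow adder reused each cycle
        result |= (partial & mask) << shift
        carry = partial >> word_bits
    return result, carry


print(word_serial_add((1 << 128) - 1, 1))   # prints "(0, 1)"
```

With these assumed parameters the addition takes four cycles instead of one, which is the latency cost mentioned above, while the adder itself is only 32 bits wide.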



4 Design Enhancement - Pipelining

Pipelining is introduced in order to improve the design speed. There are two ways of applying pipelining. One, manual, is to locate the point on the critical path whose arrival time is exactly half the total delay of the critical path (or one third, if two pipeline stages are inserted, and so on) and insert a pipeline register there. The other alternative, automatic pipelining, is described below.

The location of the pipeline registers is chosen automatically by the Xilinx synthesis tool. In the design, N pipeline stages are added to the inputs, the outputs, or both the inputs and outputs of the design, and the software optimizes the location of the pipeline registers according to the specified timing requirements and synthesis effort by moving them forward and backward. This is referred to as "forward/backward register balancing" in Xilinx ISE [6] and as "retiming" in Synplicity Synplify Pro 7.xx [7], and it is illustrated in Figure 17 and Figure 18. The software automatically determines Td1 and Td2 according to the given timing constraints and synthesis effort.

[Diagram: a logic block with delay Td between pipeline stages clocked by sys_clk is split by a moved pipeline stage into two parts with delays Td1 and Td2, where Td = Td1 + Td2]

Figure 17: Forward Registers Balancing (Pipelining)

[Diagram: a logic block with delay Td between pipeline stages clocked by sys_clk is split by a moved pipeline stage into two parts with delays Td1 and Td2, where Td = Td1 + Td2]

Figure 18: Backward Registers Balancing (Pipelining)
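The manual approach, finding the point whose arrival time is closest to half of the total delay, can be sketched as a search over candidate cut points. The Python example below uses made-up per-level delays and ignores register overhead and routing; `best_cut` is a hypothetical helper and does not reflect how the Xilinx tool balances registers.

```python
def best_cut(level_delays: list[float]) -> tuple[int, float]:
    """Return (k, period): cut before level k so the slower stage is minimal."""
    total = sum(level_delays)
    best_k, best_period = 0, total          # k = 0 means no cut at all
    arrival = 0.0
    for k in range(1, len(level_delays)):
        arrival += level_delays[k - 1]      # Td1: delay of levels 0 .. k-1
        period = max(arrival, total - arrival)   # the larger of Td1 and Td2
        if period < best_period:
            best_k, best_period = k, period
    return best_k, best_period


# Seven equal logic levels (hypothetical unit delays): Td = 7 splits into 3 + 4.
print(best_cut([1, 1, 1, 1, 1, 1, 1]))   # prints "(3, 4.0)"
```

With an odd number of equal levels the split cannot be exact, so the achievable stage delay is 4 units rather than the ideal 3.5.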



Table 5 gives the results of automatic pipelining of the Sklansky adder. Table 6 gives the results of automatic pipelining of the Kogge-Stone adder.

From the results, it is observed that:

- Adding one output pipeline stage improves the timing, while adding two pipeline stages does not. The main reason is that the delay distribution consists of approximately 25%-30% logic delay and approximately 70% routing delay.
- Although adding 2 pipeline stages improves the flip-flop to flip-flop delay, the total delay is worse than with only 1 pipeline stage because of the routing delay.
- Another important factor that might prevent good performance could be the high usage of I/O pins, which imposes another level of complexity on the place and route tool.
- The faster a certain path is, the larger the percentage of its delay that is contributed by the actual logic delay.
- Multiple iterations of synthesis, place and route produce slightly different results.

| Number of Pipeline Stages | Total Slices Usage | Maximum Delay / Frequency | Delay Distribution: Logic % / Routing % |
|---|---|---|---|
| 1 input stage | 551 (11%) | 10.895 ns / 91.7 MHz | 33 / 67 |
| 2 input stages | 746 (14%) | 9.9 ns / 101 MHz | 36 / 64 |
| 1 output stage | 603 (11%) | 12.174 ns / 82.1 MHz | 32 / 68 |
| 2 output stages | 630 (12%) | 12.644 ns / 79.1 MHz | 27 / 73 |
| 1 stage at input and output | 571 (11%) | 8.905 ns / 112.2 MHz | 43 / 57 |
| 2 stages at input and output | 777 (15%) | 8.698 ns / 114.9 MHz | 45 / 55 |

Table 5: Placement and Routing Results of Pipelined Sklansky Adder

| Number of Pipeline Stages | Total Slices Usage | Maximum Delay / Frequency | Delay Distribution: Logic % / Routing % |
|---|---|---|---|
| 1 input stage | 838 (16%) | 11.112 ns / 89.9 MHz | 32 / 68 |
| 2 input stages | 948 (18%) | 10.597 ns / 94.36 MHz | 28 / 72 |
| 1 output stage | 852 (16%) | 8.802 ns / 113.6 MHz | 30 / 70 |
| 2 output stages | 933 (18%) | 9.286 ns / 107.68 MHz | 41 / 69 |
| 1 stage at input and output | 888 (17%) | 7.724 ns / 129.4 MHz | 43 / 57 |
| 2 stages at input and output | 1075 (%) | 7.612 ns / 131.3 MHz | 47 / 53 |

Table 6: Placement and Routing Results of Pipelined Kogge-Stone Adder



5 Summary and Conclusions

Two different parallel prefix 128-bit adders were designed, analyzed and tested.

At the beginning of the design process, it was noted that the required device (XC2V500) could not accommodate the design because of the limited number of available user I/O pins. Two alternatives were considered for the next step of the design: using multiplexed I/O, and hence reducing the overall number of used I/Os, or changing the target device to the XC2V1000. The second alternative was chosen because it did not require redesign or introduce further levels of complexity.

Due to the nature of the Kogge-Stone prefix tree, the resource usage of the Kogge-Stone adder was expected to be greater than that of the Sklansky adder, and this was confirmed by the results.

It was also observed that multiple iterations of synthesis of the same design sometimes produce slightly different placement results in terms of logic resource usage and timing. The reason is that the placement and routing algorithm used by the Xilinx tools is based on randomized initial settings [6], [8], as opposed to Altera [7].

The designs were enhanced by inserting a number of pipeline stages, and the results were analyzed. It turns out that pipelining does not necessarily improve the design speed. The main reason is that the delay distribution in most cases consists of approximately 20% to 40% logic delay, with the remaining 80% down to 60% being routing delay. So, it is concluded that adding more pipeline stages does not necessarily improve the total delay.



6 References

[1] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Transactions on Computers, Vol. C-31, No. 3, pp. 260-264, March 1982.

[2] J. Sklansky, "Conditional-Sum Addition Logic", IRE Transactions on Electronic Computers, Vol. EC-9, No. 2, pp. 226-231, June 1960.

[3] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, Vol. C-22, No. 8, pp. 786-793, August 1973.

[4] Paul H. Bardell, William H. McAnney, and Jacob Savir, "Built-In Test for VLSI: Pseudorandom Techniques", John Wiley & Sons, New York, 1987.

[5] V. G. Oklobdzija and E. R. Barnes, "Some Optimal Schemes for ALU Implementation in VLSI Technology", Proceedings of the 7th Symposium on Computer Arithmetic ARITH-7, pp. 2-8. Reprinted in Computer Arithmetic, E. E. Swartzlander (editor), Vol. II, pp. 137-142, 1985.

[6] Xilinx Programmable Logic Devices PLD & FPGA, www.xilinx.com

[7] Synplicity Synplify Pro 7.02 user's guide, www.synplicity.com

[8] Xilinx ISE 6.2 / 6.3 user's manual, www.xilinx.com
