
Muravin Project

Date post: 14-Apr-2018
Upload: yermakov-vadim-ivanovich
  • 7/27/2019 Muravin Project


    Design of Two Different 128-bit Adders

    Project Report

    By Vladislav Muravin

    Concordia ID: 5505763

    COEN6501: Digital Design & Synthesis

    Offered by Professor Asim Al-Khalili

    Concordia University

    December 2004


    Table of Contents

    1 INTRODUCTION
    1.1 REPORT ORGANIZATION
    1.2 COMMON ADDER STRUCTURES
    1.2.1 1-bit Full Adder
    1.2.2 N-bit Ripple Carry Adder
    1.2.3 Carry Skip Adder
    1.2.4 Carry Select Adder
    1.2.5 Carry Look Ahead Adder
    1.2.6 Prefix Adders
    1.2.7 Sklansky Prefix Adder
    1.2.8 Kogge-Stone Prefix Adder
    2 DESIGN FLOW & IMPLEMENTATION
    2.1 MICRO ARCHITECTURE
    2.1.1 Top Entity
    2.1.2 Sub-Block Partitioning
    2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)
    2.1.2.2 Carry Generation Block
    2.1.2.2.1 Carry Generation Block: Sklansky Prefix Adder (cg_gen_sklansky)
    2.1.2.2.2 Carry Generation Block: Kogge-Stone Prefix Adder (cg_gen_kogge_stone)
    2.1.2.3 Sum Bits Generation Block (sb_gen)
    2.2 RTL CODING
    2.3 VERIFICATION PLAN
    2.4 SYNTHESIS, PLACE AND ROUTE
    3 RESULTS
    3.1 SIMULATION RESULTS
    3.1.1 Initial Test Cases
    3.1.2 General Test Case
    3.2 SYNTHESIS RESULTS
    3.2.1 Multiplexing I/O
    3.2.1.1 Multiplexed Inputs
    3.2.1.2 Multiplexed Outputs
    3.2.1.3 Multiplexed Inputs and Outputs
    3.2.2 Changing Target Device
    4 DESIGN ENHANCEMENT: PIPELINING
    5 SUMMARY AND CONCLUSIONS
    6 REFERENCES


    Table of Figures

    FIGURE 1: 1-BIT FULL ADDER
    FIGURE 2: N-BIT CARRY PROPAGATE ADDER
    FIGURE 3: CARRY SKIP CONCEPT
    FIGURE 4: CARRY SELECT CONCEPT
    FIGURE 5: SKLANSKY PREFIX TREE
    FIGURE 6: KOGGE-STONE PREFIX TREE
    FIGURE 7: DESIGN FLOW
    FIGURE 8: TOP LEVEL VIEW
    FIGURE 9: FULL_ADDER SUB-BLOCK PARTITIONING
    FIGURE 10: "CARRY GENERATE" AND "CARRY PROPAGATE" BLOCK IMPLEMENTATION
    FIGURE 11: SUM BITS GENERATION BLOCK IMPLEMENTATION
    FIGURE 12: TEST BENCH & VERIFICATION PLAN
    FIGURE 13: INITIAL TEST CASE SIMULATION RESULTS
    FIGURE 14: GENERAL TEST CASE - FULL ZOOM
    FIGURE 15: GENERAL TEST CASE - EXAMPLE 1
    FIGURE 16: GENERAL TEST CASE - EXAMPLE 2
    FIGURE 17: FORWARD REGISTERS BALANCING (PIPELINING)
    FIGURE 18: BACKWARD REGISTERS BALANCING (PIPELINING)

    TABLE 1: SIGNAL DESCRIPTION
    TABLE 2: SYNTHESIS RESULTS (NO PLACEMENT AND ROUTING): XC2V500-FG456-4 DEVICE
    TABLE 3: SYNTHESIS RESULTS: XC2V1000-FF896-4 DEVICE
    TABLE 4: PLACEMENT AND ROUTING RESULTS: FF896-4 DEVICE
    TABLE 5: PLACEMENT AND ROUTING RESULTS OF PIPELINED SKLANSKY ADDER
    TABLE 6: PLACEMENT AND ROUTING RESULTS OF PIPELINED KOGGE-STONE ADDER


    1 Introduction

    The objective of this project is to design two different 128-bit adders by going through the full design cycle, from initial concept to structural RTL coding, simulation and synthesis for the Xilinx Virtex-2 FPGA family, device XC2V500.

    1.1 Report Organization

    The report is organized into six sections. Section 1 introduces common principles of adder design and adder structures, briefly describing the Ripple Carry, Carry Skip, Carry Select and Carry Look-Ahead principles, with further elaboration on parallel-prefix adders, two of which, the Sklansky prefix adder and the Kogge-Stone prefix adder, are implemented in this project. Section 2 describes the design flow, the micro architecture of the design and the verification plan. Section 3 presents the simulation and synthesis results, and section 4 describes a design enhancement by pipelining. Finally, section 5 gives the summary and conclusions, and section 6 lists the references.

    1.2 Common Adder Structures

    1.2.1 1-bit Full Adder

    A 1-bit Full Adder is shown in Figure 1. The equations describing the outputs are:

    $S = A \oplus B \oplus C_{in}$

    $C_{out} = AB + C_{in}(A \oplus B)$

    Figure 1: 1-bit Full Adder (inputs A, B and Cin; outputs S and Cout)
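As a quick cross-check of the two equations, the short Python model below (a behavioral sketch, not part of the VHDL design; the name full_adder is illustrative) evaluates S and Cout for all eight input combinations and compares them with ordinary integer addition.

```python
def full_adder(a, b, cin):
    """Behavioral model of the 1-bit full adder of Figure 1."""
    s = a ^ b ^ cin                        # S = A xor B xor Cin
    cout = (a & b) | (cin & (a ^ b))       # Cout = AB + Cin(A xor B)
    return s, cout

# Exhaustive check: 2*Cout + S must equal the arithmetic sum of the three inputs.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```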

    1.2.2 N-bit Ripple Carry Adder

    Extending the 1-bit full adder iteratively to N bits leads to a cascade of 1-bit full adders, in which the carry-out of each stage is the carry-in of the next. This concept is illustrated in Figure 2. As N increases, the critical path, which is the carry path ending at $C_{out}$, grows linearly.


    Figure 2: N-bit Carry Propagate Adder (N cascaded full adders with inputs $A_0, B_0$ to $A_{N-1}, B_{N-1}$, carry-in $C_0$, sums $S_0$ to $S_{N-1}$ and carry-out $C_{out}$)
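A minimal Python sketch of the cascade, assuming unsigned operands held in Python integers (ripple_carry_add is a hypothetical helper name). The loop body is one full adder, and each iteration consumes the carry produced by the previous one, which is why the carry path grows linearly with N.

```python
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n, cin=0):
    """Add two n-bit integers bit by bit; the carry ripples from bit 0 to bit n-1."""
    carry = cin
    result = 0
    for i in range(n):                     # one full adder per bit, in series
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry                   # (sum bits, Cout)

x, y = 0xDEADBEEF, 0xFFFF0001
s, cout = ripple_carry_add(x, y, 32)
assert (cout << 32) | s == x + y
```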

    1.2.3 Carry Skip Adder

    Let $p_i = a_i \oplus b_i$ and $g_i = a_i b_i$, where $p$ denotes "propagate" and $g$ denotes "generate".

    The basic carry-skip (or carry-bypass) adder divides an N-bit adder into $N/M$ blocks, each containing M bits, as shown in Figure 3. Within each block, a simple M-bit ripple adder is realized (linear-time Carry Skip Adder), where the "propagate" and "generate" signals of the respective input bits are used to form the output sum bits and the block carry-out. The multiplexer at the end of each block allows the input carry to bypass the block when all of the "propagate" signals in that block are asserted. Hence, after the carry-generate delay of the first block, the carry travels through the subsequent blocks only via their bypass multiplexers. If any of the "propagate" signals in a block is unasserted, the carry-out of that block does not depend on the input carry from the previous blocks (the carry is generated or absorbed inside the block), so the multiplexer simply passes the internally computed carry. The critical path delay is

    $t_{PD} = t_{setup} + M t_{FA} + \left(\frac{N}{M} - 1\right) t_{MUX} + (M - 1) t_{FA} + t_{SUM}$

    where $t_{setup}$ is the delay to form $p_i$ and $g_i$, $t_{FA}$ is the carry delay of one full adder, $t_{MUX}$ is the bypass multiplexer delay and $t_{SUM}$ is the delay of the final sum bit.

    Section 1.2.4 explains how better performance can be achieved by varying the block size.

    Figure 3: Carry Skip Concept (blocks of M bits, each with "Carry Propagation" logic and "Carry Select Logic"; carry-in Cin, block sums SUM(0) to SUM(M-1), carry-out Cout)
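To see how the block size trades the in-block ripple against the number of bypass multiplexers, the critical-path expression given above can be evaluated numerically. The sketch below assumes unit delays for every term (a hypothetical choice; real values come from the target technology) and that M divides N; the function names are illustrative.

```python
def carry_skip_delay(n, m, t_setup=1.0, t_fa=1.0, t_mux=1.0, t_sum=1.0):
    """t_PD = t_setup + M*t_FA + (N/M - 1)*t_MUX + (M - 1)*t_FA + t_SUM."""
    assert n % m == 0, "this sketch assumes M divides N"
    return (t_setup + m * t_fa + (n // m - 1) * t_mux
            + (m - 1) * t_fa + t_sum)

def best_block_size(n, **delays):
    """Evaluate every block size that divides N and return the fastest one."""
    divisors = [m for m in range(1, n + 1) if n % m == 0]
    return min(divisors, key=lambda m: carry_skip_delay(n, m, **delays))

m = best_block_size(128)
print(m, carry_skip_delay(128, m))   # 8 32.0
```

With unit delays the expression reduces to a constant plus $(2M - 1) + (N/M - 1)$, which is minimized near $M = \sqrt{N/2}$, i.e. M = 8 for N = 128.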


    1.2.4 Carry Select Adder

    Although this type of adder requires more hardware, its design concept is attractive. The linear Carry Select Adder is divided into $N/M$ blocks, each containing M bits, just as in the Carry Skip Adder. In each block, the hardware is replicated in order to calculate the sum and carry-out bits for both possible carry-ins. Figure 4 illustrates this concept. The multiplexer at the end of each block chooses between the two precomputed results based on the carry-in from the previous block. In this implementation, the critical path delay comprises the carry-generate delay of the first block, followed by the multiplexer delays of the successive blocks. This results in a linear-time Carry Select Adder.

    Variable-sized blocks can yield higher performance [5]. In a carry-select adder, the block sizes can increase along the carry chain so that the delay is minimized by having all inputs of each multiplexer arrive at the same time. For example, if the multiplexer delay is similar to the delay of a full adder, the minimal carry delay is achieved with 1 bit in the first block, 2 in the second, and so on. Linearly increasing block sizes result in a number of block stages that grows as the square root of N, and hence a square-root-time Carry Select Adder (CSA). A similar approach yields a square-root-time Carry Skip Adder (CSkA).

    Figure 4: Carry Select Concept (each M-bit block is duplicated as two M-bit adders, one per carry-in value, followed by a multiplexer; carry-in Cin, block sums SUM(0) to SUM(M-1), carry-out Cout)
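The sketch below models a carry-select adder functionally. block_add stands in for an M-bit ripple adder using Python integer addition (an assumption made for brevity), and the conditional expression plays the role of each block's multiplexer. growing_blocks is a hypothetical helper that builds the 1, 2, 3, ... block sizes discussed above, truncating the last block so the sizes add up to N.

```python
def block_add(x, y, width, cin):
    """Plain width-bit addition of one block; returns (sum bits, carry-out)."""
    total = x + y + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(x, y, block_sizes, cin=0):
    """Each block precomputes results for carry-in 0 and 1; a multiplexer
    picks one once the real carry-in from the previous block is known."""
    result, pos, carry = 0, 0, cin
    for w in block_sizes:
        xb = (x >> pos) & ((1 << w) - 1)
        yb = (y >> pos) & ((1 << w) - 1)
        s0, c0 = block_add(xb, yb, w, 0)              # replicated hardware, cin = 0
        s1, c1 = block_add(xb, yb, w, 1)              # replicated hardware, cin = 1
        s, carry = (s1, c1) if carry else (s0, c0)    # the block's multiplexer
        result |= s << pos
        pos += w
    return result, carry

def growing_blocks(n):
    """Block sizes 1, 2, 3, ... covering n bits (last block truncated)."""
    sizes, k = [], 1
    while sum(sizes) < n:
        sizes.append(min(k, n - sum(sizes)))
        k += 1
    return sizes

sizes = growing_blocks(128)
print(len(sizes), sizes[-1])   # 16 8
x, y = (1 << 128) - 1, 12345
s, cout = carry_select_add(x, y, sizes)
assert (cout << 128) | s == x + y
```

For N = 128 the growing sizes give 16 blocks, consistent with the square-root growth of the number of block stages ($\sqrt{2N} = 16$).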

    1.2.5 Carry Look Ahead Adder

    The Ripple Carry Adder implementation imposes sequential generation of the carries, making the output carry of each stage dependent on the input carry to that stage. In a Carry Look Ahead implementation, each carry is instead computed directly from the "propagate" and "generate" signals and the input carry $c_0$, without waiting for the previous carries.

    Let $p_i = a_i \oplus b_i$ and $g_i = a_i b_i$, where $p$ denotes "propagate" and $g$ denotes "generate".

    Then $s_i = p_i \oplus c_i$ and $c_{i+1} = g_i + p_i c_i$.

    Expanding the above equations for an N-bit adder gives:

    $c_1 = g_0 + p_0 c_0$

    $c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$

    $c_n = g_{n-1} + p_{n-1} g_{n-2} + \dots + p_{n-1} p_{n-2} \cdots p_1 g_0 + p_{n-1} p_{n-2} \cdots p_1 p_0 c_0$

    Since no carry depends on the previous carries, each carry can be implemented as a two-level sum of products, which results in less delay and hence higher speed. Unfortunately, because CMOS gate delay increases non-linearly as the fan-in grows, the Carry Look Ahead implementation is used in a modular way, by cascading several 4-bit CLAs.
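The expanded equations above can be checked exhaustively for one 4-bit CLA unit. In this Python sketch (cla4_carries is an illustrative name), each carry is written out as a flat sum of products, so none of them is computed from another carry; the growing number of terms in c4 shows the fan-in growth that motivates cascading 4-bit CLAs.

```python
from itertools import product

def cla4_carries(p, g, c0):
    """Carries c1..c4 of a 4-bit CLA unit, each a two-level sum of products."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

# Exhaustive comparison with the ripple recurrence c_{i+1} = g_i + p_i c_i.
for a, b, c0 in product(range(16), range(16), (0, 1)):
    p = [((a ^ b) >> i) & 1 for i in range(4)]
    g = [((a & b) >> i) & 1 for i in range(4)]
    ripple, c = [], c0
    for i in range(4):
        c = g[i] | (p[i] & c)
        ripple.append(c)
    assert cla4_carries(p, g, c0) == ripple
```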

    1.2.6 Prefix Adders

    In very simple words, a parallel prefix algorithm takes n inputs $x_{n-1}, x_{n-2}, \dots, x_0$ and produces in parallel n outputs $x_{n-1} \circ \dots \circ x_0,\ x_{n-2} \circ \dots \circ x_0,\ \dots,\ x_0$, where $\circ$ is an associative operator. The analogy between carry computation and the prefix algorithm is that the carry computation at stage i depends on all inputs of stages i-1 down to 0.

    Let $a_{n-1}, a_{n-2}, \dots, a_0$ and $b_{n-1}, b_{n-2}, \dots, b_0$ be the n-bit binary numbers to be added. Let $c_0$ designate the input carry and $c_n$ the output carry. For each bit, the "propagate" ($p_i$) and "generate" ($g_i$) signals are defined as described in the previous section. Furthermore, to parallelize the computation of a carry, two additional terms are defined: Group Carry Generate ($G_{i:j}$) and Group Carry Propagate ($P_{i:j}$).

    For a group of bits from stage j up to stage i, the Group Carry Generate signal $G_{i:j}$ means that a carry is generated somewhere between stages j and i and is propagated from that location to stage i, so that it leaves stage i as $c_{i+1} = 1$. In particular, if $j = 0$ (and $c_0 = 0$), then $G_{i:0} = c_{i+1}$.

    The Group Carry Propagate signal $P_{i:j}$ means that the carry is propagated from stage j through stage i, i.e. $c_{i+1} = c_j$.

    So the formal definition of $G_{i:j}$ and $P_{i:j}$ is expressed using the following relationship:

    $[G_{i:j}, P_{i:j}] = [g_i, p_i]$ if $i = j$

    $[G_{i:j}, P_{i:j}] = [G_{i:k}, P_{i:k}] \circ [G_{k-1:j}, P_{k-1:j}]$ if $i > j$, where $i \ge k > j$

    and the "$\circ$" operator, introduced by Brent and Kung [1], is defined as $[G, P] \circ [G', P'] = [G + P G',\ P P']$.

    Finally, once the final carries $G_{i:0}$ have been computed for all $i < n$, the sum bits are calculated (with $c_0 = 0$) as:

    $s_i = p_i \oplus G_{i-1:0}$ for $n > i \ge 1$, and $s_0 = p_0$

    The traditional carry-ripple adder (CRA) can be regarded as a serial prefix adder under the above definitions.
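The definitions above translate directly into code. The sketch below (names are illustrative, and c_0 = 0 is assumed) implements the "o" operator on [G, P] pairs, checks that it is associative, which is what allows a prefix tree to group the terms in any order, and then applies it serially, giving the carry-ripple adder viewed as a serial prefix adder.

```python
from itertools import product

def combine(left, right):
    """Brent-Kung 'o' operator: [G, P]_{i:k} o [G, P]_{k-1:j} = [G, P]_{i:j}."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return g_hi | (p_hi & g_lo), p_hi & p_lo

# Associativity over all (G, P) bit pairs: any tree shape gives the same result.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for x, y, z in product(pairs, repeat=3):
    assert combine(combine(x, y), z) == combine(x, combine(y, z))

def serial_prefix_add(a, b, n):
    """Carry-ripple adder written as a serial prefix computation (c0 = 0)."""
    gp = [(((a & b) >> i) & 1, ((a ^ b) >> i) & 1) for i in range(n)]
    group = [gp[0]]                               # group[i] = [G_{i:0}, P_{i:0}]
    for i in range(1, n):
        group.append(combine(gp[i], group[i - 1]))
    s = gp[0][1]                                  # s_0 = p_0
    for i in range(1, n):
        s |= (gp[i][1] ^ group[i - 1][0]) << i    # s_i = p_i xor G_{i-1:0}
    return s, group[n - 1][0]                     # carry out = G_{n-1:0}
```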


    1.2.7 Sklansky Prefix Adder

    The Sklansky prefix tree for a 16-bit adder is shown in Figure 5. Its structure is the simplest among the prefix adders. It is used for conditional-sum addition [2]. The fan-out of this adder doubles at each level from input to output along the critical path, reaching $n/2$ at the last level. This leads to a large delay as the operand width increases. The tree for a wider adder is constructed by recursively dividing the operand bits into halves. The number of "o" cells required for the implementation is $\frac{n}{2}\log_2 n$ and the delay is $\log_2 n$ levels, where n is the adder width. The detailed implementation of the "o" cell is described in 2.1.2.2.


    Figure 5: Sklansky Prefix Tree
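The cell placement rule for the Sklansky tree (given in detail in 2.1.2.2.1) can be used to enumerate the tree and confirm the counts quoted above. In this sketch (function names are hypothetical), fan-out is counted as the number of "o" cells in other columns driven by one node, which is the quantity that reaches n/2 at the last level.

```python
from collections import Counter

def sklansky_cells(n):
    """Return (levels, cells) for an n-bit Sklansky tree, n a power of two.
    Each cell is (level j, column i, source column); at level j (1-based)
    column i holds a cell when bit j-1 of i is 1, and it combines with
    column i - (i mod 2**(j-1)) - 1."""
    levels = n.bit_length() - 1                      # log2(n)
    cells = [(j, i, i - (i % 2 ** (j - 1)) - 1)
             for j in range(1, levels + 1)
             for i in range(n) if (i >> (j - 1)) & 1]
    return levels, cells

def max_fanout(cells):
    """Largest number of cells driven by one (level, source column) node."""
    return max(Counter((j, src) for j, _, src in cells).values())

for n in (16, 128):
    depth, cells = sklansky_cells(n)
    assert len(cells) == (n // 2) * depth            # (n/2) log2 n cells
    print(n, depth, len(cells), max_fanout(cells))
# 16 4 32 8
# 128 7 448 64
```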


    1.2.8 Kogge-Stone Prefix Adder

    The Kogge-Stone structure improves on the Sklansky structure by limiting the fan-out to 2, at the expense of a larger number of "o" (circle) cells. It is obtained by copying the of the most significant bit position [3]. Figure 6 shows this prefix tree for 16-bit operands.

    Just as in 1.2.7, the tree for a wider adder is constructed by recursive division of the blocks. The number of "o" cells required for the implementation is $n \log_2 n - n + 1$ and the delay is $\log_2 n$ levels, where n is the adder width. The Kogge-Stone adder is therefore expected to consume more resources than the Sklansky adder. For n = 128, the delay is 7 levels.


    Figure 6: Kogge-Stone Prefix Tree
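The same enumeration for the Kogge-Stone tree, assuming the placement rule that at level j every column i >= 2^(j-1) combines with column i - 2^(j-1), reproduces the $n \log_2 n - n + 1$ cell count and shows the resource difference for n = 128 (769 cells versus 448 for Sklansky). kogge_stone_cells is a hypothetical name.

```python
def kogge_stone_cells(n):
    """Return (levels, cells) for an n-bit Kogge-Stone tree, n a power of two.
    At level j (1-based) every column i >= 2**(j-1) combines with
    column i - 2**(j-1); cells are (level, column, source column)."""
    levels = n.bit_length() - 1                      # log2(n)
    cells = [(j, i, i - 2 ** (j - 1))
             for j in range(1, levels + 1)
             for i in range(2 ** (j - 1), n)]
    return levels, cells

for n in (16, 128):
    depth, cells = kogge_stone_cells(n)
    assert len(cells) == n * depth - n + 1           # n log2 n - n + 1 cells
    print(n, depth, len(cells))
# 16 4 49
# 128 7 769
```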


    2 Design Flow & Implementation

    Figure 7 illustrates the design flow used to implement the prefix adders.

    Figure 7: Design Flow (blocks: Design Specification; Macro Architecture; VHDL RTL Coding, Structural Level (Emacs VHDL mode); Simulation, ModelSim 6.0 SE; Synthesis, Place and Route, Xilinx ISE 6.3 SP3; Verification Plan; Test Case Specification; Test Bench with PRBS Generator; Analyze Results; Compare Results; Reports)


    2.1 Micro Architecture

    2.1.1 Top Entity

    Figure 8 illustrates the top-level view. The top entities are named full_adder_sklansky and full_adder_kogge_stone, respectively, and both have the ports listed in Table 1.

    Figure 8: Top Level View (full_adder_sklansky or full_adder_kogge_stone, with 128-bit inputs operand1 and operand2, inputs sys_clk and reset_n, 128-bit output result and output carry_out)

    Signal Name   Width [bits]   Direction   Comments
    operand1      128            input       Number #1 to be added
    operand2      128            input       Number #2 to be added
    sys_clk       1              input       System clock
    reset_n       1              input       System reset (active low)
    result        128            output      Result of the addition
    carry_out     1              output      Output carry resulting from the addition

    Table 1: Signal Description


    2.1.2 Sub-Block Partitioning

    The top-level block is further partitioned into three sub-blocks, as shown in Figure 9. Many partitioning choices are possible; this one is chosen because with it the two adder designs differ in only one sub-block, the Carry Generation Block (cg_gen). Consequently, two versions of this sub-block are designed: cg_gen_sklansky and cg_gen_kogge_stone.

    Figure 9: full_adder sub-block partitioning (pg_gen, the "Carry Propagate" & "Carry Generate" Block, takes operand1[0..127] and operand2[0..127] and produces p[i] and g[i](0); cg_gen_sklansky or cg_gen_kogge_stone, the 2-D Carry Generation Block, produces g[i](M-1); sb_gen, the Sum Bits Generation Block, produces s[0] to s[127] and carry_out)

    The subsequent sections elaborate on each one of the sub-blocks.


    2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)

    This sub-block calculates the "carry propagate" signals $p[i](0)$ and the "carry generate" signals $g[i](0)$ bitwise from operand1 and operand2, as defined in 1.2.5, namely:

    $p[i](0) = \mathit{operand1}[i] \oplus \mathit{operand2}[i]$

    $g[i](0) = \mathit{operand1}[i] \cdot \mathit{operand2}[i]$

    The implementation is shown in Figure 10. This block consumes 128 2-input AND gates and 128 2-input XOR gates.

    Figure 10: "Carry Generate" and "Carry Propagate" Block Implementation (one AND gate and one XOR gate per bit pair operand1[i], operand2[i], producing g[i] and p[i] for i = 0 to 127)


    2.1.2.2 Carry Generation Block

    The signals $p[i](0)$ and $g[i](0)$ generated in the Precondition Block (pg_gen) are used within the Carry Generation Block to calculate the $g[i](M-1)$ signals; this computation can be represented as a two-dimensional carry-generate structure. The following subsections describe the implementation of the Carry Generation Block for each of the chosen designs.

    2.1.2.2.1 Carry Generation Block: Sklansky Prefix Adder (cg_gen_sklansky)

    Following the Sklansky prefix tree (presented in 1.2.7) and assuming a two-dimensional structure of j rows by i columns, the following observations are made:

    In column i, "o" cells occupy the rows j for which bit j-1 of the binary representation of i is "1", i.e. the placement follows directly from the binary encoding of the index i. In a row corresponding to a "0" bit, column i simply propagates $p[i](j-1)$ and $g[i](j-1)$ unchanged.

    All "o" (circle) cells are of GP type, except those situated on the bottom border of the tree, where $\log_2 i < j$: their group already extends down to bit 0, so they only need to produce g and are of G type.

    The outputs of a GP cell are defined as follows:

    $p[i](j) = p[i](j-1) \cdot p[i - (i \bmod 2^{j-1}) - 1](j-1)$

    $g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$

    The output of a G cell is defined as follows:

    $g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$

    Following the prefix algorithm description, with n = 128 the implementation consumes 448 "o" cells, i.e. 448 2-input OR gates plus the 2-input AND gates of the cells (two per GP cell and one per G cell). The delay is 7 levels and the fan-out is 64.
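The following Python model evaluates the two-dimensional Sklansky recurrence row by row, as written above, for n = 128. cg_gen_sklansky mirrors the entity name but is only a behavioral sketch, not the project's VHDL. It treats a cell as a G cell when its group already reaches bit 0 ($i < 2^j$), checks the final g[i](M-1) values against the carries of an integer addition, and counts the two cell types.

```python
import random

def cg_gen_sklansky(p0, g0):
    """Behavioral sketch of the 2-D Sklansky carry generation array.
    p[j][i] and g[j][i] hold row j, column i; row 0 is the pg_gen output.
    Returns (g row M-1, number of GP cells, number of G cells)."""
    n = len(p0)
    rows = n.bit_length() - 1                    # M - 1 = log2(n) rows of cells
    p, g = [list(p0)], [list(g0)]
    gp_cells = g_cells = 0
    for j in range(1, rows + 1):
        pj, gj = list(p[j - 1]), list(g[j - 1])  # "0" bits: pass p and g through
        for i in range(n):
            if not (i >> (j - 1)) & 1:
                continue
            k = i - (i % 2 ** (j - 1)) - 1       # source column for this cell
            gj[i] = g[j - 1][i] | (p[j - 1][i] & g[j - 1][k])
            if i < 2 ** j:                       # group reaches bit 0: G cell, no p
                g_cells += 1
            else:                                # GP cell: p is needed as well
                pj[i] = p[j - 1][i] & p[j - 1][k]
                gp_cells += 1
        p.append(pj)
        g.append(gj)
    return g[rows], gp_cells, g_cells

random.seed(0)
n = 128
for _ in range(100):
    a, b = random.getrandbits(n), random.getrandbits(n)
    p0 = [((a ^ b) >> i) & 1 for i in range(n)]
    g0 = [((a & b) >> i) & 1 for i in range(n)]
    carries, gp_cells, g_cells = cg_gen_sklansky(p0, g0)
    expected = (a + b) ^ a ^ b                   # bit i+1 is the carry into bit i+1
    assert all(carries[i] == (expected >> (i + 1)) & 1 for i in range(n))
print(gp_cells + g_cells, gp_cells, g_cells)     # 448 321 127
```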

    2.1.2.2.2 Carry Generation Block: Kogge-Stone Prefix Adder (cg_gen_kogge_stone)

    Following the Kogge-Stone prefix tree (presented in 1.2.8) and assuming a two-dimensional structure of j rows by i columns, the nodes in the upper-left part are populated with "o" (circle) cells, while the rest of the two-dimensional array is empty, i.e. the "o" cells are placed in the nodes whose coordinates satisfy the following relationship:

    $1 \le j \le M-1$ and $2^{j-1} \le i \le N-1$

    The outputs of the placed cells are:

    $p[i](j) = p[i](j-1) \cdot p[i - 2^{j-1}](j-1)$

    $g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - 2^{j-1}](j-1)$

    Following the prefix algorithm description, with n = 128 the implementation consumes 769 "o" cells, i.e. 769 2-input OR gates plus the 2-input AND gates of the cells.
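A matching behavioral sketch of the Kogge-Stone recurrence (again a Python model with a hypothetical function name, not the project's VHDL). Each row is computed only from the previous row, as in the hardware array, and the final g[i](M-1) values are checked against the carries of an integer addition.

```python
import random

def cg_gen_kogge_stone(p0, g0):
    """Behavioral sketch of the 2-D Kogge-Stone carry generation array:
    cells occupy rows 1 <= j <= M-1 and columns 2**(j-1) <= i <= N-1."""
    n = len(p0)
    rows = n.bit_length() - 1                    # M - 1 = log2(n)
    p, g = list(p0), list(g0)
    cells = 0
    for j in range(1, rows + 1):
        d = 2 ** (j - 1)
        # Row j is built from row j-1 only; columns i < d pass through unchanged.
        p_next, g_next = list(p), list(g)
        for i in range(d, n):
            g_next[i] = g[i] | (p[i] & g[i - d])
            p_next[i] = p[i] & p[i - d]
            cells += 1
        p, g = p_next, g_next
    return g, cells

random.seed(1)
n = 128
for _ in range(100):
    a, b = random.getrandbits(n), random.getrandbits(n)
    carries, cells = cg_gen_kogge_stone(
        [((a ^ b) >> i) & 1 for i in range(n)],
        [((a & b) >> i) & 1 for i in range(n)])
    expected = (a + b) ^ a ^ b                   # bit i+1 is the carry into bit i+1
    assert all(carries[i] == (expected >> (i + 1)) & 1 for i in range(n))
print(cells)   # 769
```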


    2.1.2.3 Sum Bits Generation Block (sb_gen)

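    The sum bits are produced in the Sum Bits Generation Block by XORing the "carry propagate" signals $p[i](0)$, generated in the Precondition Block, with the "carry generate" bits of the previous column, $g[i-1](M-1)$:

    $s[i] = p[i](0) \oplus g[i-1](M-1)$ for $1 \le i \le 127$, and $s[0] = p[0](0) \oplus \mathit{carry\_in}$

    The output carry, carry_out, is $g[127](M-1)$. Figure 11 illustrates the implementation, which consumes 128 2-input XOR gates.

    Figure 11: Sum Bits Generation Block Implementation (XOR gates combining p[1] with g[0](M-1), ..., p[127] with g[126](M-1), and p[0] with carry_in, producing s[0] to s[127])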
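Since the two designs differ only in the carry generation block, the three-block partition can be sketched as three functions with a pluggable middle stage. In this Python model the carry generation stage is a serial prefix stand-in (hypothetical; a Sklansky or Kogge-Stone recurrence would replace it), and the carry-in is assumed to be 0, matching Table 1, which lists no carry-in port.

```python
def pg_gen(op1, op2, n):
    """p[i](0) = operand1[i] xor operand2[i]; g[i](0) = operand1[i] and operand2[i]."""
    p = [((op1 ^ op2) >> i) & 1 for i in range(n)]
    g = [((op1 & op2) >> i) & 1 for i in range(n)]
    return p, g

def cg_gen_serial(p, g):
    """Stand-in carry generation block (serial prefix): G_{i:0} for every column."""
    out = [g[0]]
    for i in range(1, len(p)):
        out.append(g[i] | (p[i] & out[i - 1]))
    return out

def sb_gen(p, g_final):
    """s[0] = p[0] (carry-in assumed 0); s[i] = p[i] xor g[i-1](M-1);
    carry_out = g[n-1](M-1)."""
    result = p[0]
    for i in range(1, len(p)):
        result |= (p[i] ^ g_final[i - 1]) << i
    return result, g_final[-1]

def full_adder_top(op1, op2, n=128, cg_gen=cg_gen_serial):
    """Top level: only the cg_gen stage changes between adder variants."""
    p, g = pg_gen(op1, op2, n)
    return sb_gen(p, cg_gen(p, g))

result, carry_out = full_adder_top((1 << 128) - 1, 1)
print(hex(result), carry_out)   # 0x0 1
```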

    2.2 RTL Coding

    RTL coding is done in VHDL at the structural level. The basic cells are the 2-input AND gate, the 2-input OR gate, the 2-input XOR gate and the positive-edge-triggered D flip-flop. The text editor used is Emacs version 20.7 with VHDL mode, since it provides many templates for aligning VHDL code, which makes the code easy to read. Each file has a header at the top describing the entity name and its logical function.

    2.3 Verification Plan

    In general, the way to verify a design (especially a large and complex one) in many scenarios, with many possible input combinations, is to describe the same functionality in a high-level language, such as C/C++, or with verification tools, such as Verisity Specman, and compare the results. For the verification of the two full adders, the plan shown in Figure 12 is proposed.

    A test bench, written in behavioral Verilog, instantiates both designs. Two 128-bit numbers are generated using dedicated LFSRs (Linear Feedback Shift Registers) [4], which generate pseudo-random bit streams.

    Every clock cycle, the values of the two 128-bit numbers change pseudo-randomly. These values are summed using the '+' operation within the test bench and are also applied as inputs to both adders. The resulting output sum and carry of each adder are compared with the result of the '+' addition within the test bench.


    A test case is successful (test passed) when the result of the test bench matches the result of each adder.

    Figure 12: Test Bench & Verification Plan (two 128-bit PRBS generators drive operand1[127:0] and operand2[127:0] into full_adder_sklansky and full_adder_kogge_stone; the test bench computes operand1+operand2, compares it with each adder's result[127:0] and carry_out, and produces match_sklansky, match_kogge_stone and a results file)
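The test bench itself is behavioral Verilog; the Python sketch below only illustrates the same idea. The report does not specify its LFSR, so this sketch assumes a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (a common maximal-length choice), clocked 128 times to build each operand; the width, taps, seeds and function names are hypothetical, and ripple_dut is a stand-in for either adder.

```python
def lfsr16_step(state):
    """One clock of a 16-bit Fibonacci LFSR, taps 16, 14, 13, 11
    (x^16 + x^14 + x^13 + x^11 + 1, maximal length: period 65535)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def prbs_word(state, width=128):
    """Clock the LFSR `width` times, collecting its LSB into one operand."""
    word = 0
    for i in range(width):
        word |= (state & 1) << i
        state = lfsr16_step(state)
    return word, state

def ripple_dut(a, b, n=128):
    """Stand-in device under test: a bit-level ripple carry adder."""
    carry = s = 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        s |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return s, carry

def run_test_bench(dut, cycles=100, n=128):
    """Count the cycles in which the DUT matches the '+' reference."""
    s1, s2 = 0xACE1, 0x1D0F              # arbitrary nonzero seeds (hypothetical)
    matches = 0
    for _ in range(cycles):              # one new operand pair per "clock cycle"
        a, s1 = prbs_word(s1, n)
        b, s2 = prbs_word(s2, n)
        result, carry_out = dut(a, b, n)
        matches += ((carry_out << n) | result) == a + b
    return matches

print(run_test_bench(ripple_dut))        # 100: every cycle matches
```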

    2.4 Synthesis, Place and Route

    Synthesis, placement and routing of the design are done using Xilinx ISE 6.3i software with the latest service pack, SP3. The constraints are set for the best timing by selecting the optimization criterion "speed" with maximum effort. More details on the results, as well as the problems encountered, are given in section 3.2.

    3 Results


    3.1 Simulation Results

    3.1.1 Initial Test Cases

    The initial test cases are sums of fixed 128-bit numbers.

    The first case verifies the sum of the following numbers: 64 zeros followed by 64 ones, and 64 ones followed by 64 zeros.

    The next case verifies the sum of 32 repetitions of 0xA and 32 repetitions of 0x5.

    In this fashion, possible bit swapping or incorrect index generation is tested. Figure 13 illustrates the simulation results for the initial test cases.

    operand1 and operand2 are the two 128-bit numbers to be added. result and carry_out are the outputs of each adder, marked by the appropriate divider (Sklansky Adder and Kogge-Stone Adder, respectively).

    Figure 13: Initial Test Case Simulation Results
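The expected outputs of the two initial test cases can be derived directly: in both cases the operands are bitwise complements, so every $p_i = 1$ and every $g_i = 0$, and the sum is 128 ones with carry_out = 0. The sketch below reads "64 zeros followed by 64 ones" most-significant bit first and assigns the patterns to operand1 and operand2 in an arbitrary order (both assumptions).

```python
N = 128
MASK = (1 << N) - 1

# Case 1: "64 zeros followed by 64 ones" and "64 ones followed by 64 zeros",
# read most-significant bit first (assumption).
op1_case1 = (1 << 64) - 1              # upper half 0, lower half 1
op2_case1 = ((1 << 64) - 1) << 64      # upper half 1, lower half 0

# Case 2: 32 repetitions of the hex digit A and of the hex digit 5.
op1_case2 = int("A" * 32, 16)
op2_case2 = int("5" * 32, 16)

for a, b in ((op1_case1, op2_case1), (op1_case2, op2_case2)):
    assert a & b == 0 and a ^ b == MASK    # every g_i = 0, every p_i = 1
    total = a + b
    result, carry_out = total & MASK, total >> N
    assert result == MASK and carry_out == 0
```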

    3.1.2 General Test Case

    In the general test case, the data is generated pseudo-randomly, as described in section 2.3. Three snapshots of the simulation results are given in the following figures.

    Figure 14 illustrates the entire simulation. The lowest divider separates the test bench signals. operand1_prbs and operand2_prbs are the 128-bit PRBS data applied to the adders. operand1 and operand2 are the input numbers; result and carry_out are the outputs of the adder circuits, marked by the corresponding divider (Sklansky Adder and Kogge-Stone Adder, respectively). Two further important test bench signals are result_match_sklansky and result_match_kogge_stone, which are updated every clock cycle to indicate whether the test bench result matches the result of the Sklansky adder and the Kogge-Stone adder, respectively.

    Figure 15 and Figure 16 give two zoomed-in views of the same simulation.


    Figure 14: General Test Case - Full Zoom


    Figure 15: General Test Case - Example 1

    Figure 16: General Test Case - Example 2


    3.2 Synthesis Results

    Both designs were successfully synthesized for the Virtex-2 device XC2V500. The synthesis results are summarized in Table 2. As expected, the Kogge-Stone adder consumes more resources than the Sklansky adder.

    Results explanation (Table 2): the inputs and outputs of the design are registered (385 registers = 2 x 128 operand bits + 128 result bits + 1 carry bit), so that the delay estimate reflects a design whose inputs and outputs are registered. Furthermore, in the placement and routing stage, an option that forces the flip-flops to be packed into the I/O buffers is selected, so that the logic delay is a true estimate of each adder's processing delay in this FPGA implementation.

    However, the maximum number of available user I/O pins for this device is 264 (package FG456), while the design needs 387 (256 operand inputs, 128 result outputs, carry_out, sys_clk and reset_n). Further placement and routing of the design, and hence a true estimate of its logic delay, is therefore not possible. Consequently, there are two alternatives. One is to multiplex the I/Os in order to fit the design into the XC2V500 device. The other is to select a larger device, the XC2V1000. Both alternatives are described in the following subsections.

    Table 2: Synthesis Results (No placement and routing): XC2V500-FG456-4 device

    Design              LUTs Usage   1-bit Registers Usage   Total Slices Usage   Maximum Frequency
    Sklansky Adder      829 (13%)    385 (6%)                453 (14%)            85.6 MHz
    Kogge-Stone Adder   1449 (23%)   385 (6%)                751 (24%)            100.5 MHz
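The pin shortage can be quantified from the port widths in Table 1. The sketch below counts the required I/O pins and, for the multiplexed-input alternative, the pins needed when both operands are loaded over several cycles; handshaking signals are not counted, and pins_with_muxed_inputs is a hypothetical helper.

```python
# Port widths from Table 1.
ports = {"operand1": 128, "operand2": 128, "sys_clk": 1,
         "reset_n": 1, "result": 128, "carry_out": 1}
AVAILABLE_IO = 264          # user I/O pins of the XC2V500 in the FG456 package

needed = sum(ports.values())
print(needed, needed > AVAILABLE_IO)     # 387 True

def pins_with_muxed_inputs(factor):
    """Pins when both operands are loaded over `factor` cycles
    (outputs unchanged, handshaking signals not counted)."""
    operand_pins = (ports["operand1"] + ports["operand2"]) // factor
    return needed - ports["operand1"] - ports["operand2"] + operand_pins

for factor in (1, 2, 4):
    print(factor, pins_with_muxed_inputs(factor))
# 1 387
# 2 259
# 4 195
```

Loading the operands over two cycles would need 259 pins, leaving 5 of the 264 user I/O pins for handshaking.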

    3.2.1 Multiplexing I/O

    This alternative requires a complete redesign of the interface and changes to the overall architecture of the design. Loading the operands and outputting the result in a multiplexed way each have advantages and disadvantages, which are summarized below. In addition, handshaking signals that designate the start of loading and the completion of the addition are required.

    3.2.1.1 Multiplexed Inputs

    In this case, the design latency (overall processing time) will increase, since the input numbers cannot be acquired in a single cycle. However, two major advantages can be achieved. First, the logic required for the addition can be reduced, since the adder does not need to process more bits than are present on the interface in the same cycle. Consequently, the addition can be performed in a multiplexed fashion, especially if the least significant part of the numbers is loaded first. Second, the overall speed of the design is expected to increase, as the complexity and the number of combinational logic levels decrease.
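A minimal behavioral sketch of the multiplexed-input addition, assuming the operands arrive in 64-bit halves, least significant half first, with the carry held in a register between cycles (the chunk size and the name muxed_add are hypothetical). Only a 64-bit adder is exercised per cycle, which is the logic reduction described above.

```python
def muxed_add(op1, op2, n=128, chunk=64):
    """Add two n-bit operands that arrive `chunk` bits per cycle,
    least significant part first; a carry register links the cycles."""
    mask = (1 << chunk) - 1
    carry_reg, result = 0, 0
    for cycle in range(n // chunk):              # one loading cycle per chunk
        shift = cycle * chunk
        a = (op1 >> shift) & mask
        b = (op2 >> shift) & mask
        total = a + b + carry_reg                # the narrow chunk-bit adder
        result |= (total & mask) << shift
        carry_reg = total >> chunk               # registered for the next cycle
    return result, carry_reg                     # final carry_reg is carry_out

print(muxed_add((1 << 128) - 1, 1))             # (0, 1)
```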


    4 Design Enhancement: Pipelining

    Pipelining is introduced in order to improve the design speed. There are two ways of applying pipelining. The first, manual, way is to locate the point on the critical path whose arrival time is exactly half of the total critical path delay (or one third, if two pipeline stages are inserted, and so on) and to insert a pipeline register there. The second alternative, automatic pipelining, is described below.

    With automatic pipelining, the location of the pipeline registers is chosen by the Xilinx synthesis tool. N pipeline stages are added to the inputs, the outputs, or both the inputs and outputs of a design, and the software optimizes the location of the pipeline registers according to the specified timing requirements and synthesis effort by moving them forward and backward. This is referred to as "forward/backward register balancing" in Xilinx ISE [6] and as "retiming" in Synplicity Synplify Pro 7.xx [7], and it is illustrated in Figure 17 and Figure 18. The software automatically determines Td1 and Td2 according to the given timing constraints and synthesis effort.

    Figure 17: Forward Registers Balancing (Pipelining). [Diagram: a path of delay Td between two pipeline stages clocked by sys_clk is split by an added pipeline stage into two segments, Td1 and Td2, where Td = Td1 + Td2.]

    Figure 18: Backward Registers Balancing (Pipelining). [Diagram: the same split of Td into Td1 and Td2 between sys_clk-driven pipeline stages, with the register moved backward; Td = Td1 + Td2.]
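    The manual approach described above (cutting the critical path at one half, or at one third and two thirds, of its total delay Td) can be sketched in Python. The element delays used below are hypothetical, and the helper names `pipeline_cuts` and `stage_delays` are assumptions for illustration. The model ignores register clock-to-out and setup times as well as the routing changes that the placement results later show to be dominant; an automatic retiming tool instead searches register positions under the given timing constraints.

```python
from itertools import accumulate


def pipeline_cuts(delays, stages):
    """Manual pipelining: for each of `stages` registers, choose the element
    boundary on the critical path whose arrival time is closest to
    j * Td / (stages + 1). Returns the indices of the elements after which
    a register is inserted."""
    arrival = list(accumulate(delays))      # arrival time at the output of each element (ns)
    td = arrival[-1]                        # total critical-path delay Td
    cuts = []
    for j in range(1, stages + 1):
        target = j * td / (stages + 1)      # Td/2 for one stage; Td/3 and 2Td/3 for two
        # a register after the last element would only add latency, so exclude it
        cuts.append(min(range(len(delays) - 1), key=lambda k: abs(arrival[k] - target)))
    return cuts


def stage_delays(delays, cuts):
    """Combinational delay of each pipeline stage once registers are placed
    after the elements listed in `cuts`."""
    arrival = [0.0] + list(accumulate(delays))
    bounds = [-1] + sorted(cuts) + [len(delays) - 1]
    return [arrival[hi + 1] - arrival[lo + 1] for lo, hi in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    path = [1.0, 2.0, 1.5, 2.5, 1.0, 2.0]   # hypothetical element delays in ns, Td = 10 ns
    for n in (1, 2):
        cuts = pipeline_cuts(path, n)
        print(n, cuts, stage_delays(path, cuts))
```

    For this hypothetical 10 ns path, one inserted stage splits it into 4.5 ns and 5.5 ns segments, so the clock period is bounded by the larger segment rather than by Td; two stages give 3.0, 4.0 and 3.0 ns.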


    Table 5 gives the results of automatic pipelining of the Sklansky adder.

    Table 6 gives the results of automatic pipelining of the Kogge-Stone adder.

    From the results, it is observed that:

    - Adding one output pipeline stage improves the timing, while adding a second output stage does not. The main reason is the delay distribution, which consists of approximately 25% to 30% logic delay and approximately 70% routing delay. Although adding two pipeline stages improves the flip-flop-to-flip-flop logic delay, the routing delay makes the total delay worse than with only one pipeline stage.
    - Another important factor that may prevent better performance is the high usage of I/O pins, which adds another level of complexity for the place-and-route tool.
    - The faster a path is, the larger the fraction of its delay contributed by the actual logic.
    - Multiple iterations of synthesis and place and route produce slightly different results.
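    The routing argument can be checked with simple arithmetic on the two Sklansky output-stage rows of Table 5. The sketch below uses a hypothetical helper, `delay_breakdown`, to convert each reported maximum delay and logic percentage into absolute logic and routing delays and the corresponding maximum frequency f = 1/Td; it only rearranges the reported numbers and does not model the tools.

```python
def delay_breakdown(total_ns, logic_pct):
    """Split a reported critical-path delay into logic and routing parts.

    Returns (logic_ns, routing_ns, fmax_mhz), with fmax = 1 / total delay.
    """
    logic_ns = total_ns * logic_pct / 100.0
    routing_ns = total_ns - logic_ns
    return logic_ns, routing_ns, 1000.0 / total_ns


if __name__ == "__main__":
    # Sklansky adder, Table 5: (configuration, maximum delay in ns, logic %)
    rows = [("1 output stage", 12.174, 32), ("2 output stages", 12.644, 27)]
    for label, total, pct in rows:
        logic, routing, fmax = delay_breakdown(total, pct)
        print(f"{label}: logic {logic:.2f} ns, routing {routing:.2f} ns, {fmax:.1f} MHz")
```

    For these two rows, the logic portion drops from about 3.90 ns to 3.41 ns with the second output stage, while the routing portion grows from about 8.28 ns to 9.23 ns, so the total delay increases.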

    Number of Pipeline Stages       Total Slice Usage    Maximum Delay / Frequency    Delay Distribution (Logic % / Routing %)
    1 input stage                   551 (11%)            10.895 ns / 91.7 MHz         33 / 67
    2 input stages                  746 (14%)            9.9 ns / 101 MHz             36 / 64
    1 output stage                  603 (11%)            12.174 ns / 82.1 MHz         32 / 68
    2 output stages                 630 (12%)            12.644 ns / 79.1 MHz         27 / 73
    1 stage at input and output     571 (11%)            8.905 ns / 112.2 MHz         43 / 57
    2 stages at input and output    777 (15%)            8.698 ns / 114.9 MHz         45 / 55

    Table 5: Placement and Routing Results of Pipelined Sklansky Adder

    Number of Pipeline Stages       Total Slice Usage    Maximum Delay / Frequency    Delay Distribution (Logic % / Routing %)
    1 input stage                   838 (16%)            11.112 ns / 89.9 MHz         32 / 68
    2 input stages                  948 (18%)            10.597 ns / 94.36 MHz        28 / 72
    1 output stage                  852 (16%)            8.802 ns / 113.6 MHz         30 / 70
    2 output stages                 933 (18%)            9.286 ns / 107.68 MHz        41 / 69
    1 stage at input and output     888 (17%)            7.724 ns / 129.4 MHz         43 / 57
    2 stages at input and output    1075 (%)             7.612 ns / 131.3 MHz         47 / 53

    Table 6: Placement and Routing Results of Pipelined Kogge-Stone Adder


    5 Summary and Conclusions

    Two different parallel prefix 128-bit adders were designed, analyzed and tested.

    At the beginning of the design process, it was noted that the required device (XC2V500) could not accommodate the design because of the limited number of available user I/O pins. Two alternatives were discussed and considered for the next design step: using multiplexed I/O and thereby reducing the overall number of used I/Os, or changing the target device to the XC2V1000. The second alternative was chosen because it required no redesign and introduced no additional levels of complexity.

    Because of the structure of the Kogge-Stone prefix network, the Kogge-Stone adder was expected to use more resources than the Sklansky adder, and this was confirmed by the results.
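    The expected resource difference can be illustrated with a small Python behavioral model (an illustration, not the report's HDL) that builds both prefix networks for n = 128, counts the prefix operator cells, and checks that both networks produce correct sums. The helper names are hypothetical. Only the prefix cells are counted: the generate/propagate and sum XOR logic is identical for both adders, and actual LUT and slice counts depend on how the synthesis tool maps the cells.

```python
def combine(hi, lo):
    """Prefix operator on (generate, propagate) pairs: (g, p)_hi o (g, p)_lo."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sklansky(gp):
    """Sklansky prefix network: log2(n) levels of n/2 cells; out[j] feeds d cells."""
    out, cells, d, n = list(gp), 0, 1, len(gp)
    while d < n:
        for i in range(n):
            if i & d:                              # upper half of a 2d-wide block
                j = (i & ~(2 * d - 1)) + d - 1     # top bit of the lower half
                out[i] = combine(out[i], out[j])   # lower-half entries are not written this level
                cells += 1
        d *= 2
    return out, cells


def kogge_stone(gp):
    """Kogge-Stone prefix network: log2(n) levels with n - d cells at span d."""
    out, cells, d, n = list(gp), 0, 1, len(gp)
    while d < n:
        prev = list(out)                           # every cell reads the previous level
        for i in range(d, n):
            out[i] = combine(prev[i], prev[i - d])
            cells += 1
        d *= 2
    return out, cells


def prefix_add(a, b, n, network):
    """n-bit addition (carry-in 0) with carries computed by `network`.
    Returns (sum including carry-out, number of prefix cells)."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    prefixes, cells = network(gp)
    carries = [0] + [g for g, _ in prefixes]      # carry into bit i is G[i-1:0]
    s = sum((gp[i][1] ^ carries[i]) << i for i in range(n))
    return s | (carries[n] << n), cells


if __name__ == "__main__":
    x, y = 2**127 + 12345, 2**127 - 1
    for name, net in (("Sklansky", sklansky), ("Kogge-Stone", kogge_stone)):
        total, cells = prefix_add(x, y, 128, net)
        print(name, cells, total == x + y)
```

    For n = 128 the counts are (n/2)·log2 n = 448 cells for Sklansky and n·log2 n − n + 1 = 769 cells for Kogge-Stone, which is in the same direction as the larger slice usage of the Kogge-Stone adder in Table 2. Sklansky saves cells at the cost of fan-out that doubles at each level, while Kogge-Stone keeps fan-out low at the cost of more cells and wiring.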

    It was also observed that multiple iterations of synthesis of the same design sometimes produce slightly different placement results in terms of logic resource usage and timing. The reason is that the placement and routing algorithm used by the Xilinx tools is based on randomized initial settings [6], [8], unlike that of Altera [7].

    The designs were enhanced by inserting a number of pipeline stages, and the results were analyzed. It turns out that pipelining does not necessarily improve the design speed. The main reason is that, in most cases, only approximately 20% to 40% of the delay is actual logic delay, while the rest, 80% down to 60% respectively, is routing delay. It is therefore concluded that adding more pipeline stages does not necessarily improve the total delay.


    6 References

    [1] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Transactions on Computers, vol. C-31, no. 3, pp. 260-264, March 1982.

    [2] J. Sklansky, "Conditional-Sum Addition Logic", IRE Transactions on Electronic Computers, vol. EC-9, no. 2, pp. 226-231, June 1960.

    [3] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, vol. C-22, no. 8, pp. 786-793, August 1973.

    [4] P. H. Bardell, W. H. McAnney, and J. Savir, "Built-In Test for VLSI: Pseudorandom Techniques", John Wiley & Sons, New York, 1987.

    [5] V. G. Oklobdzija and E. R. Barnes, "Some Optimal Schemes for ALU Implementation in VLSI Technology", Proceedings of the 7th Symposium on Computer Arithmetic (ARITH-7), pp. 2-8, 1985. Reprinted in Computer Arithmetic, E. E. Swartzlander (editor), Vol. II, pp. 137-142.

    [6] Xilinx, Programmable Logic Devices (PLD & FPGA), www.xilinx.com

    [7] Synplicity, Synplify Pro 7.02 User's Guide, www.synplicity.com

    [8] Xilinx, ISE 6.2 / 6.3 User's Manual, www.xilinx.com

