
Top Down Asic

Date post: 04-Apr-2018
Upload: krishna-m-viswanath
  • 7/30/2019 Top Down Asic


    Abstract

    This paper describes recent experiences using physical synthesis tools and top down timing and physical budgeting to complete a 1.6M gate, 0.25u ASIC. The focus of the paper is on the use of floorplanning, top level routing, physical synthesis, and static timing analysis tools to minimize, or remove, the need for costly back end iterations in order to close timing on large designs.

    The design in question contains several hard and soft proprietary IP blocks, including third party IP for a PCI controller, on-chip memory BIST and LogicBIST. Top down timing and physical budgeting is applied to partition the design into 11 functional blocks, including 4 JTAG boundary scan blocks. Block level placement and top-level interblock routing occur early in the process to drive the refinement of the initial timing and area budgets. Each of the blocks is then independently synthesized, placed and routed using Synopsys and Cadence tools. After all blocks are complete, they are re-assembled and merged with the top level routing to create a final design that meets the originally budgeted timing in a single pass.

    The methods used are not particularly design specific and can be applied to similar or even larger SoC designs to assist in shortening the overall design cycle.

    Since this is a confidential customer design of Synopsys Professional Services, the application in question is not disclosed. The tools and flow used were installed at both the customer's site and at Synopsys design centers.

    Author Biographies

    Chris Smith received his Bachelor of Engineering (Honors) in EE from the University of Wales Institute of Science and Technology (UWIST) in 1981. Since then he has worked continuously in the electronics and EDA industries. His first ASIC, a 6K gate design in a 3u DLM process, was designed in 1984/5 using schematic capture on a Daisy Logician workstation. He has worked for British Aerospace, LSI Logic, Motorola, and Cascade Design Automation. Since 1998 he has been a consultant for Synopsys Professional Services.

    Devin Bright received his Bachelor of Science degree in Electrical and Computer Engineering in 1991 and his Master of Science degree in Electrical Engineering in 1993, both from the University of Iowa. Since 1999 he has been a consultant with Synopsys Professional Services, where he has helped multiple customers design numerous complex ASICs. Prior to joining Synopsys he worked for Motorola, architecting cellular infrastructure equipment and designing the necessary embedded silicon.

    The authors have worked together on various Synopsys Professional Services consulting projects with customers to introduce design flow methodologies similar to those discussed in this paper, for a wide range of applications and technologies.


    Introduction

    Top Down Flow Overview And Flow Comparison

    Implementation of the ASIC is primarily a top down approach, but some steps, such as initial block sizing estimates, are performed in a bottom up manner where necessary to reduce iterations.

    In a top down approach the chip is implemented from the die size, I/O timing and block placement requirements downwards, while in a bottom up approach the chip is implemented from block level synthesis results upwards.

    Traditional floorplanning based design flow

    In many design flows floorplanning tools have been in use for several years as an interface to detailed place and route. Floorplanning has typically been seen as a task that aids the back end design team and is usually performed late in the design implementation schedule; this does little to enhance the efficiency of the front-end design team during the early stages of design realization.

    Feedback to the synthesis process occurs during the later stages of design implementation, in the form of extracted RC data from post route. In general the problem of timing closure becomes a process of the back end team iterating the floorplan and place and route and re-generating RC data for the front end team to analyze.

    In addition, for large SoC designs the critical top level, inter block, timing and routing information is not generally available until after all of the blocks have been through place and route; this is especially true for design flows that flatten the complete design prior to place and route.

    The inefficiencies resulting from this late stage data feedback often result in needless and costly iterations between the front end and back end design teams.

    [Figure 1 diagram; box labels: Block Level Synthesis, Netlist Assembly, Floorplanning, Place & Route, RC Extraction, Chip Level STA, Block Level Constraints, Chip Level Constraints; regions: Front End, Back End; loop: Timing Closure Loop]

    Figure 1: New Tools Old Flow Methodology


    Floorplanning as a front end tool

    The floorplanning stage of the design flow can, and should, be moved into the Front End of the design process to speed up timing closure for large SoC designs. The flow depicted in Figure 1 usually requires the handoff of data between multiple teams and, at a minimum, defers full, accurate knowledge of the chip level timing until all blocks have been placed and routed along with the top level interconnect. Typically chip level constraints are created, in a manual process, from the block level constraints, and this is seen as a separate step from creation of the block level constraints. In many cases full chip level constraints are created after the fact from the individual block level constraints used for synthesis.

    In a top down approach, physical constraints (e.g. block abstract, pin placement, top level routes, etc.) and timing constraints are pushed down from the chip level to the block level early on and used to drive the parallel implementation of each block. The overall chip is sized based on early area estimates for soft macros, which can be derived from a quick synthesis run (not necessarily with finalized RTL); the designer's judgment and experience are needed to estimate sizes for incomplete or missing blocks. The overall core area is then partitioned amongst the blocks in proportion to the estimated area resources required and the overall timing requirements.
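The proportional split of core area described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, block names and area figures are hypothetical, and the sketch weights by estimated area alone, without the timing-driven adjustments that the flow also takes into account.

```python
def partition_core_area(core_area_mm2, block_estimates_mm2):
    """Split the core area among blocks in proportion to their estimates."""
    total = sum(block_estimates_mm2.values())
    if total <= 0:
        raise ValueError("block estimates must sum to a positive area")
    return {name: core_area_mm2 * est / total
            for name, est in block_estimates_mm2.items()}


# Hypothetical early estimates (mm^2), e.g. from a quick synthesis run.
estimates = {"blk_a": 6.0, "blk_b": 3.0, "blk_c": 1.0}
area_budget = partition_core_area(50.0, estimates)
for name, area in area_budget.items():
    print(f"{name}: {area:.1f} mm^2")
```

In practice the designer would then adjust individual allocations by hand for blocks whose estimates are known to be immature.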

    Similarly, timing constraints are defined at the chip I/O level and pushed down to the block level. Special consideration, or budgeting, needs to occur for those paths that are inter block in nature (i.e. beginning in one block and ending in another block, potentially passing through multiple blocks en route). For such a path, the overall timing is budgeted amongst the blocks through which it passes, including net delays for top-level routing paths. Using PrimeTime to generate block level constraints from a set of chip level constraints is key to this flow.

    Early definition of top level routing and timing is essential for large designs if timing closure iterations are to be minimized.
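As a rough illustration of budgeting an inter block path, the sketch below treats the top-level net delays as fixed and shares the remaining clock period among the blocks in proportion to their estimated internal delays. It is not the algorithm PrimeTime uses; the function name, the 8 ns period and all delay values are assumptions chosen for the example.

```python
def budget_path(period_ns, block_delays_ns, net_delays_ns):
    """Allocate a path's clock period across the blocks it passes through.

    Top-level net delays are taken as fixed; the time that remains is
    shared among blocks in proportion to their estimated delays.
    """
    available = period_ns - sum(net_delays_ns)
    if available <= 0:
        raise ValueError("top-level net delay alone exceeds the period")
    total = sum(block_delays_ns.values())
    return {blk: available * d / total for blk, d in block_delays_ns.items()}


# Hypothetical path: source block -> feedthrough block -> destination block.
blocks = {"src": 2.0, "feedthrough": 0.5, "dst": 2.5}
nets = [0.6, 0.4]  # two top-level routes between the three blocks
path_budget = budget_path(8.0, blocks, nets)
print(path_budget)
```

Each block's share then becomes its output or input delay constraint for that path, so that meeting every block budget implies meeting the chip level path.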

    [Figure 2 diagram; box labels: Top-Level Netlist Assembly, Floorplanning, Top-Level Routing, RC Extraction, Module Stubs, Block Level Synthesis, Block Level Route, Final RC Extraction, Chip Level STA, Block Level Constraints, Chip Level Constraints; regions: Front End, Back End; loops: Primary Timing Closure Loop, Secondary Timing Closure Loop]

    Figure 2: Floorplan Centric Flow


    Such an approach, with early allocation of constraints, creates an environment in which the blocks can be realized independently and in parallel. When the blocks are completed, and by virtue of each having met its pushed down block level constraints, the overall chip can be assembled. Assuming an accurate budgeting process was performed, the composite chip is then well positioned to meet chip level constraints in a single pass.

    The main timing closure loop occurs in the front end of the design flow, where tradeoffs between block size, RTL implementation and top level routing can more easily be made. By providing accurate top level routing and good block level routing estimates it is possible to generate a consistent set of block constraints which, if all met individually, will result in overall chip level timing closure.

    The back end process then becomes the relatively simple task of completing block level routing for each block in the design, driven by the physical and timing constraints generated during floorplanning and top level routing.

    Design Description

    The ASIC is an element of a highly integrated processing subsystem for a next generation communication system. As with many communication systems, embedded memory requirements were large, consisting of approximately 1.9M bits of memory. This logical memory was implemented in nearly 200 discrete memory macros in order to maintain the required memory cycle times. Additionally, another 1.6M discrete gates were required to implement the desired system level functionality in the selected 0.25u, 5 metal layer process.

    This design is mostly synchronous, using primarily 4 clocks derived from two multiplying analog PLLs (APLLs), with a few additional, non-synchronous clocks associated with external peripheral interfaces. The fastest clock output from the APLLs, and also the most prevalent, was a 135 MHz clock.

    [Figure 3 diagram; block labels: PCI Interface, APLLs, JTAG BSR Blocks]

    Figure 3: Chip Top Level Blocks


    Top Level Floorplanning

    Die Sizing

    The ASIC was sized at 9.25 mm by 9.25 mm. The die size is the primary physical constraint; within reasonable bounds it can be varied to trade off and optimize numerous technical and non-technical criteria. These criteria include packaging requirements, wafer yield (and associated cost per die), ease of place and route, and timing requirements. For a hierarchical implementation, spacing for interblock routing at the top level must also be accounted for, and may be traded off against increasing the die size or decreasing block sizes and increasing potential congestion issues.

    I/O Placement

    The ASIC was designed using standard perimeter I/O. A key constraint driving the I/O placements was the signal groupings per the board layout. The I/O were positioned around the die such that, when bonded out via the package, they were consistent with the requirements derived from the board layout and routing tracks.

    In addition, the need to manage timing skew across a wide bus was another constraint on I/O placement. The skew sensitive I/O signals were evenly distributed along the two sides of the die adjoining its lower right corner. Given that the communicating block emanated from the same corner, such an approach minimized the difference between the farthest and nearest I/O in the bus. This was a first step at proactively managing the route variability, or skew, across the bus.

    For the ASIC, three separate power supplies (core, I/O, and analog) were required in varying quantities to supply various parts of the design. The core supply pads were evenly distributed around all four sides of the die to power a core assumed to be uniform in its dissipation. The I/O power pads were placed such that the peak I/O power dissipation between any adjacent pair of I/O power pads would be nearly uniform. The analog power pads were positioned in close proximity to the APLL macros that required them.

    Beyond the macro positioning discussed so far there were also micro positioning requirements that required attention. Most notable was the minimum I/O pitch requirement, which was satisfied by respecting the spacing requirement. Less obvious was the minimum wire pitch requirement in place for the bonding wires. This requirement could not be checked until the die size was fixed and the package substrate designed. The likelihood of impact from this rule was minimized by providing extra space between the I/O cells near the corners, where extreme (i.e. non-orthogonal) wire angles were most likely to result in a spacing violation.

    Block Sizing and Placement

    Block sizing and positioning is, by necessity, an iterative process in which block aspect ratios, block area, and overall placement must be balanced to produce a floorplan that will meet the die area, routability, and timing requirements. While automatic block level placement may be used to provide an initial starting point, for complex SoC designs it is usually necessary to use manual intervention to create an optimal floorplan solution. Determination of an optimal floorplan requires that the tools used provide meaningful metrics and feedback by which to measure the quality of the floorplan.

    Chip Architect provides visual connectivity feedback in the form of rat's nest fly-lines between the blocks, and weighted net connectivity, which shows the number of connections between blocks. Quantitative metrics such as total Steiner route wire length, individual net global route lengths, or estimated resistance/capacitance from global routes may be used to monitor whether a given floorplan improves upon its predecessors.


    [Figure 4 panels: Rats Nest Flyline Display; Weighted Connectivity Display]

    Figure 4: Different Connectivity Displays

    For the ASIC, a rough estimate of each block's area was obtained by a quick synthesis (using Synopsys design_compiler) to obtain a gate/instance count, and then basing the block size on the instance and/or gate count placed at a certain density (typically 80% was used). A physical hierarchical partitioning was then selected which attempted to keep related pieces of logic together, while still balancing block sizes and minimizing block interconnectivity. A balance needs to be struck between the size of core blocks and the number of core blocks. This ultimately resulted in 11 core logic blocks and 4 BSCAN blocks. Once identified, these block sizes were used to create the initial floorplan layout. The floorplan was created and manipulated using Chip Architect.

    The four BSCAN blocks had extreme aspect ratios; the purpose of these blocks was to provide an array in which the JTAG Boundary Scan Registers (BSR) could be placed. These blocks were long and narrow, effectively spanning the I/O on a side.

    As an early floorplan quality check, the Chip Architect pin assignment and global router capabilities were used to identify any major problems. Sufficient block perimeter to support the pins at the desired pitch and channel utilization were two of the metrics monitored. By this approach, major top-level routing problems were identified early on, when they were easiest to correct.
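The sizing and perimeter checks described above amount to simple arithmetic, sketched below in Python. The 80% density comes from the text; the cell area, pin count and pin pitch are hypothetical values, and the function names are illustrative rather than part of any tool.

```python
import math


def block_dimensions(cell_area_um2, density=0.80, aspect_ratio=1.0):
    """Return (width, height) in microns for a block placed at `density`.

    aspect_ratio is height / width.
    """
    block_area = cell_area_um2 / density
    width = math.sqrt(block_area / aspect_ratio)
    return width, width * aspect_ratio


def perimeter_supports_pins(width, height, pin_count, pin_pitch_um):
    """True if the block perimeter can hold pin_count pins at the given pitch."""
    return 2 * (width + height) / pin_pitch_um >= pin_count


# Hypothetical block: 3.2 mm^2 of standard cell area from a quick synthesis.
w, h = block_dimensions(cell_area_um2=3_200_000)
print(f"block is about {w:.0f} x {h:.0f} um")
print(perimeter_supports_pins(w, h, pin_count=3000, pin_pitch_um=2.0))
```

A long, narrow block such as a BSCAN block would use a very small or very large `aspect_ratio`, which leaves the area unchanged but lengthens the perimeter.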

    First Pass Block Pin Placement

    Initial soft macro block pin placement is determined using Chip Architect in a top down manner, with pin assignment being driven only from the top level interconnect of the design. Global routing and congestion analysis are used to determine the QoR of the pin placement, and manual or scripted modifications to block pin locations are made to improve QoR in an iterative process.

    Block level pin constraints are used to restrict the pin placement to specific routing layers, required grid spacing, and specific sides of each block.

    For simple pin assignment tasks (such as straight bus connections between blocks and I/O pads) the Chip Architect results, which are based upon a fly-line connectivity analysis, are usually satisfactory. For certain other regular pin assignments, scripts were generated to create design specific pin placements; a specific example of this is discussed below for the BSCAN blocks.

    Once the top-level netlist is stable and the top-level floorplan is reasonably defined, further refinement of the block pin assignment and a more complete analysis of the top level routing are performed using FlexRoute.


    BSR Pin Placement

    The 4 BSCAN blocks had extreme aspect ratios that resulted in a limited number of routing resources across the shorter width of the block. To avoid congestion problems inside the blocks, the pin assigned to each BSCAN cell input had to be closely aligned with the pin for the associated BSCAN cell output. Since neither Chip Architect nor FlexRoute has the capability to match input/output pin pairs, scripts were developed to assist in pin placement for this special case.

    The pins on the I/O side of the blocks were automatically aligned with their respective I/O cells as a part of the default Chip Architect pin assignment. The pins on the core side were then aligned, using a design specific Tcl script, with the matching pin on the I/O side of the block. Without this scripting methodology the core side pins would tend to clump together around the top level routing channels into which their connections flowed. While this would be effective for easing top level routing, it would cause significant routing congestion inside the BSCAN blocks.

    It was felt that the greater constraint on these pin placements was the limited routing resources inside the blocks, as opposed to the top level routing resources, and so a small penalty in top level routing was traded off against easier to route BSCAN blocks.
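The pairing logic of such an alignment script can be sketched as follows. The original was a design specific Tcl script that is not reproduced in the paper, so this Python version is only a guess at its core idea: each core side pin takes the offset of its matching I/O side pin, snapped to the routing grid. Pin names, coordinates (in integer database units) and the grid value are all hypothetical.

```python
def snap(value, grid):
    """Snap a coordinate to the nearest multiple of the routing grid."""
    return round(value / grid) * grid


def align_core_pins(io_pin_offsets, pairs, grid):
    """Place each core side pin opposite its matching I/O side pin.

    io_pin_offsets: {io_pin_name: offset along the long edge of the block}
    pairs:          {core_pin_name: io_pin_name}
    Returns         {core_pin_name: on-grid offset along the opposite edge}
    """
    return {core: snap(io_pin_offsets[io], grid) for core, io in pairs.items()}


# Hypothetical BSCAN block with two boundary scan cells.
io_side = {"pad_a_io": 1230, "pad_b_io": 4790}
pairs = {"pad_a_core": "pad_a_io", "pad_b_core": "pad_b_io"}
core_side = align_core_pins(io_side, pairs, grid=100)
print(core_side)
```

A production script would also need to resolve cases where two pins snap to the same grid point, which this sketch does not attempt.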

    Loose Top Level Cells

    The design includes a few (fewer than 20) top-level cells that are used to implement system and test clock multiplexing and APLL test control logic. These cells did not lend themselves well to fly-line driven auto placement, and it was found to be more effective to place them manually in Chip Architect. Since in many cases they form the start or top of a clock tree network, they tended to be rather simple to locate in the center channels of the floorplan.

    Top Level Pin Refinement And Routing

    Top level routing was implemented using FlexRoute, a gridless, n-layer, object-based router. The initial pass of FlexRoute is used to refine the block pin locations based upon real routing considerations. FlexRoute performs Routing Based Pin Assignment (RBPA) by running multiple global routes in a process that aims to find the optimal pin location for each signal that traverses the top level. As in Chip Architect, pin locations are also driven by block specific parameters for preferred layers and sides of blocks.

    The primary objectives for pin assignment are to assign the pins so that the top-level routes are of minimal length and do not cause congestion hot spots between the blocks. This tends to imply that, when possible, pins are assigned such that straight line, single metal nets provide the top-level connectivity. Congestion analysis can be performed and further constraints applied to specific nets or routing channels within the chip to reduce congestion hot spots.

    Once the pin assignments have been updated (and fed back to Chip Architect) a full detailed route of the top level nets can be performed in FlexRoute. Although FlexRoute is gridless and will perform pin placement based upon simply meeting design rules, it is necessary to restrict the block pin placement to be on grid. This means that the block level router (which is a gridded router) has a simpler task when routing to the block boundaries from the inside.

    The FlexRoute global and detail routing engines can take account of net or net segment specific parameters such as metal width and spacing, net shielding, preferred layers and coupling capacitance avoidance. Utilizing increased spacing between critical routes, such as clock lines, helps to minimize noise susceptibility due to long routes running in parallel with each other.

    For this particular design all blocks were designated as total keepouts for top level wiring. It is possible to route over the blocks by allowing one or more routing channels in different metal layers, but this adds some complexity to the extraction of both top level and block level parasitics in isolation, since the interaction of the over-block and internal block routing must be considered.


    Top Level Repeater Insertion

    In a bottom up approach block output loading constraints are conservatively estimated or block output drivers are set to the largest available driver in the library. Whether these estimates are correct, or whether the largest driver in the library is sufficient, is not known until the top level timing is analyzed. Typically the top-level timing cannot be analyzed until the final block integration and detailed route stages of the design flow are completed, at which point iteration back to the block level and/or insertion of top level buffers is required to implement any changes.

    For a top down flow, such as the one outlined here, it is critical that accurate top-level timing and loading information is available early in the design flow in order to drive the budget generation process. Top level routing can be completed in FlexRoute without any knowledge of the timing requirements, but for many nets the resulting RC network may be too large to be driven by the output of a block. As a result it is also necessary to insert repeater cells (buffers or inverters) at the top level of the design to both maintain block input edge rates and reduce the output loading of the block's output driving cell.

    Following top level routing, FlexRoute is used to insert buffers based upon the top-level RC network (extracted by FlexRoute) and the driving and load cells within the blocks. The information regarding the driving and load cells, and any RC network inside of the soft macro blocks, is provided to FlexRoute by Chip Architect in a file format known as TBEF (Timing-Based Exchange Format), which contains the following data:

    * Driver cell name and the distributed RC network from the driver output to the block pin location
    * Receiver cell names and the distributed RC network from the block pin location to the receiver cell input
    * Input rise time (maximum transition time) at each receiver cell input
    * Net delay constraint from the driver cell input to each receiver cell input
    * Database library cell name and the circuit's operating conditions (process, voltage, and temperature values)

    Figure 5 shows how the various drivers, loads and RC networks are modeled using TBEF.

    [Figure 5 diagram; labels: Load/Drive From PT, Network modeled internally by FlexRoute, Network modeled by TBEF File]

    Figure 5: Top Level TBEF Modeling

    FlexRoute uses PrimeTime's delay calculator when it calculates the cell delay for the hierarchical drivers and repeaters; both the hierarchical drivers/loads and the repeaters are analyzed during optimization.


    At early stages in the design flow, such as when blocks are incomplete, default values for driver/load cells and/or block internal RC networks can be used.

    For this design the initial repeater insertion flow was run in a mode where the aim was to fix input edge rates only. This results in a top level netlist and associated RC data that meet the technology's basic DRC rules. This netlist and top level RC data are used to accurately model the block level interconnect during the push down of the top level timing constraints and block level budget generation.

    After synthesis and placement of the blocks (with the generated block level budgets) repeater insertion was re-run to update the top level netlist with any necessary repeater size changes and/or new repeater insertions; during this run net delay constraints were used on paths identified as critical during the top level timing analysis.
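To give a feel for why long top-level nets need repeaters, the sketch below counts how many evenly spaced repeaters keep every stage under a maximum load. This is a deliberately crude stand-in: FlexRoute works from the extracted RC network and PrimeTime's delay calculator, whereas this example uses a lumped capacitance limit as a proxy for the edge rate rule. The function name and all electrical values are assumptions.

```python
import math


def repeaters_needed(length_um, cap_per_um_ff, pin_cap_ff, max_load_ff):
    """Smallest number of evenly spaced repeaters such that every stage
    (one wire segment plus the input pin it drives) stays under max_load_ff.
    """
    if pin_cap_ff >= max_load_ff:
        raise ValueError("input pin alone exceeds the load limit")
    wire_cap_ff = length_um * cap_per_um_ff
    segments = max(1, math.ceil(wire_cap_ff / (max_load_ff - pin_cap_ff)))
    return segments - 1


# Hypothetical values: 0.25 fF/um wire, 10 fF input pins, 310 fF load limit.
print(repeaters_needed(6000, 0.25, 10, 310))  # long top-level route
print(repeaters_needed(800, 0.25, 10, 310))   # short route
```

A limit-driven count like this corresponds loosely to the first pass described above, where only edge rates were fixed; the later pass with net delay constraints would need a real delay model.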

    Pin Placement Analysis

    As mentioned above, the QoR of a floorplan and associated pin placement can be judged by both qualitative (e.g. are the congestion hotspots getting smaller) and quantitative (e.g. is the total wirelength reduced) metrics. Figure 6 shows three steps in pin assignment using the same floorplan, along with the associated total wirelength.

    [Figure 6 panels, left to right: Wirelength = 36,155u; Wirelength = 25,309u; Wirelength = 21,766u]

    Figure 6: Wire Length Results For Various Pin Placements

    The first snapshot is of the initial floorplan with no top-level pin assignment, so pin placement is driven (upwards) purely by the block level pin requirements. The second snapshot shows the results of a top down pin assignment using Chip Architect. The third snapshot shows the results of a combination of scripted pin placements for the BSCAN blocks and RBPA from FlexRoute.

    Clearly both the actual floorplan layout and the block level pin placement have a dramatic effect on the

    total routing length (and by inference the congestion) at the top level of the design.
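The size of the improvement is easier to see as a percentage of the starting point. The short Python calculation below uses the three wirelength figures reported in Figure 6; only the dictionary labels and the helper name are introduced here for illustration.

```python
def reduction_pct(baseline, value):
    """Percentage reduction of value relative to baseline."""
    return 100.0 * (baseline - value) / baseline


# Total top-level wirelength from Figure 6, in the units used there.
wirelength = {
    "no top-level pin assignment": 36155,
    "Chip Architect top down assignment": 25309,
    "scripted BSCAN pins plus FlexRoute RBPA": 21766,
}

baseline = wirelength["no top-level pin assignment"]
for step, length in wirelength.items():
    print(f"{step}: {length} ({reduction_pct(baseline, length):.1f}% shorter)")
```

On these figures the top down assignment removes roughly 30% of the wirelength, and the scripted and routing based refinement brings the total reduction to roughly 40%.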

    Block Level Placement

    Once the block aspect ratios and pin assignments have stabilized, block level internal refinement and macro placement can begin. For those blocks that have macros (e.g. RAMs, ROMs, etc.), manual pre-placement must be performed. These macros were placed such that the placement array, the area in which the discrete standard cells will be placed, was contiguous and free of needless fragmentation. Close attention was paid to the placement of macros relative to obstructions, such that all pins are fully accessible at the time of route. This meant taking into account the location of the memories with respect to block edges and power supply straps and trunks.

    In the case of this ASIC, 5 blocks had approximately 180 discrete memories that required placement. These physical memories formed a few dozen larger logical memories; hence a significant portion of the placement exercise was properly grouping related memories.

    Upon completion of any macro level placement, the block level power plan is finished inside Chip Architect to provide a true representation of the placement keepout areas and blockages within the block. For this particular design a series of block level rings was used to provide block level power, as opposed to a global power striping approach.

    Once the block is completely described, the block physical information can be exported for use in Physical Compiler, where the timing constraints and RTL or gate level netlist (as was the case for this design) are read in to provide both physical and timing constraints. A Gates to Placed Gates flow was used for this design: the design was synthesized to gate level using Design Compiler and then Physical Compiler was used to place and optimize the resulting netlist. This choice of flow was driven by the BIST test insertion tool's requirement to operate on a single chip level gate level netlist in order to insert the requisite test points and test control circuitry. Because the test tool lacks technology specific knowledge regarding cell drive strengths and circuit optimization strategies, a preliminary optimization had to be performed using Design Compiler prior to loading the design into Physical Compiler. A better approach would be to use test tools that operate at the RTL level and are cognizant of the design timing requirements; this would allow a more flexible RTL to placed gates flow to be utilized. As of the writing of this paper there are no such tools openly available.

    Physical Compiler optimizes the design based upon several cost functions such as routing congestion, DRC costs (e.g. transition time, fanout, etc.) and timing while performing standard cell placement. Such an approach allows routability and timing performance to be predicted, and most DRC problems to be fixed.

    Since the block constraints are developed top down and are independent of each other, multiple Physical Compiler runs can be used to process the blocks in parallel. There is no need to wait for one or more blocks to finish in order to derive the constraints for other connected blocks; in this way the overall design cycle time can be reduced significantly. Due to the speed of the flow it was not found necessary to implement an ECO type flow since, once the flow was set up, block level changes could be implemented very quickly from scratch.

    Block Level Congestion Analysis

    Chip Architect allows the user to generate routing congestion maps based upon a global route. While this is not the actual final route, it can be shown that the level of detail used in this global route models the final Cadence SE detail route to a high level of accuracy. The global route takes into account the various routing layers available, the number of individual wires that can fit in a given routing channel, and obstructions to routing such as power rails, wiring keepout areas and cell ports.

    In the congestion map, areas highlighted as over 100% utilization should be examined carefully. In some cases the detail router will be able to detour around the congested areas to find a solution; in other cases, such as those close to block perimeters or in between large hard macros, even a single point of 100% utilization may cause routing issues later in the flow. In general, the more highly congested the routing area, and the more widespread the congestion, the less likely it is that there will be good correlation between the Chip Architect route estimates and the Cadence SE detail routes.
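The utilization figure behind a congestion map is simply routing demand divided by routing capacity for each global routing cell. The Python sketch below flags cells above a threshold; it is a generic illustration, not Chip Architect's implementation, and the grid values are hypothetical.

```python
def congestion_hotspots(demand, capacity, threshold=1.0):
    """Return (row, col, utilization) for every global routing cell whose
    utilization (tracks demanded / tracks available) exceeds threshold.

    Cells with zero capacity are fully blocked: they are reported with
    infinite utilization if any route wants to cross them.
    """
    hot = []
    for r, (demand_row, capacity_row) in enumerate(zip(demand, capacity)):
        for c, (d, cap) in enumerate(zip(demand_row, capacity_row)):
            if cap == 0:
                if d > 0:
                    hot.append((r, c, float("inf")))
                continue
            utilization = d / cap
            if utilization > threshold:
                hot.append((r, c, utilization))
    return hot


# Hypothetical 2 x 2 grid; capacity is reduced where power straps run.
demand = [[8, 12], [5, 10]]
capacity = [[10, 10], [10, 8]]
for row, col, util in congestion_hotspots(demand, capacity):
    print(f"cell ({row}, {col}) is at {util:.0%} utilization")
```

Lowering `threshold` below 1.0 is a simple way to apply the stricter scrutiny suggested above for cells near block perimeters or between hard macros.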


    [Figure 7 congestion map; annotations: Higher Congestion Area, Lower Congestion Areas]

    Figure 7: Block Level Congestion Analysis

    The congestion map for one of the larger blocks is shown in Figure 7. It is difficult to interpret in black and white (the actual display in Chip Architect is color coded), so a textual report of the congestion is given below.

    The report shows the percentage of global routing resources for each layer that fall into a particular utilization range. The rows do not all add up to 100% since fixed obstructions such as power straps are excluded.

    Metal Peak


    The histograms below show the results of comparing Chip Architect, FlexRoute, and HyperExtract following top level routing. HyperExtract is taken as the golden data in this case since it performs a full (and slow) 3D extraction.

    Based upon the results shown below (and other testcases) we are confident in using the extraction results from FlexRoute to guide both our top level timing and our overall chip level timing closure.

    [Figure 9 histogram, titled "Chip Architect Vs HyperExtract Capacitance Delta"; x-axis: capacitance delta in pF, ticks from -1 to 5; y-axis: number of nets, ticks from 0 to 600]

    Figure 9: Chip Architect Vs HyperExtract Capacitance Delta Histogram

    As can be seen from the histogram, the Chip Architect extraction shows significant differences from the true HyperExtract data. This is due to the length of the top-level routes and the additional spacing and/or shielding that appears in the routing and is not modeled in the Chip Architect global routes.

    Figure 10: FlexRoute Vs HyperExtract Capacitance Delta Histogram (x-axis: capacitance delta in pF, -0.6 to 0.3; y-axis: number of nets, 0 to 1200)


    The FlexRoute 2.5D extraction matches very well with the HyperExtract results. There are still some differences, but these occur only on long nets where the actual delta is a small percentage of the total capacitance.
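    The comparison behind these histograms can be sketched roughly as follows. This is a minimal illustration, not the scripts used on the project: it assumes the per-net total capacitances from each extractor have already been read into dictionaries, and the function name and bin width are hypothetical.

```python
import math
from collections import Counter


def capacitance_delta_histogram(estimated_pf, golden_pf, bin_width_pf=0.1):
    """Bin per-net capacitance deltas (estimate minus golden), in pF.

    Both arguments map net name -> total net capacitance in pF.
    Returns {bin_lower_edge_pf: number_of_nets}. Nets missing from the
    estimated data are skipped.
    """
    histogram = Counter()
    for net, golden in golden_pf.items():
        if net not in estimated_pf:
            continue
        delta = estimated_pf[net] - golden
        # Small epsilon keeps values sitting on a bin edge in that bin.
        bin_index = math.floor(delta / bin_width_pf + 1e-9)
        histogram[round(bin_index * bin_width_pf, 6)] += 1
    return dict(histogram)
```

    A narrow histogram centered on zero, as in Figure 10, indicates that the faster extraction tracks the golden data closely.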

    Clock Tree Methodology

    The clock tree methodology planned up front was to create independent clock distribution trees in the hierarchical blocks and a tree at the top level of the design to balance skew between the blocks. For this approach to work, each block was required to have the same insertion delay for a given clock.

    The minimum insertion delay for a clock was determined by processing each clock/block combination

    through Cadence CTGEN. As expected, the largest blocks, with the largest number of clock endpoints, resulted in the largest insertion delay.

    A small overhead was added to the minimum insertion delay to ensure repeatability if the number of

    loads in a block increased, and this insertion delay was then used to create equivalent depth clock trees in the smaller blocks. Since several blocks had multiple clocks, several runs of CTGEN were required

    initially to build up this information.
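    The bookkeeping described above can be sketched as follows. This is an illustrative sketch only: the function name, the data layout, and the 50 ps default overhead are assumptions, since the paper does not state the overhead that was actually used.

```python
def target_insertion_delays(min_delay_ps, overhead_ps=50.0):
    """Return {clock: target insertion delay in ps}.

    min_delay_ps maps (clock, block) -> minimum achievable insertion
    delay in ps, one entry per clock tree run. The target for a clock
    is the slowest block's minimum delay plus a fixed overhead
    (overhead_ps is a hypothetical value, not taken from the paper).
    """
    worst = {}
    for (clock, _block), delay in min_delay_ps.items():
        worst[clock] = max(worst.get(clock, 0.0), delay)
    return {clock: delay + overhead_ps for clock, delay in worst.items()}
```

    Every block on a given clock is then built to that clock's target, so the smaller blocks receive deliberately deeper trees than they would otherwise need.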

    As the insertion delays on separate trees were matched, care was taken to make sure that the number of

    levels of buffers in the trees remained constant. This helps to maintain insertion delay tracking over process, temperature, and voltage.

    With the blocks having equal insertion delays the problem of clock tree insertion at the top level became

    one of creating a zero skew network from the clock source (typically a loose standard cell mux at the top level) to the clock input pins on each top-level block.

    Attempts to use several different automated clock tree methodologies, including CTGEN, produced less

    than optimal results at the top level. This was concluded to be due to the very large area at the top level combined with sparse placement area availability and the small number of clock loads.

    In the end a manual approach was taken to inserting the small tree required for the top-level clock

    balancing. A Tcl script was developed which allowed the user, using Chip Architect, to manually insert

    a buffer into a clock net, place and route it in the floorplan, and receive immediate feedback as to the balance of the loads it is driving.

    To provide quick feedback, the output net connected to the newly inserted buffer was incrementally routed and a Reduced SPEF (RSPEF) generated for the net and its loads. Since the RSPEF reduces the driven RC network into an equivalent RC load for the driving cell and a set of endpoint delays for the loads on the net, it is a simple task to parse the resulting RSPEF and provide accurate feedback regarding the skew from the inserted buffer to the loads it drives.

    In order to speed up the analysis, each of the top-level blocks was replaced by an extracted STAMP model of its internal netlist, or by an estimated input pin loading where blocks were incomplete. The use of abstracted models ensures that the RSPEF extraction and subsequent analysis run in a matter of seconds. Without these models the RSPEF extraction takes a significant amount of time since it traverses the hierarchy to each clock tree end point.
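    The skew feedback for a single buffered net can be sketched as below. The original script was written in Tcl inside Chip Architect and is not reproduced in the paper, so this Python version is only an illustration: it assumes the endpoint delays have already been parsed out of the RSPEF into a dictionary, and the function and pin names are hypothetical.

```python
def net_skew_ps(endpoint_delay_ps):
    """Return (skew, earliest_pin, latest_pin) for one buffered net.

    endpoint_delay_ps maps load pin name -> interconnect delay in ps
    from the inserted buffer's output to that pin.
    """
    if not endpoint_delay_ps:
        raise ValueError("net has no loads")
    earliest = min(endpoint_delay_ps, key=endpoint_delay_ps.get)
    latest = max(endpoint_delay_ps, key=endpoint_delay_ps.get)
    skew = endpoint_delay_ps[latest] - endpoint_delay_ps[earliest]
    return skew, earliest, latest
```

    Reporting the earliest and latest pins alongside the skew tells the user which branch to lengthen or shorten when the buffer is moved.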

    By working backwards from the load points (i.e. the inputs to the block) it was a simple matter to

    balance each level of the top-level clock tree. Due to the small number of block input pins to be driven, only three levels of clock tree, using 6 buffers, were required for the largest tree.

    After the complete tree was assembled in a piecemeal fashion, a full clock tree analysis was run in PrimeTime using post route data from FlexRoute to verify final results.

    For the largest block the largest skew value was 280ps, the worst-case top level skew was 40ps, and the worst-case difference in insertion delay for the blocks was 20ps. The final, post route, measured clock skew across the chip was 330ps worst case.
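    One way to read these figures, under the assumption (not stated in the paper) that the three contributions simply add in the worst case, is as a pessimistic bound on cross-chip skew. The function name below is hypothetical.

```python
def cross_chip_skew_bound_ps(block_skew_ps, top_level_skew_ps, insertion_delta_ps):
    """Pessimistic bound, assuming the three contributions add directly."""
    return block_skew_ps + top_level_skew_ps + insertion_delta_ps


bound = cross_chip_skew_bound_ps(280, 40, 20)
print(bound)  # 340, against a measured worst case of 330 ps
```

    On that reading the measured 330ps sits just inside the 340ps sum of the individual worst cases.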


    Multi Block Timing Analysis

    Since DSPEF is not a hierarchical format, the individual DSPEF files for each block and the top level must be merged. Full chip timing analysis is achieved by merging the multiple DSPEF files generated for each block and completing the RC networks at the interface to each block within PrimeTime. Rather than wait for all blocks to be completed, it is useful to perform partial chip timing analysis simply by annotating the top level and those blocks that are available.

    For the final timing analysis the top level DSPEF is extracted from the FlexRoute database, while the block level DSPEF files are extracted from a post route Cadence SE database using Synopsys Arcadia or Cadence Hercules.

    At intermediate stages of block completion, DSPEF generated from Chip Architect may be used as an accurate estimate for the block level post route DSPEF. As indicated previously, this can be performed with a high degree of confidence that the timing will match post route timing. The script environment used to back annotate the DSPEF used a search path to determine whether to back annotate the actual Cadence DSPEF or, if that was not available, the Chip Architect estimated DSPEF.
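    The search path selection can be sketched as follows. The project scripts were Tcl and are not shown in the paper, so this is an illustration only: the function name, the directory layout, and the `.dspef` file suffix are all assumptions.

```python
from pathlib import Path


def pick_parasitics(block, search_path, suffix=".dspef"):
    """Return the first existing parasitics file for `block`, or None.

    search_path is an ordered list of directories. Listing the post
    route directory ahead of the estimate directory makes the actual
    data win whenever it exists.
    """
    for directory in search_path:
        candidate = Path(directory) / f"{block}{suffix}"
        if candidate.is_file():
            return candidate
    return None
```

    With this arrangement, a block switches from estimated to actual parasitics as soon as its post route file is dropped into the first directory, with no change to the timing scripts.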

    Figure 11: Full Chip Timing Analysis Flow (diagram elements: FlexRoute, Chip Architect, Top Level LEF/DEF, Top Level DSPEF, Top Level Verilog, Block DSPEF, Block Verilog, Chip Timing Constraints, PrimeTime, Full Chip SDF, PrimeTime)

    As each block level DSPEF file is annotated in PrimeTime, an error message is generated for each primary I/O of the block. These messages are generated since, at the time of reading the block, the RC network is incomplete for primary I/Os (i.e. it does not contain a complete driver, RC network, and receiver). After the final top-level DSPEF file is annotated, a final check (using the PrimeTime command report_annotated_parasitics -list_not_annotated) is performed to ensure that all nets, especially those that cross hierarchical boundaries, have been annotated correctly.

    Once all of the sub blocks, and the interconnecting top-level design, have been back annotated with DSPEF, the PrimeTime command complete_parasitics is used to merge the interblock networks.

    A technique used to speed up multiple chip level timing analysis runs was to back annotate all of the

    DSPEF files and then write out a full chip SDF file for each operating condition. Subsequent timing analysis runs then save time by simply reading the existing SDF file instead of the relatively slow

    process of annotating multiple DSPEF files and calculating delay information. Tcl scripts using search

    paths and file time stamps can be used to automate this process.
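    The time stamp check can be sketched as below. Again this is a Python illustration rather than the Tcl actually used, and the function name is hypothetical: the cached SDF is reused only if it is at least as new as every parasitics file it was derived from.

```python
import os


def sdf_is_current(sdf_file, parasitics_files):
    """True if sdf_file exists and is at least as new as every input.

    parasitics_files must all exist; a missing input raises OSError.
    """
    if not os.path.isfile(sdf_file):
        return False
    sdf_time = os.path.getmtime(sdf_file)
    return all(os.path.getmtime(path) <= sdf_time for path in parasitics_files)
```

    When the check fails, the flow falls back to annotating the DSPEF files and regenerating the SDF for that operating condition.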


    Chip Assembly

    As individual blocks finish routing and are verified for post route timing, they can be individually verified for DRC/LVS issues in a stand-alone mode.

    The final chip assembly operation occurs inside of a GDSII Editor where the individual blocks are

    merged with the top level GDSII. This particular design used a Synopsys tool (SLE) for GDSII merging, but the same process is easily accomplished in other tools.

    Once assembled, it is necessary to perform LVS and DRC runs to check for problems at the interfaces between the top level routing structures and the block level structures.

    Figure 12: GDSII Merging Process (diagram element: Layout Editor)

    Summary

    Using the flow outlined it has been shown, on a real design, that timing closure can be driven from the

    top level of a design, and that timing closure need not wait for a full flat GDSII file of the design to

    perform extraction on.

    Using top down physical and timing budget generation, individual SoC blocks can be implemented in relative isolation from each other and still provide timing closure at the chip level when they are re-assembled. The critical top level detailed routing can be completed early in the design flow and the extracted RC information used to drive the block level budgeting and synthesis process. The back end design flow involvement in the timing closure loop is reduced to the detail routing and extraction of the individual SoC blocks.

    Adoption of the flow, and the use of Chip Architect, FlexRoute, and Physical Compiler, may require a change in the organization or roles of typical design teams as more of the physical implementation is

    performed in parallel with the development of the RTL and ideally by the same engineers.

