Top Down Asic



Abstract

This paper describes recent experiences using physical synthesis tools and top down timing and physical budgeting to complete a 1.6M gate, 0.25u ASIC. The focus of the paper is on the use of floorplanning, top level routing, physical synthesis, and static timing analysis tools to minimize, or remove, the need for costly back end iterations in order to close timing on large designs.

The design in question contains several hard and soft proprietary IP blocks, including third party IP for a PCI controller, on-chip memory BIST and LogicBIST. Top down timing and physical budgeting is applied to partition the design into 11 functional blocks, including 4 JTAG boundary scan blocks. Block level placement and top-level interblock routing occur early in the process to drive the refinement of the initial timing and area budgets. Each of the blocks is then independently synthesized, placed and routed using Synopsys and Cadence tools. After all blocks are complete, they are re-assembled and merged with the top level routing to create a final design that meets the originally budgeted timing in a single pass.

The methods used are not particularly design specific and can be applied to similar or even larger SoC designs to assist in shortening the overall design cycle.

Since this is a confidential customer design of Synopsys Professional Services, the application in question is not disclosed. The tools and flow used were installed at both the customer's site and at Synopsys design centers.

Author Biographies

Chris Smith received his Bachelor of Engineering (Honors) in EE from the University of Wales Institute of Science and Technology (UWIST) in 1981. Since then he has worked continuously in the electronics and EDA industries. His first ASIC, a 6K gate design in a 3u DLM process, was designed in 1984/5 using schematic capture on a Daisy Logician workstation. He has worked for British Aerospace, LSI Logic, Motorola, and Cascade Design Automation. Since 1998 he has been a consultant for Synopsys Professional Services.

Devin Bright received his Bachelor of Science degree in Electrical and Computer Engineering in 1991 and his Master of Science degree in Electrical Engineering in 1993, both from the University of Iowa. Since 1999 he has been a consultant with Synopsys Professional Services, where he has assisted multiple customers in designing numerous complex ASICs. Prior to joining Synopsys he worked for Motorola, architecting cellular infrastructure equipment and designing the necessary embedded silicon.

The authors have worked together on various Synopsys Professional Services consulting projects, introducing design flow methodologies similar to those discussed in this paper to customers across a wide range of applications and technologies.



Introduction

Top Down Flow Overview And Flow Comparison

Implementation of the ASIC is primarily a top down approach, but some steps, such as initial block sizing estimates, are performed in a bottom up manner where necessary to reduce iterations.

In a top down approach the chip is implemented from the die size, I/O timing and block placement requirements downwards, while in a bottom up approach the chip is implemented from block level synthesis results upwards.

Traditional floorplanning based design flow

In many design flows floorplanning tools have been in use for several years as an interface to detailed place and route. Floorplanning has typically been seen as a task that aids the back end design team and is usually performed late in the design implementation schedule. This does little to enhance the efficiency of the front-end design team during the early stages of design realization.

Feedback to the synthesis process occurs during the later stages of design implementation, in the form of extracted RC data from post route. In general the problem of timing closure becomes a process of the back end team iterating the floorplan and place and route, and re-generating RC data for the front end team to analyze.

In addition, for large SoC designs the critical top level, inter block timing and routing information is not generally available until after all of the blocks have been through place and route. This is especially true for design flows that flatten the complete design prior to place and route.

The inefficiencies resulting from this late stage data feedback often result in needless and costly iterations between the front end and back end design teams.

[Flow diagram, divided into Front End and Back End: Block Level Synthesis, Netlist Assembly, Floorplanning, Place & Route, RC Extraction and Chip Level STA, driven by Block Level Constraints and Chip Level Constraints, with a single Timing Closure Loop.]

Figure 1: New Tools Old Flow Methodology



Floorplanning as a front end tool

The floorplanning stage of the design flow can, and should, be moved into the front end of the design process to speed up timing closure for large SoC designs. The flow depicted in Figure 1 usually requires the handoff of data between multiple teams and, at a minimum, defers full, accurate knowledge of the chip level timing until all blocks have been placed and routed along with the top level interconnect. Typically chip level constraints are created in a manual process from the block level constraints, and this is seen as a separate step from the creation of the block level constraints. In many cases full chip level constraints are created after the fact from the individual block level constraints used for synthesis.

In a top down approach, physical constraints (e.g. block abstract, pin placement, top level routes, etc.) and timing constraints are pushed down from the chip level to the block level early on and used to drive the parallel implementation of each block. The overall chip is sized based on early area estimates for soft macros, which can be derived from a quick synthesis run (not necessarily with finalized RTL); the designer's judgment and experience are needed to estimate sizes for incomplete or missing blocks. The overall core area is then partitioned amongst the blocks in proportion to the estimated area resources required and the overall timing requirements.

Similarly, timing constraints are defined at the chip I/O level and pushed down to the block level. Special consideration, or budgeting, is needed for those paths that are inter block in nature (i.e. beginning in one block and ending in another block, potentially passing through multiple blocks en route). For such a path, the overall timing is budgeted amongst the blocks through which it passes, including the net delays of the top-level routing. Using PrimeTime to generate block level constraints from a set of chip level constraints is key to this flow.

Early definition of top level routing and timing is essential for large designs if timing closure iterations are to be minimized.
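The budgeting idea for a single inter block path can be illustrated with a small sketch. This is not how PrimeTime derives block constraints; it only assumes that the time left after the top-level net delays is shared among the blocks in proportion to an estimated internal delay. The function name, block names and delay values are hypothetical.

```python
def budget_path(period_ns, top_net_delays_ns, block_estimates_ns):
    """Split one inter-block path's timing among the blocks it crosses.

    Assumption (not from the paper): the time remaining after top-level
    net delays is shared in proportion to each block's estimated delay.
    """
    available = period_ns - sum(top_net_delays_ns)
    if available <= 0:
        raise ValueError("top-level net delays alone exceed the clock period")
    total_estimate = sum(block_estimates_ns.values())
    return {
        block: available * estimate / total_estimate
        for block, estimate in block_estimates_ns.items()
    }


# Hypothetical path blk_a -> blk_b -> blk_c on a 135 MHz clock.
period_ns = 1000.0 / 135.0  # about 7.41 ns
budgets = budget_path(
    period_ns,
    top_net_delays_ns=[0.6, 0.4],
    block_estimates_ns={"blk_a": 2.0, "blk_b": 1.0, "blk_c": 1.0},
)
for block, budget in budgets.items():
    print(f"{block}: {budget:.2f} ns")
```

The block budgets plus the top-level net delays always sum to the clock period, which is the property that lets blocks be closed independently.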

[Flow diagram, divided into Front End and Back End: Top-Level Netlist Assembly, Module Stubs, Floorplanning, Top-Level Routing, RC Extraction, Block Level Synthesis, Block Level Route, Final RC Extraction and Chip Level STA, driven by Block Level Constraints and Chip Level Constraints, with a Primary Timing Closure Loop and a Secondary Timing Closure Loop.]

Figure 2: Floorplan Centric Flow



Such early allocation of constraints creates an environment in which the blocks can be realized independently and in parallel. Once every block is complete and meets its pushed down block level constraints, the overall chip can be assembled. Assuming an accurate budgeting process was performed, the composite chip is then well positioned to meet chip level constraints in a single pass.

The main timing closure loop occurs in the front end of the design flow, where tradeoffs between block size, RTL implementation and top level routing can more easily be made. By providing accurate top level routing and good block level routing estimates it is possible to generate a consistent set of block constraints which, if all met individually, will result in overall chip level timing closure.

The back end process then becomes the relatively simple task of completing block level routing for each block in the design, driven by the physical and timing constraints generated during floorplanning and top level routing.

Design Description

The ASIC is an element of a highly integrated processing subsystem for a next generation communication system. As with many communication systems, embedded memory requirements were large: approximately 1.9M bits of memory. This logical memory was implemented in nearly 200 discrete memory macros in order to maintain the required memory cycle times. Additionally, another 1.6M discrete gates were required to implement the desired system level functionality in the selected 0.25u, 5 metal layer process.

This design is mostly synchronous, using primarily 4 clocks derived from two multiplying analog PLLs (APLL), with a few additional, non-synchronous clocks associated with external peripheral interfaces. The fastest clock output from the APLL, and also the most prevalent, was a 135 MHz clock.

[Floorplan view with labeled regions: PCI Interface, APLLs, JTAG BSR Blocks.]

Figure 3: Chip Top Level Blocks



Top Level Floorplanning

Die Sizing

The ASIC was sized at 9.25 mm by 9.25 mm. The die size is the primary physical constraint; within reasonable bounds it can be varied to trade off and optimize numerous technical and non-technical criteria. These criteria include packaging requirements, wafer yield (and associated cost per die), ease of place and route, and timing requirements. For a hierarchical implementation, spacing for interblock routing at the top level must also be accounted for, and may be traded off against increasing the die size or decreasing block sizes, which increases potential congestion issues.

I/O Placement

The ASIC was designed using standard perimeter I/O. A key constraint driving the I/O placement was the signal groupings per the board layout. The I/O were positioned around the die such that, when bonded out via the package, they were positioned consistent with the requirements derived from the board layout and routing tracks.

In addition, the need to manage timing skew across a wide bus was another constraint on I/O placement. The skew sensitive I/O signals were evenly distributed along the two sides of the die adjoining its lower right corner. Given that the communicating block emanated from the same corner, this approach minimized the difference between the farthest and nearest I/O in the bus. This was a first step in proactively managing the route variability, or skew, across the bus.

For the ASIC, three separate power supplies (core, I/O, and analog) were required in varying quantities to supply various parts of the design. The core supply pads were evenly distributed around all four sides of the die to power a core assumed to be uniform in its dissipation. The I/O power pads were placed such that the peak I/O power dissipation between any two adjacent pairs would be nearly consistent. The analog power pads were positioned in close proximity to the APLL macros that required them.

Beyond the macro positioning discussed so far there were also micro positioning requirements that required attention. Most notable was the minimum I/O pitch requirement, which was satisfied by respecting the spacing requirement. Less obvious was the minimum wire pitch requirement in place for the bonding wires. These requirements could not be checked until the die size was fixed and the package substrate designed. The likelihood of impact from this rule was minimized by providing extra space between the I/O cells near the corners, where extreme (i.e. non-orthogonal) wire angles were most likely to result in a spacing violation.

Block Sizing and Placement

Block sizing and positioning is, by necessity, an iterative process in which block aspect ratios, block area, and overall placement must be balanced to produce a floorplan that will meet the die area, routability, and timing requirements. While automatic block level placement may be used to provide an initial starting point, for complex SoC designs it is usually necessary to use manual intervention to create an optimal floorplan solution. Determination of an optimal floorplan requires that the tools used provide meaningful metrics and feedback by which to measure the correctness of the floorplan.

Chip Architect provides visual connectivity feedback in the form of rats nest fly-lines between the blocks, and weighted net connectivity, which shows the number of connections between blocks. Quantitative metrics such as total Steiner route wire length, individual net global route lengths, or estimated resistance/capacitance from global routes may be used to monitor whether a given floorplan improves upon its predecessors.
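As a rough illustration of comparing floorplans by a wire length metric, the sketch below sums the half-perimeter bounding box of each net's pins. Half-perimeter wire length is used here only as a simple stand-in for the Steiner route length the tool reports; the net names and coordinates are hypothetical.

```python
def half_perimeter(pins):
    """Half-perimeter of the bounding box around a net's (x, y) pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets):
    """Sum the half-perimeter estimate over all nets in a floorplan."""
    return sum(half_perimeter(pins) for pins in nets.values())


# Two hypothetical floorplans of the same netlist, pin coordinates in microns.
floorplan_a = {
    "n1": [(0, 0), (100, 50)],
    "n2": [(10, 10), (10, 200), (60, 90)],
}
floorplan_b = {
    "n1": [(0, 0), (40, 50)],
    "n2": [(10, 10), (10, 120), (60, 90)],
}

length_a = total_wirelength(floorplan_a)
length_b = total_wirelength(floorplan_b)
print(f"A: {length_a}u  B: {length_b}u  improved: {length_b < length_a}")
```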



[Two panels: Rats Nest Flyline Display; Weighted Connectivity Display.]

Figure 4: Different Connectivity Displays

For the ASIC, a rough estimate of each block's area was obtained by a quick synthesis (using Synopsys design_compiler) to obtain a gate/instance count, and then basing the block size on the instance and/or gate count placed at a certain density (typically 80% was used). A physical hierarchical partitioning was then selected which attempted to keep related pieces of logic together, while still balancing block sizes and minimizing block interconnectivity. A balance needs to be struck between the size of core blocks and the number of core blocks. This ultimately resulted in 11 core logic blocks and 4 BSCAN blocks. Once identified, these block sizes were used to create the initial floorplan layout. The floorplan was created and manipulated using Chip Architect.

The four BSCAN blocks had extreme aspect ratios; the purpose of these blocks was to provide an array in which the JTAG Boundary Scan Registers (BSR) could be placed. These blocks were long and narrow, effectively spanning the I/O on a side.

As an early floorplan quality check, Chip Architect pin assignment and global router capabilities were used to identify any major problems. Sufficient block perimeter to support the pins at the desired pitch, and channel utilization, were two of the metrics monitored. By this approach, major top-level routing problems were identified early on, when they were easiest to correct.
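The initial block sizing described above amounts to simple arithmetic, sketched below. The 80% density comes from the text; the area per gate, the gate count and the aspect ratio are hypothetical values chosen only for illustration, and the function name is not from any tool.

```python
import math


def block_size(gate_count, area_per_gate_um2, utilization=0.80, aspect_ratio=1.0):
    """Estimate block area and dimensions from a synthesized gate count.

    aspect_ratio is width / height. area_per_gate_um2 is a library
    dependent figure and is an assumption in this sketch.
    """
    area = gate_count * area_per_gate_um2 / utilization
    width = math.sqrt(area * aspect_ratio)
    height = area / width
    return area, width, height


# Hypothetical block: 100k gates at 20 um^2 per gate, four times wider than tall.
area, width, height = block_size(100_000, 20.0, utilization=0.80, aspect_ratio=4.0)
print(f"area = {area:.0f} um^2, width = {width:.1f} um, height = {height:.1f} um")
```

The resulting perimeter can then be compared against the number of block pins times the pin pitch, which is the perimeter check mentioned above.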

First Pass Block Pin Placement

Initial soft macro block pin placement is determined using Chip Architect in a top down manner, pin assignment being driven only from the top level interconnect of the design. Global routing and congestion analysis are used to determine the QoR of the pin placement, and manual or scripted modifications to block pin locations are made to improve QoR in an iterative process.

Block level pin constraints are used to restrict the pin placement to specific routing layers, required grid spacing, and specific sides of each block.

For simple pin assignment tasks (such as straight bus connections between blocks and I/O pads) the Chip Architect results, which are based upon a fly-line connectivity analysis, are usually satisfactory. For certain other regular pin assignments, scripts were generated to create design specific pin placements; a specific example of this is discussed below for the BSCAN blocks.

Once the top-level netlist is stable and the top-level floorplan is reasonably defined, further refinement of the block pin assignment and a more complete analysis of the top level routing are performed using FlexRoute.



BSR Pin Placement

The 4 BSCAN blocks had extreme aspect ratios that resulted in a limited number of routing resources across the shorter width of the block. To avoid congestion problems inside of the blocks, pin assignment for each BSCAN cell input had to be closely aligned with the associated BSCAN cell output. Since neither Chip Architect nor FlexRoute has any capability to match input/output pin pairs, scripts were developed to assist in pin placement for this special case.

The pins on the I/O side of the blocks were automatically aligned with their respective I/O cells as a part of the default Chip Architect pin assignment. The pins on the core side were then aligned, using a design specific Tcl script, with the matching pin on the I/O side of the block. Without this scripting methodology the core side pins would tend to clump together around the top level routing channels into which their connections flowed. While this would be effective for easing top level routing, it would cause significant routing congestion issues inside of the BSCAN blocks.

It was felt that the greater constraint on these pin placements was the limited routing resources inside of the blocks as opposed to the top level routing resources, and so a small penalty in top level routing was traded off against easier to route BSCAN blocks.
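The authors' script was written in Tcl against the Chip Architect database and is not reproduced in the paper. The Python sketch below only illustrates the alignment idea under stated assumptions: each core side pin takes the offset of its matching I/O side pin along the long edge of the block, snapped to a routing grid, and moves to the next free grid slot if that slot is already taken. All names, the data layout and the grid value are hypothetical.

```python
def align_core_pins(io_side_offsets, pin_pairs, grid):
    """Place each core-side pin opposite its matching I/O-side pin.

    io_side_offsets: {io_pin: offset along the block's long edge}
    pin_pairs:       {core_pin: io_pin}
    grid:            pin pitch; every result is a multiple of it
    """
    aligned = {}
    used_slots = set()
    # Process pins in order of position so collisions resolve predictably.
    ordered = sorted(pin_pairs.items(), key=lambda kv: io_side_offsets[kv[1]])
    for core_pin, io_pin in ordered:
        slot = round(io_side_offsets[io_pin] / grid)
        while slot in used_slots:  # slot taken: step to the next grid position
            slot += 1
        used_slots.add(slot)
        aligned[core_pin] = slot * grid
    return aligned


io_side_offsets = {"pad_a": 12.3, "pad_b": 12.4, "pad_c": 50.1}
pin_pairs = {"core_a": "pad_a", "core_b": "pad_b", "core_c": "pad_c"}
placement = align_core_pins(io_side_offsets, pin_pairs, grid=1.0)
print(placement)
```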

Loose Top Level Cells

The design includes a few (fewer than 20) top-level cells that are used to implement system and test clock multiplexing and APLL test control logic. These cells did not lend themselves well to fly-line driven auto placement and it was found to be more effective to manually place them inside of Chip Architect. Since in many cases they form the start or top of a clock tree network, they tended to be rather simple to locate in the center channels of the floorplan.

Top Level Pin Refinement And Routing

Top level routing was implemented using FlexRoute, a gridless, n-layer, object-based router. The initial pass of FlexRoute is used to refine the block pin locations based upon real routing considerations. FlexRoute performs Routing Based Pin Assignment (RBPA) by running multiple global routes in a process that aims to find the optimal pin location for each signal that traverses the top level. As in Chip Architect, pin locations are also driven by block specific parameters for preferred layers and sides of blocks.

The primary objectives for pin assignment are that the top-level routes are of minimal length and do not cause congestion hot spots between the blocks. This tends to imply that, when possible, pins are assigned such that straight line, single metal nets provide the top-level connectivity. Congestion analysis can be performed and further constraints applied to specific nets or routing channels within the chip to reduce congestion hot spots.

Once the pin assignments have been updated (and fed back to Chip Architect) a full detailed route of the top level nets can be performed in FlexRoute. Although FlexRoute is gridless and will perform pin placement based upon simply meeting design rules, it is necessary to restrict the block pin placement to be on grid. This means that the block level router (which is a gridded router) has a simpler task when routing to the block boundaries from the inside.

The FlexRoute global and detail routing engines can take account of net or net segment specific parameters such as metal width and spacing, net shielding, preferred layers and coupling capacitance avoidance. Increased spacing between critical routes, such as clock lines, helps to minimize noise susceptibility due to long routes running in parallel with each other.

For this particular design all blocks were designated as total keepouts for top level wiring. It is possible to route over the blocks by allowing one or more routing channels in different metal layers, but this adds some complexity to the extraction of both top level and block level parasitics in isolation, since the interaction of the over-block and internal block routing must be considered.



Top Level Repeater Insertion

In a bottom up approach block output loading constraints are conservatively estimated, or block output drivers are set to the largest available driver in the library. Whether these estimates are correct, or whether the largest driver in the library is sufficient, is not known until the top level timing is analyzed. Typically the top-level timing cannot be analyzed until the final block integration and detailed route stages of the design flow are completed, at which point iteration back to the block level and/or insertion of top level buffers is required to implement any changes.

For a top down flow, such as the one outlined here, it is critical that accurate top-level timing and loading information is available early in the design flow in order to drive the budget generation process. Top level routing can be completed in FlexRoute without any knowledge of the timing requirements, but for many nets the resulting RC network may be too large to be driven by the output of a block. As a result it is also necessary to insert repeater cells (buffers or inverters) at the top level of the design, both to maintain block input edge rates and to reduce the output loading of each block's output driving cell.

Following top level routing, FlexRoute is used to insert buffers based upon the top-level RC network (extracted by FlexRoute) and the driving and load cells within the blocks. The information regarding the driving and load cells, and any RC network inside of the soft macro blocks, is provided to FlexRoute by Chip Architect in a file format known as TBEF (Timing-Based Exchange Format), which contains the following data:

* Driver cell name and the distributed RC network from the driver output to the block pin location
* Receiver cell names and the distributed RC network from the block pin location to the receiver cell input
* Input rise time (maximum transition time) at each receiver cell input
* Net delay constraint from the driver cell input to each receiver cell input
* Database library cell name and the circuit's operating conditions (process, voltage, and temperature values)

Figure 5 shows how the various drivers, loads and RC networks are modeled using TBEF.

[Diagram labels: Load/Drive From PT; Network modeled internally by FlexRoute; Network modeled by TBEF File.]

Figure 5: Top Level TBEF Modeling

FlexRoute uses PrimeTime's delay calculator when it calculates the cell delay for the hierarchical driver and repeaters. Both the hierarchical drivers/loads and the repeaters are analyzed during optimization.



At early stages in the design flow, such as when blocks are incomplete, default values for driver/load cells and/or block internal RC networks can be used.

For this design the initial repeater insertion flow was run in a mode where the aim was to fix input edge rates only. This results in a top level netlist and associated RC data that meet the technology's basic DRC rules. This netlist and top level RC data are used to accurately model the block level interconnect during the push down of the top level timing constraints and block level budget generation.

After synthesis and placement of the blocks (with the generated block level budgets), repeater insertion was re-run to update the top level netlist with any necessary repeater size changes and/or new repeater insertions; during this run net delay constraints were used on paths identified as critical during the top level timing analysis.

Pin Placement Analysis

As mentioned above, the QoR of a floorplan and associated pin placement can be judged by both qualitative (e.g. are the congestion hot spots getting smaller) and quantitative (e.g. is the total wirelength reduced) metrics. Figure 6 shows three steps in pin assignment using the same floorplan, along with the associated total wirelength.

[Three floorplan snapshots, in order: Wirelength = 36,155u; Wirelength = 25,309u; Wirelength = 21,766u.]

Figure 6: Wire Length Results For Various Pin Placements

The first snapshot is of the initial floorplan with no top-level pin assignment, so pin placement is driven (upwards) purely by the block level pin requirements. The second snapshot shows the results of a top down pin assignment using Chip Architect. The third snapshot shows the results of a combination of scripted pin placements for the BSCAN blocks and RBPA from FlexRoute.

Clearly both the actual floorplan layout and the block level pin placement have a dramatic effect on the total routing length (and by inference the congestion) at the top level of the design.

Block Level Placement

Given that the block aspect ratios and pin assignments have stabilized, block level internal refinement and macro placement can begin. For those blocks that have macros (e.g. RAMs, ROMs, etc.), manual pre-placement must be performed. These macros were placed such that the placement array, the area in which the discrete standard cells will be placed, was contiguous and free of needless fragmentation. Close attention was paid to the placement of macros relative to obstructions so that all pins are fully accessible at the time of route. This meant taking into account the location of the memories with respect to block edges and power supply straps and trunks.

In the case of this ASIC, 5 blocks had approximately 180 discrete memories that required placement. These physical memories formed a few dozen larger logical memories; hence a significant portion of the placement exercise was properly grouping related memories.

Upon completion of any macro level placement, the block level power plan is finished inside of Chip Architect to provide a true representation of the placement keepout areas and blockages within the block. For this particular design a series of block level rings were used to provide block level power, as opposed to a global power striping approach.

Once the block is completely described, the block physical information can be exported for use in Physical Compiler, where the timing constraints and RTL or gate level netlist (as was the case for this design) are read in, so that both physical and timing constraints are available. A Gates to Placed Gates flow was used for this design: the design was synthesized to gate level using Design Compiler and then Physical Compiler was used to place and optimize the resulting netlist. This choice of flow was driven by the BIST test insertion tool's requirement to operate on a single chip level gate level netlist in order to insert the requisite test points and test control circuitry. Because the test tool lacks technology specific knowledge regarding cell drive strengths and circuit optimization strategies, a preliminary optimization had to be performed using Design Compiler prior to loading the design into Physical Compiler. A better approach would be to use test tools that operate at the RTL level and are cognizant of the design timing requirements; this would allow a more flexible RTL to placed gates flow to be utilized. As of the writing of this paper there are no such tools openly available.

Physical Compiler optimizes the design based upon several cost functions such as routing congestion, DRC costs (e.g. transition time, fanout, etc.) and timing while performing standard cell placement. Such an approach allows for the prediction of routability and timing performance, along with the ability to fix most DRC problems.

Since the block constraints are developed top down and are independent of each other, multiple Physical Compiler runs can be used to process the blocks in parallel. There is no need to wait for one or more blocks to finish in order to derive the constraints for other connected blocks; in this way the overall design cycle time can be reduced significantly. Due to the speed of the flow it was not found necessary to implement an ECO type flow since, once the flow was set up, block level changes could be implemented very quickly from scratch.

Block Level Congestion Analysis

Chip Architect allows the user to generate routing congestion maps based upon a global route. While not the actual final route, it can be shown that the level of detail used in this global route models the final Cadence SE detail route to a high level of accuracy. The global route takes into account the various routing layers available, the number of individual wires that can fit in a given routing channel, and obstructions to routing such as power rails, wiring keepout areas and cell ports.

In the congestion map, areas highlighted as over 100% utilization should be examined carefully. In some cases the detail router will be able to detour around the congested areas to find a solution; in other cases, such as those close to block perimeters or in between large hard macros, even a single point of 100% utilization may cause routing issues later in the flow. In general, the more highly congested the routing area, and the more widespread the congestion, the less likely it is that there will be good correlation between the Chip Architect route estimates and the Cadence SE detail routes.
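The screening rule described above can be sketched as follows. This is a minimal illustration and not the Chip Architect report format: the congestion map is assumed to be available as two hypothetical grids of routing demand and capacity per global routing cell, and cells at or above 100% utilization are flagged, with a marker for those near the block perimeter where detours are hardest.

```python
def congestion_hotspots(demand, capacity, edge_margin=1):
    """Return (row, col, utilization, near_edge) for cells at or above 100%.

    demand and capacity are equally sized 2D lists of track counts per
    global routing cell. Cells with zero capacity (fully blocked, e.g.
    under a power strap) are skipped.
    """
    rows, cols = len(demand), len(demand[0])
    hot = []
    for r in range(rows):
        for c in range(cols):
            if capacity[r][c] == 0:
                continue
            utilization = demand[r][c] / capacity[r][c]
            if utilization >= 1.0:
                near_edge = (
                    r < edge_margin or c < edge_margin
                    or r >= rows - edge_margin or c >= cols - edge_margin
                )
                hot.append((r, c, utilization, near_edge))
    return hot


demand = [[5, 10, 4], [6, 12, 5], [3, 4, 2]]
capacity = [[10, 10, 10], [10, 10, 10], [10, 0, 10]]
hotspots = congestion_hotspots(demand, capacity)
for row, col, utilization, near_edge in hotspots:
    note = " (near block edge)" if near_edge else ""
    print(f"cell ({row},{col}): {utilization:.0%}{note}")
```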



[Congestion map with annotations: Higher Congestion Area; Lower Congestion Areas.]

Figure 7: Block Level Congestion Analysis

The congestion map for one of the larger blocks is shown in Figure 7. It is difficult to interpret in black and white (the actual display in Chip Architect is color coded), but below is a textual report of the congestion.

The report shows the percentage of global routing resources for each layer that fall into a particular utilization range. The rows do not all add up to 100% since fixed obstructions such as power straps are excluded.

[Congestion report table; only the column headings "Metal" and "Peak" are recoverable.]

The histograms below show the results of comparing Chip Architect, FlexRoute, and HyperExtract following top level routing. HyperExtract is taken as the golden data in this case since it performs a full (and slow) 3D extraction.

Based upon the results shown below (and other test cases) we are confident in using the extraction results from FlexRoute in guiding both our top level timing and our overall chip level timing closure.

[Histogram: Chip Architect vs. HyperExtract capacitance delta. X-axis: delta in pF, -1 to 5. Y-axis: number of nets, 0 to 600.]

Figure 9: Chip Architect Vs HyperExtract Capacitance Delta Histogram

As can be seen from the histogram, the Chip Architect extraction shows significant differences from the true HyperExtract data. This is due to the length of the top-level routes and the additional spacing and/or shielding that appears in the routing and is not modeled in the Chip Architect global routes.

[Histogram: capacitance delta. X-axis: delta in pF, -0.6 to 0.3. Y-axis: number of nets, 0 to 1200.]

Figure 10: FlexRoute Vs HyperExtract Capacitance Delta Histogram



The FlexRoute 2.5D extraction matches the HyperExtract results very well. There are still some differences, but these occur only on long nets where the delta is a small percentage of the total capacitance.
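A comparison of this kind can be reproduced from two per-net capacitance tables. The sketch below is only an illustration: the net names and values are hypothetical, and the sign convention (estimate minus golden) is an assumption, since the histograms do not state which way the delta was taken.

```python
import math
from collections import Counter


def delta_histogram(estimated_pf, golden_pf, bin_width_pf=0.1):
    """Bin per-net capacitance deltas (estimated - golden), in pF.

    Returns {lower bin edge: number of nets}, sorted by bin edge.
    """
    counts = Counter()
    for net, golden in golden_pf.items():
        delta = estimated_pf[net] - golden
        counts[math.floor(delta / bin_width_pf)] += 1
    return {round(i * bin_width_pf, 6): n for i, n in sorted(counts.items())}


# Hypothetical per-net totals from a golden 3D extraction and a faster estimate.
golden_pf = {"n1": 1.00, "n2": 2.50, "n3": 0.80}
estimated_pf = {"n1": 1.04, "n2": 2.27, "n3": 0.83}
histogram = delta_histogram(estimated_pf, golden_pf)
for edge, nets in histogram.items():
    print(f"{edge:+.1f} pF: {nets} net(s)")
```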

Clock Tree Methodology

The clock tree methodology planned up front was to create independent clock distribution trees in the hierarchical blocks and a tree at the top level of the design to balance skew between the blocks. For this approach to work, each block was required to have the same insertion delay for a given clock.

The minimum insertion delay for a clock was determined by processing each clock/block combination through Cadence CTGEN. As expected, the largest blocks, with the largest number of clock endpoints, resulted in the largest insertion delay.

A small overhead was added to the minimum insertion delay to ensure repeatability if the number of loads in a block increased, and this insertion delay was then used to create equivalent depth clock trees in the smaller blocks. Since several blocks had multiple clocks, several runs of CTGEN were required initially to build up this information.
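The choice of a common insertion delay per clock can be sketched as below. The data layout, clock and block names, delay values and the 0.1 ns overhead are all hypothetical; the paper states only that a small overhead was added to the largest of the per-block minimum insertion delays.

```python
def insertion_delay_targets(min_delay_ns, overhead_ns=0.1):
    """Pick one target insertion delay per clock.

    min_delay_ns: {clock: {block: minimum achievable insertion delay}}
    The target is the slowest block's minimum plus a small overhead.
    """
    return {
        clock: max(per_block.values()) + overhead_ns
        for clock, per_block in min_delay_ns.items()
    }


# Hypothetical minimum insertion delays from per-block clock tree runs.
min_delay_ns = {
    "clk135": {"blk_a": 1.8, "blk_b": 1.2, "blk_c": 0.9},
    "clk_pci": {"blk_c": 0.9},
}
targets = insertion_delay_targets(min_delay_ns)
for clock, per_block in min_delay_ns.items():
    for block, minimum in per_block.items():
        padding = targets[clock] - minimum
        print(f"{clock}/{block}: target {targets[clock]:.2f} ns, pad {padding:.2f} ns")
```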

As the insertion delays on separate trees were matched, care was taken to make sure that the number of

levels of buffers in the trees remained constant. This helps to maintain insertion delay tracking overprocess temperature and voltage.

With the blocks having equal insertion delays the problem of clock tree insertion at the top level became

one of creating a zero skew network from the clock source (typically a loose standard cell Mux at thetop level) to the clock input pins on each top-level block.

Attempts to use several different automated clock tree methodologies, including CTGEN, produced less than optimal results at the top level. This was concluded to be due to the very large area at the top level combined with sparse placement area availability and the small number of clock loads.

In the end a manual approach was taken to inserting the small tree required for the top-level clock balancing. A Tcl script was developed which allowed the user, using Chip Architect, to manually insert a buffer into a clock net, place and route it in the floorplan, and receive immediate feedback as to the balance of the loads it is driving.

To provide quick feedback, the output net connected to the newly inserted buffer was incrementally routed and a Reduced SPEF (RSPEF) generated for the net and its loads. Since the RSPEF reduces the driven RC network into an equivalent RC load for the driving cell and a set of endpoint delays for the loads on the net, it is a simple task to parse the resulting RSPEF and provide accurate feedback regarding the skew from the inserted buffer to the loads it drives.

In order to speed up the analysis, each of the top-level blocks was replaced by an extracted STAMP model of its internal netlist, or an estimated input pin loading where blocks were incomplete. The use of abstracted models ensures that the RSPEF extraction and subsequent analysis run in a matter of seconds. Without these models the RSPEF extraction takes a significant amount of time, since it traverses the hierarchy to each clock tree end point.
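Once the endpoint delays have been pulled out of the reduced parasitics file, the skew feedback for one inserted buffer comes down to a minimum and a maximum. The following Python sketch shows that last step only. It assumes the delays have already been parsed into a dictionary; the pin names, the values, and the `skew_report` helper are hypothetical, and the original flow did this work in a Tcl script inside Chip Architect.

```python
from typing import Dict, Tuple


def skew_report(endpoint_delays_ps: Dict[str, float]) -> Tuple[float, str, str]:
    """Return (skew, earliest pin, latest pin) for one buffer's fanout.

    endpoint_delays_ps maps each load pin to the delay, in ps, from the
    inserted buffer's output to that pin.
    """
    if not endpoint_delays_ps:
        raise ValueError("no endpoint delays supplied")
    earliest = min(endpoint_delays_ps, key=endpoint_delays_ps.get)
    latest = max(endpoint_delays_ps, key=endpoint_delays_ps.get)
    skew = endpoint_delays_ps[latest] - endpoint_delays_ps[earliest]
    return skew, earliest, latest


# Illustrative values only; these are not measurements from the design.
delays = {"blk_a/clk": 112.0, "blk_b/clk": 130.0, "blk_c/clk": 121.0}
skew, earliest, latest = skew_report(delays)
```

Reporting the earliest and latest pins along with the skew tells the user which branch to lengthen or shorten before moving the buffer again.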

By working backwards from the load points (i.e. the inputs to the block) it was a simple matter to balance each level of the top-level clock tree. Due to the small number of block input pins to be driven, only three levels of clock tree, using 6 buffers, were required for the largest tree.

After the complete tree was assembled in a piecemeal fashion, a full clock tree analysis was run in PrimeTime using post route data from FlexRoute to verify final results.

For the largest block the largest skew value was 280ps, the worst-case top level skew was 40ps, and the worst-case difference in insertion delay for the blocks was 20ps. The final, post route, measured clock skew across the chip was 330ps worst case.
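One simple way to read these numbers is to add the three contributions into a pessimistic bound and compare it with the measured value. The Python sketch below does that arithmetic. Treating the contributions as purely additive is an assumption made here for illustration; the paper does not state how the components combine.

```python
# Figures quoted in the text above, all in picoseconds.
block_skew_ps = 280            # largest skew inside the largest block
top_level_skew_ps = 40         # worst-case skew of the top-level tree
insertion_mismatch_ps = 20     # worst-case insertion delay difference between blocks
measured_chip_skew_ps = 330    # final post route measurement

# Assumed additive bound: all three contributions line up in the worst direction.
additive_bound_ps = block_skew_ps + top_level_skew_ps + insertion_mismatch_ps
margin_ps = additive_bound_ps - measured_chip_skew_ps
```

Under that assumption the bound is 340ps, so the measured 330ps sits just inside it.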



Multi Block Timing Analysis

Since DSPF is not a hierarchical format, the individual DSPEFs for each block and the top level must be merged. Full chip timing analysis is achieved by merging the multiple DSPEF files generated for each block and completing the RC networks at the interface to each block within PrimeTime. Rather than wait for all blocks to be completed, it is useful to perform partial chip timing analysis simply by annotating the top level and those blocks that are available.

For the final timing analysis the top level DSPEF is extracted from the FlexRoute database, while the block level DSPEF files are extracted from a post route Cadence SE database using Synopsys Arcadia or Cadence Hercules.

At intermediate stages of block completion, DSPEF generated from Chip Architect may be used as an accurate estimate for the block level post route DSPEF. As indicated previously, this can be performed with a high degree of confidence that the timing will match post route timing. The script environment used to back annotate the DSPEF used a search path to determine whether to back annotate the actual Cadence DSPEF or, if that was not available, the Chip Architect estimated DSPEF.
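The search path selection described above can be sketched as follows. This is a minimal Python illustration, not the original script environment, which ran as Tcl in PrimeTime. The directory layout, the `.dspef` file suffix, and the `find_parasitics` helper are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_parasitics(block: str, search_path: Iterable[Path],
                    suffix: str = ".dspef") -> Optional[Path]:
    """Return the first parasitics file found for a block, or None.

    Directories are tried in order, so listing the post route directory
    before the estimate directory makes the actual data win when present.
    """
    for directory in search_path:
        candidate = Path(directory) / f"{block}{suffix}"
        if candidate.is_file():
            return candidate
    return None
```

A caller would pass something like `[Path("cadence_actual"), Path("chip_architect_estimate")]` (hypothetical directory names) and annotate whichever file comes back. A `None` result means the block has no parasitics yet and is left out of the partial chip analysis.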

[Figure 11 diagram labels: FlexRoute, Top Level DSPEF, Block Verilog, Block DSPEF, Top Level Verilog, Chip Architect, PrimeTime, Chip Timing Constraints, Full Chip SDF, PrimeTime, Top Level LEF/DEF]

Figure 11: Full Chip Timing Analysis Flow

As each block level DSPEF file is annotated in PrimeTime, an error message is generated for each primary I/O of the block. These messages are generated since, at the time of reading the block, the RC network is incomplete for primary I/Os (i.e. it does not contain a complete driver, RC network, and receiver). After the final top-level DSPEF file is annotated, a final check (using the PrimeTime command report_annotated_parasitics -list_not_annotated) is performed to ensure that all nets, especially those that cross hierarchical boundaries, have been annotated correctly.

Once all of the sub blocks, and the interconnecting top-level design, have been back annotated with DSPEF, the PrimeTime command complete_parasitics is used to merge the interblock networks.

A technique used to speed up multiple chip level timing analysis runs was to back annotate all of the DSPEF files and then write out a full chip SDF file for each operating condition. Subsequent timing analysis runs then save time by simply reading the existing SDF file instead of the relatively slow process of annotating multiple DSPEF files and calculating delay information. Tcl scripts using search paths and file time stamps can be used to automate this process.
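The time stamp check mentioned above amounts to asking whether the cached SDF is at least as new as every parasitics file it was derived from. A minimal Python sketch of that test follows. The `sdf_is_current` helper and the file names in the usage note are hypothetical, and the original automation was written in Tcl.

```python
import os
from typing import Iterable


def sdf_is_current(sdf_path: str, parasitic_files: Iterable[str]) -> bool:
    """Return True if the SDF exists and no parasitics file is newer than it.

    A missing SDF counts as stale. A missing parasitics file raises
    FileNotFoundError from os.path.getmtime, which the caller should handle.
    """
    if not os.path.isfile(sdf_path):
        return False
    sdf_mtime = os.path.getmtime(sdf_path)
    return all(os.path.getmtime(path) <= sdf_mtime for path in parasitic_files)
```

A run script would call, for example, `sdf_is_current("chip_worst.sdf", ["top.dspef", "blk_a.dspef"])` once per operating condition. It reads the SDF when the result is true and otherwise repeats the slower annotate-and-write step.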



Chip Assembly

As individual blocks finish routing and are verified for post route timing, they can be individually verified for DRC/LVS issues in a stand-alone mode.

The final chip assembly operation occurs inside of a GDSII Editor where the individual blocks are merged with the top level GDSII. This particular design used a Synopsys tool (SLE) for GDSII merging, but the same process is easily accomplished in other tools.

Once assembled, it is necessary to perform LVS and DRC runs to check for problems at the interfaces between the top level routing structures and the block level structures.

[Figure diagram label: Layout Editor]

Figure 12: GDSII Merging Process

Summary

Using the flow outlined it has been shown, on a real design, that timing closure can be driven from the top level of a design, and that timing closure need not wait for a full flat GDSII file of the design to perform extraction on.

Using top down physical and timing budget generation, individual SoC blocks can be implemented in relative isolation from each other and still provide timing closure at the chip level when they are re-assembled. The critical top level detailed routing can be completed early in the design flow and the extracted RC information used to drive the block level budgeting and synthesis process. The back end design flow involvement in the timing closure loop is reduced to the detail routing and extraction of the individual SoC blocks.

Adoption of the flow, and the use of Chip Architect, FlexRoute, and Physical Compiler, may require a change in the organization or roles of typical design teams as more of the physical implementation is performed in parallel with the development of the RTL and ideally by the same engineers.
