DAC2000 (C) Monterey Design Systems 1
Physical Design Closure
DAC 2000
Olivier Coudert
Monterey Design Systems
DSM Dilemma
SOC:
- Time to market
- Million gates
- High density, larger die
- Higher clock speeds
- Long wires
- Project management
- Re-use, IPs
- Larger database
- Larger design space
Need abstraction levels to manage complexity
DSM:
- Higher resistance
- Higher cross-coupling
- Non-linear timing
- Power
- Electromigration
- IR drop
- Inductances
- etc.
Require detailed analyses to understand physical interactions
[Figure: axes labeled Abstraction and Accuracy]
Top 10 Impediments to Design Closure
- Strong placement/timing dependency
- Timing/congestion interaction
- Timing signoff
- Signal integrity
- Power design
- Problem size
- Computational resources
- Clock design
- Modeling accuracy
- Marketing hype
Timing & Placement
- Interconnect dominance makes DSM netlist signoff difficult
- Wireload models were ALWAYS inaccurate
  - Post-synthesis signoff was possible when interconnect contributed ~20% of the total capacitance
  - But now the interconnect capacitance is becoming dominant over the total capacitance with each new process generation
[Figure: wire capacitance (fF/um) vs. year, 1992 to 2007; vertical axis from 0 to 0.3]
Long-Wire Problems
- For DSM designs, the metal resistance further complicates timing prediction and closure for the global wires
  - Average long-wire length is not scaling with new technologies since the systems are becoming bigger
[Figure: occurrence rate (normalized) vs. wire length / die size, showing local wires and global wires; ~0.5 is marked on the horizontal axis]
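Why wire resistance matters for long wires can be seen with the Elmore delay of a uniform RC line. The sketch below is a first-order illustration only; the function name and the per-micron values (0.1 ohm/um, 0.2 fF/um) are assumptions chosen for the example, not figures from the slides.

```python
def elmore_wire_delay(length_um, r_per_um, c_per_um, r_driver=0.0, c_load=0.0):
    """Elmore delay in seconds of driver -> uniform RC wire -> lumped load."""
    r_wire = r_per_um * length_um      # total wire resistance (ohm)
    c_wire = c_per_um * length_um      # total wire capacitance (F)
    # The driver charges all of the capacitance; the distributed wire
    # resistance sees, on average, half of the wire capacitance plus the load.
    return r_driver * (c_wire + c_load) + r_wire * (c_wire / 2.0 + c_load)


# Illustrative values only (assumed, not from the slides).
R_PER_UM = 0.1        # ohm per um
C_PER_UM = 0.2e-15    # farad per um

for length in (1000, 2000, 4000):
    delay_ps = elmore_wire_delay(length, R_PER_UM, C_PER_UM) * 1e12
    print(f"{length} um wire: {delay_ps:.1f} ps")
```

With no driver or load, the wire-only term grows with the square of the length, so a global wire that does not shrink with the process becomes steadily harder to predict and close.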
Placement
- Quadratic placement
  - fast
  - restricted cost function, e.g., timing-driven placement mimicked with net weighting
- Simulated annealing
  - open cost function
  - extremely slow
- Force directed
  - semi-open cost function
  - slower than quadratic placement
  - tuning more difficult
- Bisection (mincut + partitioning)
  - open cost function
  - slower than quadratic placement
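To make "timing-driven placement mimicked with net weighting" concrete, here is a minimal one-dimensional sketch of quadratic placement. It minimizes the weighted sum of squared two-pin net lengths by Gauss-Seidel relaxation; the function name, the data layout, and the tiny netlist are all hypothetical, and a production placer would solve a sparse linear system instead.

```python
def quadratic_place_1d(nets, fixed, movable, iters=200):
    """Minimize sum(w * (x_a - x_b)**2) over the movable cells.

    nets:    list of (a, b, weight) two-pin connections
    fixed:   dict name -> x coordinate of fixed pads
    movable: list of movable cell names
    """
    x = dict(fixed)
    x.update({m: 0.0 for m in movable})
    for _ in range(iters):
        for m in movable:
            num = den = 0.0
            for a, b, w in nets:
                if a == m:
                    num += w * x[b]
                    den += w
                elif b == m:
                    num += w * x[a]
                    den += w
            if den > 0:
                # At the optimum each cell sits at the weighted mean of its neighbors.
                x[m] = num / den
    return x


pads = {"left": 0.0, "right": 10.0}
plain = quadratic_place_1d(
    [("left", "u", 1), ("u", "v", 1), ("v", "right", 1)], pads, ["u", "v"])
# Raising the weight of net u-v stands in for "this net is timing critical".
weighted = quadratic_place_1d(
    [("left", "u", 1), ("u", "v", 4), ("v", "right", 1)], pads, ["u", "v"])
print(plain["v"] - plain["u"], weighted["v"] - weighted["u"])
```

The weighted run pulls u and v closer together, which is the only lever the quadratic cost function offers; that is the sense in which its cost function is "restricted".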
Netlist Clustering
- Start placement by building a hierarchical tree of cell clusters from the netlist (hMetis, DAC’97)
- A key to optimal placement is to optimize the size and locations of these clusters
- Both functional hierarchy and netlist topology need to be considered
[Figure: netlist grouped into clusters A through F]
Placement
- The clusters are sized and placed within partitions and among megacells
- Long wires are modeled among partitions, and congestion is approximated within partitions
  - Initially, congestion is dominated by local wires
  - Early wireplanning for long wires will not work
Placement
- This process continues to smaller clusters and smaller partitions
- “Long” wires are not “planned”, but are “placed” probabilistically in terms of where the router is likely to want to route them
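One simple way to picture "placing" a wire probabilistically is to spread its routing demand over the routes a router might plausibly pick. The sketch below assumes a deliberately crude model that is not taken from the slides: each two-pin net uses one of its two L-shaped routes with probability 0.5. All names are hypothetical.

```python
from collections import defaultdict


def segment(a, b):
    """Bins on a straight horizontal or vertical run from a to b, inclusive."""
    (x0, y0), (x1, y1) = a, b
    if y0 == y1:
        step = 1 if x1 >= x0 else -1
        return [(x, y0) for x in range(x0, x1 + step, step)]
    step = 1 if y1 >= y0 else -1
    return [(x0, y) for y in range(y0, y1 + step, step)]


def l_route(src, dst, horizontal_first):
    """Bins crossed by one of the two L-shaped routes between src and dst."""
    corner = (dst[0], src[1]) if horizontal_first else (src[0], dst[1])
    return segment(src, corner) + segment(corner, dst)[1:]


def expected_demand(nets):
    """Expected routing demand per bin; each L-route is taken with probability 0.5."""
    demand = defaultdict(float)
    for src, dst in nets:
        for horizontal_first in (True, False):
            for b in l_route(src, dst, horizontal_first):
                demand[b] += 0.5
    return demand


demand = expected_demand([((0, 0), (2, 1))])
for b in sorted(demand):
    print(b, demand[b])
```

The pins carry a demand of 1.0 and the interior bins 0.5 each, so congestion can be estimated before any wire is committed to a specific track.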
Placement
- One eventually reaches a cluster and partition size for which timing and congestion are predictable
- Timing signoff can be done at this level ONLY!
Placement
- Cells are nonuniformly distributed into bins
  - Dynamic whitespace allocation addresses congestion at the global level
Placement
- Cells are nonuniformly distributed at the subfloorplan level
  - Dynamic whitespace allocation addresses congestion at the global level
- Inter- and intra-partition congestion is predictable at this placement level
Non-Uniform Whitespace Mgmt.
- Example of whitespace allocation after timing-driven placement and optimization
[Figure: placement example annotated with "White space added to relieve congestion", "White space removed to help relieve congestion in other areas", and "Movement of cells for timing optimization"]
Placement
- The generality of the placement algorithm and the common database provide for front-to-back logic optimization, control of wiring, etc.
- These same features provide for powerful ECO capabilities too
  - Netlist can be adjusted via API at all levels of the placement progression
  - Design progress can be viewed and manipulated at every placement level
Timing Prediction
- As the routing models become more precise, so do the timing predictions for the long wires
  - The timing/delay models and analyses are only as precise as the physical information
  - New metrics provide excellent correlation from front-end to back-end
- Intra-partition wiring delays are accurately predicted at this partition size too
Timing Optimization
- The first tech mapping was an approximation, since the wiring capacitances were not known
- With sufficient physical information at the placement level, we begin timing optimization
- Buffers are inserted for shielding, delay and attenuation
Timing Optimization
- Buffers are added only when it is determined that they will not have to be removed
- Global routing is used to place the buffers and inverters
- Long wires are “seeded” by buffers
  - Long wire “design” is driven by accurate physical information
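The benefit of seeding a long wire with buffers can be sketched with the same first-order Elmore model: splitting a wire into equal segments turns one quadratic delay term into several smaller ones, at the price of the buffer delays. Everything below is an assumption for illustration (equal spacing, identical buffers acting as driver and receiver, made-up technology numbers); it is not the tool's buffering algorithm.

```python
def buffered_wire_delay(length_um, n_buffers, r_per_um, c_per_um,
                        r_buf, c_buf, t_buf):
    """Delay of a wire split into n_buffers + 1 equal segments.

    Each segment is driven by a buffer-like driver (output resistance r_buf)
    and loaded by a buffer-like input (capacitance c_buf).
    """
    n_seg = n_buffers + 1
    seg = length_um / n_seg
    r_wire, c_wire = r_per_um * seg, c_per_um * seg
    seg_delay = r_buf * (c_wire + c_buf) + r_wire * (c_wire / 2.0 + c_buf)
    return n_seg * seg_delay + n_buffers * t_buf


def best_buffer_count(length_um, max_buffers=20, **tech):
    """Exhaustively pick the buffer count with the smallest modeled delay."""
    return min(range(max_buffers + 1),
               key=lambda k: buffered_wire_delay(length_um, k, **tech))


# Assumed, illustrative technology numbers.
TECH = dict(r_per_um=0.1, c_per_um=0.2e-15, r_buf=500.0, c_buf=5e-15, t_buf=20e-12)

for length in (100, 10000):
    k = best_buffer_count(length, **TECH)
    print(length, "um ->", k, "buffers")
```

Under these numbers a short local wire is best left unbuffered while a 10 mm wire wants several buffers, which is why the decision needs the physical information that only placement provides.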
Logic Optimization
- “Analytical” approaches
  - Assume continuous “size”
  - Fast
  - Map a continuous solution onto a discrete library
  - Use oversimplified models (e.g., Elmore delay)
- “Refinement” approaches
  - Can use complex and/or discrete models
  - Can mix a wide range of transformations
  - Slower
  - Strategy/control more difficult
Logic Optimization
- Placement provides enough physical information to accurately buffer, resize, remap, resynthesize, etc.
- Yet design is still abstract enough for global exploration
  - E.g.: Logic optimization for “global” congestion relief
  - “Placement” is coarse enough that resizing in one region does not require cells to be moved in another
  - More effective than completing a placement, feeding back custom wireload models, and iterating…
Logic Optimization
Buffering can help in reducing congestion too
- Buffering targets slope fixing and timing
- Several algorithms: slack-, delay-, and slope-driven
[Figure: a critical path among nodes numbered 1 to 8, with a shielding buffer inserted for timing optimization]
Logic Optimization
- More aggressive transformations for critical paths, e.g., logic collapsing and decomposition, logic duplication and logic sharing, logic remapping, logic resynthesis
[Figure: two paths, path 1 and path 2; both are critical]
Logic Optimization
- The generality of the placement algorithm allows logic optimization to continue throughout the flow
  - No net constraints
  - Continual monitoring of “what is critical”
- Includes simple logic restructuring for congestion relief
Clock Distribution
- Most clock tree synthesis algorithms attempt to build the clock tree post-placement
  - This is too late: congestion could disturb timing closure
  - But you can’t build it too early, since you don’t know where the latches are
[Figure: flow stages Synthesis, Floorplanning, Placement, Clock Tree Generation, Clock Routing]
Clock Distribution
- The placement should provide enough information to know the distribution of latches, but should be abstract enough to avoid being trapped by congestion caused by the clock wiring
Clock Distribution
- A first clock tree is created from the distribution of clock pins
  - A complete buffered/gated tree can be automatically synthesized
  - The user has the option to instantiate the top portions of a tree based on the distribution of latches and flip-flops
Clock Distribution
- This clock tree congestion is used to predict the overall congestion, since the latch distribution will not change substantially from this point forward
- As the lower portions of the clock tree continue to grow, the top levels of the tree take root
  - The top levels will continue to adjust slightly as the placement and optimization processes continue
Clock Distribution
- Accurate timing projections enable useful skew methods to be applied at this level
- Placement is still coarse enough so that objects with common-skew targets can be grouped
Power/Ground Distribution
- The placement also provides sufficient information to judge the quality and integrity of the power/ground network
  - Power/ground network can have a huge impact on congestion
- Power rail currents will not change much as the placement is refined
- Yet there is enough space to add/widen stripes
  - API-driven adjustment using incremental IR-drop analyses
  - Ultimately this optimization process can be automated
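A rough feel for why widening a stripe helps can be had from a one-dimensional IR-drop estimate. The sketch below assumes a single stripe fed from one end with evenly spaced current taps; the function name, the sheet resistance, and the tap currents are illustrative assumptions, and a real analysis would solve the full power grid.

```python
def ir_drop_along_stripe(tap_currents, seg_len_um, width_um, sheet_res_ohm_sq):
    """Voltage drop (V) at each tap of a stripe fed from one end.

    tap_currents: current (A) drawn at each tap, ordered from the feed outward.
    """
    r_seg = sheet_res_ohm_sq * seg_len_um / width_um   # resistance between taps
    drops, drop = [], 0.0
    downstream = sum(tap_currents)
    for i_tap in tap_currents:
        # The segment leading to this tap carries the current of every
        # tap at or beyond it.
        drop += downstream * r_seg
        drops.append(drop)
        downstream -= i_tap
    return drops


taps = [1e-3] * 4                       # four taps drawing 1 mA each (assumed)
narrow = ir_drop_along_stripe(taps, 100.0, 2.0, 0.05)
wide = ir_drop_along_stripe(taps, 100.0, 4.0, 0.05)
print(f"worst drop, 2 um stripe: {narrow[-1] * 1e3:.1f} mV")
print(f"worst drop, 4 um stripe: {wide[-1] * 1e3:.1f} mV")
```

Because rail currents change little as placement is refined, an estimate of this kind made early remains a reasonable guide for where to add or widen stripes.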
Power/Ground Distribution
- Eventually the automation process will have to consider more detailed analysis too:
  - Inductance of chip and packaging
  - Resonance frequencies via AC analyses
  - On-chip decoupling
Model refinement
- Once the quadrisection level’s results are acceptable, we proceed with a similar partition-based placement strategy
- The cost function includes timing, area, congestion, power, and eventually xtalk and signal integrity
  - There are no timing constraints fed forward!
- Logic optimization, buffering, whitespace allocation, etc., all continue on a more local scale
Design Closure
[Figure: flow over time with labels System, RTL, Synthesis, then Place, Logic opt., Timing, and Route, then Extraction & delay calculation and Final static timing analysis. Model accuracy increases over time while the transformation scale moves from global/estimate to local/accurate.]
Continuity and correlation are keys!
Pre-DSM Design Flow
[Figure: flow diagram split into a logical domain and a physical domain. Labels: RTL, Synthesis, statistical WLM, timing library, Netlist Signoff, custom WLM, Placement, Clock Tree, Routing, Extraction, Delay Calculation, (SDF, RC’s), Static Timing Analysis]
DSM Design Signoff
Design signoff can only be done when DSM timing & congestion can be properly estimated: physical prototype level
[Figure: flow diagram with labels RTL, Synthesis + opt. floorplan, Place, Remap, Timing, Route, Delay Calculation, Static Timing Analysis, and Physical implementation. Annotations: "No physical information at that level" and "Timing, congestion, clock, etc., predictable at that level"]
Routing
- Requirements for the DSM router:
  - N-layer shape-based router
  - Supports gridless and gridded routing
  - Variable wire width for optimal delay constraints
  - Cross-talk avoidance, antenna effects
  - Clock tree sizing for tree balancing
  - Power routing sizing for voltage drop and electromigration
Routing Correlation
- Global routing can utilize the whitespace to avoid long-distance couplings for critical nets
  - Extra spacing, shielding, or space for rip-up and reroute
- No surprises for the detailed router after GR
  - Shape-based gridless area router
  - Timing and xtalk aware
  - Spacing wires nonuniformly within and among layers is handled without loss of generality
  - Router capabilities are also critical for delay optimization and satisfying reliability constraints
Crosstalk
Coupling vs. Inter-layer capacitance
[Figure: Cc/Cs vs. year (1997, 2001, 2006, 2009, 2012); vertical axis from 0 to 8. Source: 1998 Update, International Technology Roadmap for Semiconductors]
- Fact: the same-layer coupling capacitance is beginning to dominate the total net capacitance
  - Makes cross-talk a dominant factor in achieving timing closure
Crosstalk
- Neighboring-net switching can cause DR surprises
  - Trying to solve this problem at DR is far too late!
- Passing constraints to detailed routing to avoid routing certain nets in parallel is easy, but DR is already overconstrained!
- The right way is to attack the xtalk problem starting at the proper placement level
Crosstalk Delay Impact
- Simply modeling the coupling capacitance as grounded capacitance scaled by ~2x is overly pessimistic
- Timer should model early and late arrival times at all nodes (for each library) so that worst/best case switching can be determined during path traversal
  - TACO: Timing Analysis with Coupling (DAC 2000)
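The pessimism of a blanket ~2x scaling can be illustrated with switching windows. The sketch below is not the TACO algorithm; it is a common textbook simplification, assumed here for illustration, in which a coupling capacitance counts 2x (late analysis) or 0x (early analysis) only when the aggressor's switching window overlaps the victim's, and 1x otherwise. All names are hypothetical.

```python
def windows_overlap(a, b):
    """True if the closed time intervals a = (early, late) and b overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def effective_cap(c_ground, couplings, victim_window, mode="late"):
    """Effective load of a victim net under a simple Miller-factor model.

    couplings: list of (coupling_cap, aggressor_window) pairs.
    mode:      "late" assumes overlapping aggressors switch the opposite way
               (factor 2); "early" assumes they switch the same way (factor 0).
    """
    total = c_ground
    for c_couple, aggressor_window in couplings:
        if windows_overlap(victim_window, aggressor_window):
            factor = 2.0 if mode == "late" else 0.0
        else:
            factor = 1.0          # aggressor is quiet while the victim switches
        total += factor * c_couple
    return total


couplings = [(5.0, (0.0, 1.0)), (5.0, (3.0, 4.0))]   # capacitance in fF, windows in ns
victim = (0.5, 2.0)
print("late :", effective_cap(10.0, couplings, victim, "late"))
print("early:", effective_cap(10.0, couplings, victim, "early"))
print("blanket 2x:", 10.0 + 2.0 * sum(c for c, _ in couplings))
```

Only the first aggressor can switch together with the victim, so the late-mode load is smaller than the blanket 2x value; this needs early and late arrival times at every node, as the slide requires of the timer.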
Electromigration
- During clock-tree synthesis, top-level wires are automatically sized to satisfy E/M constraints
- Below 0.25um we expect similar constraints for signal nets
  - Don’t wait until DR to determine layer assignments or find extra space for wide wires
  - The wire sizes and layers should be modeled at the earliest possible placement level
IR drop
- P = Pnet + Pint + Pleak
- Simulation and/or probabilistic based dynamic power evaluation
- Power distribution at the chip level, along with the quadrisection level
- Consequently power distribution can be optimized along with the other design variables
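A probabilistic evaluation of the net term can be sketched as follows. The sketch reads Pnet as the switching power of the net capacitances, Pint as cell-internal power, and Pleak as leakage, which is an interpretation of the slide's symbols rather than something it states; the standard 0.5 * C * V^2 energy per transition is used, and the per-net toggle rates and all numbers are assumed.

```python
def net_switching_power(nets, vdd, freq_hz):
    """Expected switching power (W) of the net capacitances.

    nets: list of (capacitance_F, toggle_rate), where toggle_rate is the
          expected number of transitions per clock cycle.
    Each transition dissipates 0.5 * C * Vdd**2.
    """
    return sum(0.5 * cap * vdd ** 2 * freq_hz * toggles for cap, toggles in nets)


def total_power(p_net, p_int, p_leak):
    """P = Pnet + Pint + Pleak."""
    return p_net + p_int + p_leak


# Assumed example: two nets at 1.8 V and 100 MHz.
p_net = net_switching_power([(1e-12, 0.2), (2e-12, 0.1)], vdd=1.8, freq_hz=100e6)
print(f"Pnet = {p_net * 1e6:.2f} uW")
print(f"P    = {total_power(p_net, 20e-6, 1e-6) * 1e6:.2f} uW")
```

Since the net capacitances come from the placement-level wire estimates, this term can be re-evaluated as the design is refined and optimized alongside the other design variables.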
Moore’s Law: Tapering off?
[Figure: thousands of transistors (log scale, 1 to 100000) vs. year (1970 to 2005) for the 4004, 8086, 80286, 80386, 80486, Pentium, P.Pro, and Merced; trend annotations "2x in 2 years" and "2.5 years"]
Parallel Processing
- Parallel processing: the process of breaking a problem into multiple pieces and executing them simultaneously
- Speedup: let T(n) = wall-clock time for executing the original task on n processors. Then speedup is T(1)/T(n)
- Speedup depends on:
  - load balancing
  - inter-processor communication
  - scheduling
Load balancing
- Objectives
  - Evenly balance computational loads among available processors
  - Minimize inter-processor communication
- This is a hard problem
  - Load balancing is NP-complete!
  - The time taken to load-balance should be a small fraction of the total process time
Job Scheduling
- 10 independent jobs, 2 processors: p q p p p q p p p p
- T(p) = 1, T(q) = 10
- Serial runtime: 8 * 1 + 2 * 10 = 28
[Figure: thread scheduling chart for threads T1 and T2, time axis marked at 10 and 20, with a "Reschedule" label]
- Poor scheduling is detrimental for speedup: S(2) = 1.47
Improving Job Scheduling
- 10 independent jobs, 2 processors: p q p p p q p p p p
[Figure: thread scheduling chart for threads T1 and T2, time axis marked at 10 and 20]
- Simple scheduling algorithms improve speedup: S(2) = 2.00
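The S(2) = 2.00 figure can be reproduced with one of the simplest scheduling rules: hand each job, in order, to whichever processor becomes free first. The sketch below uses the slide's job list and durations; the function names are hypothetical, and it does not attempt to reconstruct the poorly scheduled case, whose exact assignment is not recoverable from the chart.

```python
import heapq


def list_schedule(durations, n_proc):
    """Greedy list scheduling; returns the makespan (parallel wall-clock time).

    Each job, taken in the given order, goes to the earliest-free processor.
    """
    finish_times = [0.0] * n_proc
    heapq.heapify(finish_times)
    for d in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + d)
    return max(finish_times)


def speedup(durations, n_proc):
    """T(1) / T(n) for independent jobs under list scheduling."""
    return sum(durations) / list_schedule(durations, n_proc)


P, Q = 1, 10                                  # T(p) = 1, T(q) = 10
jobs = [P, Q, P, P, P, Q, P, P, P, P]         # p q p p p q p p p p
print("serial runtime:", sum(jobs))
print("makespan on 2 processors:", list_schedule(jobs, 2))
print("S(2) =", speedup(jobs, 2))
```

Both processors finish at time 14, so the serial runtime of 28 is halved. Note that this rule is a heuristic: optimal load balancing in general remains NP-complete.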
Inter-job Communication
[Figure: jobs 1, 2, ..., k, each made of parts p and q]
Reducing job-communication improves scaling
Partition the problem!
Global Routing
- “Global” doesn’t lend itself to parallelism
- “q” is a very big portion of each task
  - q = updating of “global” congestion map
- Quality vs. speed trade-off:
  - Lazy update
  - Multi-level partitioning
  - ...
[Figure: jobs 1, 2, ..., k, each made of parts p and q]
Global Routing: Lazy Update
- Algorithm:
  - Each parallel task gets a list of nets to be routed
  - While routing a net, the “global” congestion map represents an earlier state
  - After a while, routing stops and the congestion map is updated
- Cons:
  - Quality degradation
  - Possibility of slowing convergence due to delayed congestion map
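The lazy update scheme and its drawbacks can be shown with a toy model. In the sketch below, every task routes against a snapshot of the congestion map taken at the start of a round, and the shared map is updated only at the end of the round. The names, the route-selection rule, and the two-bin example are all hypothetical, and the tasks run sequentially here only to keep the example deterministic.

```python
def pick_less_congested(net, snapshot):
    """A net is a tuple of candidate routes; a route is a tuple of bins."""
    return min(net, key=lambda route: sum(snapshot.get(b, 0) for b in route))


def route_round_lazy(task_nets, congestion, choose_route=pick_less_congested):
    """One lazy-update round over all tasks; mutates and returns congestion."""
    snapshot = dict(congestion)            # stale view shared by all tasks
    chosen = []
    for nets in task_nets:                 # conceptually, one thread per task
        for net in nets:
            chosen.append(choose_route(net, snapshot))
    for route in chosen:                   # synchronization point: update the map
        for b in route:
            congestion[b] = congestion.get(b, 0) + 1
    return congestion


net = (("A",), ("B",))                     # may be routed through bin A or bin B
tasks = [[net], [net]]                     # two tasks, one net each
congestion = {}
route_round_lazy(tasks, congestion)
print("after round 1:", congestion)
route_round_lazy(tasks, congestion)
print("after round 2:", congestion)
```

In round 1 both tasks see an empty map and pile onto bin A, a small instance of the quality degradation listed above; in round 2 both react to the same stale overflow and jump to bin B together, which is how a delayed map can slow convergence.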
Global Routing: Multi-level partitioning
- Algorithm:
  - Divide routing area into partitions; at each level, partitions are non-overlapping
  - Levels could be 1x1, 2x2, 3x3, 5x5 ...
  - At each level, routing within partitions can be threaded
Detail Routing
- Detail router optimizes local interaction of routes
- Localized, thus simple partition-based threading scheme:
  - Divide chip into small partitions
  - Instantiate router on partitions in parallel
- Quasi-linear speed-up
Detail Routing
- In reality, partitions will be overlapping
  - Better quality near partition boundaries
  - Cannot route adjacent partitions concurrently
  - To minimize locks, need a scheduler
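One lock-free way to respect the "no adjacent partitions concurrently" rule is a static schedule that groups partitions into phases by row and column parity. This is offered as an illustrative alternative under assumed conditions (a regular grid, overlap reaching only into the eight neighboring partitions); the slides do not say which scheduler is actually used, and the names are hypothetical.

```python
def parity_phases(rows, cols):
    """Group grid partitions into phases such that no two partitions in the
    same phase touch, not even diagonally."""
    groups = {}
    for r in range(rows):
        for c in range(cols):
            groups.setdefault((r % 2, c % 2), []).append((r, c))
    return list(groups.values())


def adjacent(a, b):
    """True if two distinct partitions share an edge or a corner."""
    return a != b and abs(a[0] - b[0]) <= 1 and abs(a[1] - b[1]) <= 1


for phase_no, phase in enumerate(parity_phases(4, 4), start=1):
    # Every partition listed in one phase could be routed by its own thread.
    print(f"phase {phase_no}: {phase}")
```

Two partitions in the same phase differ by at least 2 in row or column, so their overlapping halos never meet and each phase can run fully in parallel. The cost is a barrier between phases, which a dynamic scheduler could avoid at the price of more bookkeeping.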
Speedup (n=4)
[Figure: bar chart of speedup on 4 processors (axis from 1 to 4) for each of the following steps]
- Global placement
- Congestion modeling
- Place - Logic interaction
- Sizing
- Buffering
- Technology mapping
- Static Timing Analysis (with crosstalk)
- Clock generation
- Power topology construction
- Detailed placement
- Global routing (with crosstalk)
- Shape-based detailed routing (with crosstalk)