CSE241 L3 ASICs.1 Kahng & Cichy, UCSD ©2003
CSE241 VLSI Digital Circuits
Winter 2003
Lecture 06: Timing
This Class + Logistics
Timing
Flip-flop timing
Clock distribution
Clock tree synthesis
Reading: White papers on static timing analysis, papers on clock tree synthesis
Lab #2 due date: Monday January 27th
Slide courtesy of S. P. Levitan, U. Pittsburgh
Review
Static timing analysis (Lecture 4)
Pin-based timing graph
Directed acyclic graph (DAG) of timing arcs
Longest path in a DAG can be found in time linear in #arcs (edges)
Slack = required arrival time – actual arrival time (long path analysis)
Logic synthesis (Lecture 5)
Slide courtesy of S. P. Levitan, U. Pittsburgh
Static Analysis vs. Dynamic Analysis
[Figure: gate with inputs a, b, c and output z]
a-to-z delay as a function of the side inputs b and c:
        c=0          c=1
b=0     a-z delay1   a-z delay2
b=1     a-z delay3   a-z delay4
Why static analysis when dynamic simulation is more accurate?
Drawbacks of simulation
Requires input vectors (stimuli for circuit)
Long runtimes
Example: calculate worst-case rising delay from a to z
Exponential explosion with number of possible design input states
STA Terminology
[Figure: voltage waveform vs. time with the 10%, 50%, and 90% of Vdd points marked]
(Actual) arrival time (AAT, or AT) = time at which a pin switches state
Usually 50% point on voltage curve, i.e., AT = t50
Slew time = time over which signal switches
Usually difference between 10% and 90% on voltage curve, i.e., tslew = t90 – t10
Required arrival time (RAT) = time by which a signal must arrive in order to avoid a chip failure
Slack = RAT – AAT
Positive slack good (= margin), negative slack bad
Example: What is slack at PO?
[Figure: timing graph annotated with arc delays (d) and arrival times (at); primary inputs have at=0, and the primary output has at=11 and rat=10]
Slack = rat – at = 10 – 11 = -1
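The slack computation can be sketched in a few lines. This is a minimal illustration, not the course's tool: the function name `arrival_times` and the four-arc graph are hypothetical (the topology of the slide's figure is not recoverable from the text), and every pin without fan-in is assumed to be a primary input with a given arrival time.

```python
from collections import defaultdict, deque

def arrival_times(arcs, input_at):
    """Late-mode arrival times: longest path from the primary inputs.

    arcs: list of (src, dst, delay); input_at: dict of primary-input arrival times.
    Assumption: every pin without fan-in appears in input_at.
    """
    fanout = defaultdict(list)
    pending = defaultdict(int)          # unprocessed fan-in arcs per pin
    for src, dst, delay in arcs:
        fanout[src].append((dst, delay))
        pending[dst] += 1
    at = dict(input_at)
    ready = deque(input_at)             # pins whose arrival time is final
    while ready:                        # each arc is visited exactly once
        pin = ready.popleft()
        for dst, delay in fanout[pin]:
            at[dst] = max(at.get(dst, float("-inf")), at[pin] + delay)
            pending[dst] -= 1
            if pending[dst] == 0:
                ready.append(dst)
    return at

# Hypothetical graph: two gates, primary output "po" with required time 4.
arcs = [("a", "n1", 2), ("b", "n1", 1), ("n1", "po", 3), ("c", "po", 5)]
at = arrival_times(arcs, {"a": 0, "b": 0, "c": 0})
rat_po = 4
slack_po = rat_po - at["po"]            # Slack = RAT - AAT
```

Because each arc is relaxed once, the work is linear in the number of arcs, which is the property quoted in the review slide.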
Example: Incremental Timing Analysis
[Figure: the same timing graph after a local change (updated arcs with d=1); arrival times are recomputed only where affected, giving at=10 at the primary output against rat=10]
Slack = 0
Amount of work is bounded by sizes of fanin, fanout cones of logic
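A rough sketch of the idea for arrival times only: after one arc delay changes, only pins in the fanout cone of that arc are re-evaluated, and propagation stops wherever the arrival time does not change. The function name and the small graph are hypothetical; a real timer would keep the fan-in/fan-out maps persistently instead of rebuilding them, and would repair required times through the fanin cone in the same way.

```python
def update_arrival_times(arcs, at, changed_arc, new_delay):
    """Change one arc delay and repair late-mode arrival times in its fanout cone.

    arcs: dict mapping (src, dst) -> delay; at: dict of current arrival times.
    Returns the set of pins whose arrival time was re-evaluated.
    """
    arcs[changed_arc] = new_delay
    fanin, fanout = {}, {}              # rebuilt here only to keep the sketch short
    for (src, dst), delay in arcs.items():
        fanin.setdefault(dst, []).append((src, delay))
        fanout.setdefault(src, []).append(dst)
    worklist = [changed_arc[1]]
    visited = set()
    while worklist:
        pin = worklist.pop()
        visited.add(pin)
        new_at = max(at[src] + delay for src, delay in fanin[pin])
        if new_at != at[pin]:           # propagate only if something changed
            at[pin] = new_at
            worklist.extend(fanout.get(pin, []))
    return visited

arcs = {("a", "n1"): 2, ("b", "n1"): 1, ("n1", "po"): 3,
        ("c", "po"): 5, ("c", "n2"): 1}
at = {"a": 0, "b": 0, "c": 0, "n1": 2, "po": 5, "n2": 1}
touched = update_arrival_times(arcs, at, ("a", "n1"), 4)
```

In this example the pin n2 is outside the fanout cone of the changed arc and is never visited.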
Early-Mode Analysis
Definitions change as follows
RAT = lower bound on arrival time
Propagate shortest possible instead of longest possible delays
Slack = Arrival – Required
[Figure: small circuit with nodes a, b, c, x, y and gate delays of 1, annotated with the values below]
AT_a = 0, SL_a = 0 – 0 = 0
AT_b = 1, SL_b = 1 – 0 = 1
AT_c = 0, SL_c = 0 – 1 = -1
AT_x = 1, RAT_x = 2, SL_x = 1 – 2 = -1
AT_y = 1, SL_y = 1 – 1 = 0
Example: negative slack because AT_c is too small (early)
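The early-mode rules can be checked with a short sketch. The circuit below is an assumption: it is one topology consistent with the numbers on the slide (a and b drive y, y and c drive x, each gate with delay 1, RAT at x equal to 2), since the figure itself did not survive. The function name is hypothetical, and the arcs must be listed in topological order.

```python
def early_mode(arcs, input_at, output_rat):
    """Early-mode analysis; arcs = [(src, dst, delay)] in topological order."""
    at = dict(input_at)
    for src, dst, delay in arcs:            # forward: shortest arrival
        at[dst] = min(at.get(dst, float("inf")), at[src] + delay)
    rat = dict(output_rat)
    for src, dst, delay in reversed(arcs):  # backward: RAT is a lower bound
        rat[src] = max(rat.get(src, float("-inf")), rat[dst] - delay)
    slack = {pin: at[pin] - rat[pin] for pin in at}   # Slack = Arrival - Required
    return at, rat, slack

# Assumed topology: y = g(a, b), x = g(y, c), all delays 1.
arcs = [("a", "y", 1), ("b", "y", 1), ("y", "x", 1), ("c", "x", 1)]
at, rat, slack = early_mode(arcs, {"a": 0, "b": 1, "c": 0}, {"x": 2})
```

Under this assumed topology the sketch reproduces the slide's slacks, including the negative slack at c and x caused by c arriving early.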
Enhancements of STA
Incremental timing analysis
Nanometer-scale process effects – variation (probabilistic timing analysis)
Interference – crosstalk
Multiple inputs switching
Conservatism of delay propagation
HW #8: Suppose you change the size of one (combinational) gate in your design, thus invalidating the previous timing analysis. How much work must be done to regain a correct timing analysis?
Courtesy K. Keutzer et al. UCB
Timing Correction
Driven by STA
“Incremental performance analysis backplane”
Fix electrical violations
Resize cells
Buffer nets
Copy (clone) cells
Fix timing problems
Local transforms (bag of tricks)
Path-based transforms
DAC-2002, Physical Chip Implementation
Local Synthesis Transforms
Resize cells
Buffer or clone to reduce load on critical nets
Decompose large cells
Swap connections on commutative pins or among equivalent nets
Move critical signals forward
Pad early paths
Area recovery
DAC-2002, Physical Chip Implementation
Transform Example
Double Inverter Removal
[Figure: a path through two back-to-back inverters with Delay = 4; after removing the inverter pair, Delay = 2]
DAC-2002, Physical Chip Implementation
Resizing
[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, and C]
[Figure: gate with inputs a, b driving sinks d, e, f with loads 0.2, 0.2, and 0.3; which size should be used?]
Size A: delay 0.035
Size C: delay 0.026
DAC-2002, Physical Chip Implementation
Cloning
[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, and C]
[Figure: gate with inputs a, b driving five sinks d, e, f, g, h with a load of 0.2 each; after cloning, two copies of the gate (A and B) share the sinks]
DAC-2002, Physical Chip Implementation
Buffering
[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, and C]
[Figure: gate with inputs a, b driving five sinks d, e, f, g, h with a load of 0.2 each; after buffering, a buffer B drives part of the fanout and presents a load of 0.1 to the original gate]
DAC-2002, Physical Chip Implementation
Redesign Fan-in Tree
Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0; each gate has delay 1
[Figure: original fan-in tree over a, b, c, d with Arr(e)=6]
[Figure: redesigned tree that combines c and d first, then b, then a, with Arr(e)=5]
DAC-2002, Physical Chip Implementation
Redesign Fan-out Tree
[Figure: fan-out tree before redesign, with stage delays of 1 and 3; Longest Path = 5]
[Figure: fan-out tree after redesign, with one buffer delay rising from 1 to 2 (slowdown of buffer due to load); Longest Path = 4]
DAC-2002, Physical Chip Implementation
Decomposition
DAC-2002, Physical Chip Implementation
Swap Commutative Pins
[Figure: a gate whose input pins have different delays, fed by signals a, b, c with different arrival times; the two pin assignments shown give output arrival times of 3 and 5]
Simple sorting on arrival times and delay works
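A minimal sketch of the sorting idea, assuming each pin has a fixed pin-to-output delay and all pins are logically interchangeable: connect the latest-arriving signal to the fastest pin, the next latest to the next fastest, and so on. The function name, pin names, and numbers are hypothetical.

```python
def assign_pins(arrivals, pin_delays):
    """Pair signals with commutative pins to minimize the output arrival time.

    arrivals: signal -> arrival time; pin_delays: pin -> pin-to-output delay.
    """
    signals = sorted(arrivals, key=arrivals.get, reverse=True)   # latest first
    pins = sorted(pin_delays, key=pin_delays.get)                # fastest first
    assignment = dict(zip(signals, pins))
    output_at = max(arrivals[s] + pin_delays[p] for s, p in assignment.items())
    return assignment, output_at

arrivals = {"a": 2, "b": 1, "c": 0}
pin_delays = {"p1": 1, "p2": 1, "p3": 2}
assignment, output_at = assign_pins(arrivals, pin_delays)
# A poor choice for comparison: the latest signal on the slowest pin.
worst_at = max(arrivals["a"] + pin_delays["p3"],
               arrivals["b"] + pin_delays["p2"],
               arrivals["c"] + pin_delays["p1"])
```

Here the sorted assignment puts the early signal c on the slow pin p3, so the slow pin no longer determines the output arrival time.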
DAC-2002, Physical Chip Implementation
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Why Clocks?
Clocks provide the means to synchronize
By allowing events to happen at known timing boundaries, we can sequence these events
Greatly simplifies building of state machines
No need to worry about variable delay through combinational logic (CL)
All signals delayed until clock edge (clock imposes the worst case delay)
[Figure: Dataflow: registers with combinational logic between them; FSM: combinational logic with a register in a feedback loop]
Clock Cycle Time
Cycle time is determined by the delay through the CL
Signal must arrive before the latching edge
If too late, it waits until the next cycle
- Synchronization and sequential order become incorrect
tcycle > tprop_delay + toverhead
Can change circuit architecture to obtain smaller tcycle
Pipelining
Parallelism
Pipelining
For dataflow:
Instead of a long critical path, split the critical path into chunks
Insert registers to store intermediate results
This allows 2 waves of data to coexist within the CL
Can we extend this ad infinitum?
Overhead eventually limits the pipelining
- E.g., 1.5 to 2 gate delays for latch or FF
Granularity limits as well
- Minimum time quantum: delay of a gate
[Figure: unpipelined: register, CL (A+B) with delay tpd, register; pipelined: register, CL A with delay tpd1, register, CL B with delay tpd2, register]
Unpipelined: tcycle > tpd + toverhead
Pipelined: tcycle > max(tpd1, tpd2) + toverhead
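The diminishing return from deeper pipelining follows directly from the bound above. The sketch below uses hypothetical numbers in units of gate delays (10 gate delays of logic, 2 gate delays of register overhead) and a hypothetical helper name.

```python
def cycle_time_bound(stage_delays, overhead):
    """Lower bound on tcycle: slowest stage plus per-register overhead."""
    return max(stage_delays) + overhead

unpipelined = cycle_time_bound([10], 2)        # one stage of 10 gate delays
two_stage = cycle_time_bound([5, 5], 2)        # balanced split into 2 stages
ten_stage = cycle_time_bound([1] * 10, 2)      # one gate per stage
```

With these numbers, ten stages improve the bound by a factor of 4 rather than 10, because the overhead does not shrink with the stage delay.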
Parallelism
For FSMs:
Same functionality and performance can be achieved at half the clock rate
However, the input and output signals must be doubled (to account for the outputs for each original cycle)
Instead of doubling the delay, the optimized logic is often logarithmically related to the degree of parallelism
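[Figure: original FSM with one CL block (delay tpd) and an M-bit register; unrolled FSM with CL blocks in series (each of delay tpd) before the register and 2*M-bit inputs/outputs; unrolled FSM with optimized CL (Opt. CL)]
tcycle1 > tpd + tov
tcycle2 > N tpd + tov
tcycle3 > log(N tpd) + tov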
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Storage Elements
Latches
Level sensitive – transparent when H, hold when L
Flip-flops
Edge-triggered – data is sampled at the clock edge
[Figure: latch and flip-flop schematics (signals d, ck, ckb, q, p_q) with ck, d, q waveforms]
Latch and Flip-Flop Gates
[Figure: gate-level schematics of an active high latch (in, out, enable) and a rising edge flip-flop (D, Q, QN, clock)]
Latch and flip-flop schematics from TSMC 0.13um LV Artisan Sage-X Standard Cell Library.
Latch and Flip-Flop Behavior
[Figure: active high latch and rising edge flip-flop (D, Q, QN), each shown when clock is high and when clock is low]
Active high latch: tDQ, 2 inverter delays
Rising edge flip-flop: tCQ, 4 inverter delays
Clock Skew and Jitter
[Figure, panels (a) to (c): clock waveforms at points A and B illustrating the three effects below]
Clock skew: offset tsk,AB between the clock at A and the clock at B
Cycle-to-cycle edge jitter: each edge may move by tj/2, so a period can be as short as T – tj
Duty cycle jitter: the high phase can be as short as Thigh – tduty
Flip-Flop Timing Characteristics
Rising edge flip-flop
[Figure: flip-flops A and B with combinational logic between them, with ideal and non-ideal clock waveforms]
Setup time constraint: T must cover tCQ,max + tcomb,max + tsu + tsk + tj
Hold time constraint: tCQ,min + tcomb,min must cover th + tsk
Latch Setup Time and Transparency
Active high latch
[Figure: latches A and B with combinational logic between them, with ideal and non-ideal clock waveforms]
Setup time constraint: timing segments tCQ, tcomb,max, tsu, tsk + tj, and tduty
Transparency: data passing through an open latch sees tDQ, then tcomb, then tDQ again
No penalty to clock period for setup time constraint!
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Setup Time
Important characteristics of storage elements
Setup time, hold time, clock-to-q delay
Setup time, tsu
Time before the clock edge that the data must arrive in order for the new data to be stored
The setup time for a F/F occurs before the latching edge.
The setup time for a Latch occurs before the transition from transparent to hold
[Figure: waveforms of ck, d, q with tsetup marked before the clock edge]
Hold Time
A second important characteristic is the hold time, th
Time after the clock edge that the data must remain stable in order for the data to be properly held
Note that Hold time (and Setup time) can be negative
Why isn’t hold time just the negative of setup time?
Storage elements typically have some data dependence
- Capacitances, and devices may be faster for one data value versus another
Specify the worst case for process technology and operating condition variations
[Figure: waveforms of ck, d, q with thold marked after the clock edge]
Clocking Overhead
Inherent delay in any storage element
The delay is measured from
Clock transition to Output data transition, tc2q
Input data transition to Output data transition, td2q
Flip-flop is edge triggered
The overhead is tc2q + tsu
Latch is level-sensitive
The overhead is td2q
[Figure: waveforms of ck, d, q with tc2q and td2q marked]
Clock Skew
Most “high-profile” of clock network metrics
Maximum difference in arrival times of clock signal to any 2 latches/FF’s fed by the network
Skew = max | t1 – t2 |
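Since the maximum pairwise difference equals the latest arrival minus the earliest, skew can be computed in one pass. The sink names and arrival times (in ps) below are hypothetical.

```python
def clock_skew(sink_arrivals):
    """Skew = max |t_i - t_j| over all sink pairs = latest - earliest arrival."""
    times = list(sink_arrivals.values())
    return max(times) - min(times)

def clock_latency(sink_arrivals):
    """Largest source-to-sink delay (the source is taken as time 0)."""
    return max(sink_arrivals.values())

sinks_ps = {"ff1": 510, "ff2": 545, "ff3": 498}   # hypothetical arrival times
skew_ps = clock_skew(sinks_ps)
latency_ps = clock_latency(sinks_ps)
```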
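[Figure: clock source (ex. PLL) driving CLK1 and CLK2; waveforms vs. time show the latency from the source to the sinks (arrival times t1, t2) and the skew between CLK1 and CLK2]
Fig. from Zarkesh-Ha
Sylvester / Shepard, 2001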
Clock Skew Causes
Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths
Process variation – process spread across die yielding different Leff, Tox, etc. values
Temperature gradients – changes MOSFET performance across die
IR voltage drop in power supply – changes MOSFET performance across die
Note: Delay from clock generator to fan-out points (clock latency) is not important by itself
BUT: increased latency leads to larger skew for same amount of relative variation
Sylvester / Shepard, 2001
Clock Jitter
Clock network delay uncertainty
From one clock cycle to the next, the period is not exactly the same each time
Maximum difference in phase of clock between any two periods is jitter
Must be considered in max path (setup) timing; typically O(50ps) for high-end designs
Sylvester / Shepard, 2001
Clock Jitter Causes
PLL oscillation frequency
Various noise sources affecting clock generation and distribution
E.g., power supply noise dynamically alters drive strength of intermediate buffer stages
Jitter reduced by minimizing IR and L*(di/dt) noise
Courtesy Cypress Semi
Sylvester / Shepard, 2001
Clocking Methodology (Edge-Triggered)
Max(tpd) < tper – tsu – tc2q – tskew
- If violated: delay is too long for data to be captured
Min(tpd) > th – tc2q + tskew
- If violated: delay is too short and data can race through, skipping a state
[Figure: flip-flops with combinational logic between them, clocked with period tper]
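The two inequalities can be turned into a quick slack check. This is a sketch with a hypothetical function name and hypothetical numbers in ps; positive slack means the corresponding constraint is met.

```python
def edge_triggered_slacks(tpd_max, tpd_min, tper, tsu, th, tc2q, tskew):
    """Slacks for Max(tpd) < tper - tsu - tc2q - tskew and Min(tpd) > th - tc2q + tskew."""
    setup_slack = (tper - tsu - tc2q - tskew) - tpd_max
    hold_slack = tpd_min - (th - tc2q + tskew)
    return setup_slack, hold_slack

# Hypothetical design point, all values in ps.
ok = edge_triggered_slacks(tpd_max=750, tpd_min=20, tper=1000,
                           tsu=50, th=30, tc2q=100, tskew=60)
# Same logic with more skew: both constraints now fail.
bad = edge_triggered_slacks(tpd_max=750, tpd_min=20, tper=1000,
                            tsu=50, th=30, tc2q=100, tskew=150)
```

Note that only the setup slack depends on tper, so a hold failure cannot be repaired by slowing the clock.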
Example of tpdmax Violation
Suppose there is skew between the registers in a dataflow (regA after regB)
“i” gets its input values from regA at transition in Ck’
CL output “o” arrives after Ck transition due to skew
To correct this problem, can increase cycle time
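[Figure: regA (clocked by Ck’) drives “i” through combinational logic with delay tpdmax to “o” at regB (clocked by Ck); waveforms of Ck’, Ck, i, o show tskew between the clocks and “o” arriving too late]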
Example of tpdmin Violation: Race Through
Suppose clock skew causes regA to be clocked before regB
“i” passes through the CL with little delay (tpdmin)
“o” arrives before the rising Ck’ causes the data to be latched
This problem cannot be fixed by changing frequency: you have a rock instead of a chip
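[Figure: regA (clocked by Ck) drives “i” through combinational logic with delay tpdmin to “o” at regB (clocked by Ck’); waveforms of Ck, Ck’, i, o show tskew between the clocks and “o” arriving too early]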
Time Borrowing (Cycle Stealing)
Cycle steal with flip-flops using delayed clocks
[Figure: flip-flops separated by combinational logic; intentional delay = skew on the clock]
tpd < tper + tw
Time borrowing with latches
[Figure: latches separated by combinational logic, clocked by Ck]
tpd > tper
Give it back in later stages
Tpd is safely > tpdmin
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Clock Distribution
General goal of clock distribution
Deliver clock to all memory elements with acceptable skew
Deliver clock edges with acceptable sharpness
Clocking network design is one of the greatest challenges in the design of a large chip
Clocks generally distributed via wiring trees (and meshes)
Low-resistance interconnect to minimize delay
Multiple drivers to distribute driver requirements
Use optimal sizing principles to design buffers
Clock lines can create significant crosstalk
Clock Distribution Problem Statement
Objective
Minimum skew (performance and hold time issues)
Minimum cell area and metal use
(sometimes) minimal latency
(sometimes) particular latency
(sometimes) intermixed gating for power reduction
(sometimes) hold to particular duty cycle: e.g. 50:50 +- 1 percent
Subject to:
Process variation from lot-to-lot
Process variation across the die
Radically different loading (ff density) around the die
Metal variation across the die
Power variation across the die (both static IR and dynamic)
Coupling (same and other layers)
Issues in Clock Distribution Network Design
Skew
Process, voltage, and temperature
Data dependence
Noise coupling
Load balancing
Power, CV²f – (no ½ or α)
Clock gating
Flexibility/Tunability
Compactness – fit into existing layout/design
Reliability
Electromigration
Skew: Clock Delay Varies With Position
Clock Distribution Methods
RC-Tree
Less capacitance
More accuracy
Flexible wiring
Grids
Reliable
Less data dependency
Tunable (late in design)
Shown here for final stage drivers driving F/F loads
RC-Trees
[Figure: H-Tree, X-Tree, Binary-Tree]
Asymmetric trees can be and are used due to uneven sink distribution, hard macros in floorplan (hierarchical clock distribution), etc.; the basic goal is to have even RC delays
Grids
Gridded clock distribution common on earlier DEC Alpha microprocessors
Advantages:
Skew determined by grid density, not too sensitive to load position
Clock signals available everywhere
Tolerant to process variations
Usually yields extremely low skew values
Disadvantages:
Huge amount of wiring and power
To minimize such penalties, need to make grid pitch coarser, which loses the grid advantage
[Figure: pre-drivers driving a global grid]
Sylvester / Shepard, 2001
Trees
H-tree (Bakoglu)
One large central driver, recursive structure to match wirelengths
Halve wire width at branching points to reduce reflections
Disadvantages
Slew degradation along long RC paths
Unrealistically large central driver
- Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C)
Non-uniform load distribution
Inherently non-scalable (wire R growth)
Partial solution: intermediate buffers at branching points
courtesy of P. Zarkesh-Ha
Sylvester / Shepard, 2001
Buffered Tree
[Figure: PLL feeding L2 global buffers (WGBuf, EGBuf, NGBuf, SGBuf) and L3 buffers; each final buffer drives all clock loads within its region, and other branches go to other regions of the chip]
Sylvester / Shepard, 2001
Buffered H-tree
Advantages
Ideally zero-skew
Can be low power (depending on skew requirements)
Low area (silicon and wiring)
CAD tool friendly (regular)
Disadvantages
Sensitive to process variations
Local clocking loads inherently non-uniform
Sylvester / Shepard, 2001
Tree Balancing
Some techniques:
a) Introduce dummy loads
b) Snaking of wirelength to match delays
Con: Routing area often more valuable than Silicon
Sylvester / Shepard, 2001
Examples of Distribution
H-Tree, Asymmetric RC-Tree (IBM)
Grids: DEC [Alphas]
Serpentines: Intel x86 [Young ISSCC97]
Examples From Processor Chips
DEC-Alpha 21064 clock spines
DEC-Alpha 21064 RC delays
DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)
DEC-Alpha 21164 RC local delays
ReShape Clocks Example
Balanced, shielded H-tree for pre-clock distribution
Mesh for Block level distribution
output mesh
Pre-clock 2 Level H-tree
All routes 5-6u M6/5, shielded with 1u grounds
~10 buffers per node
output mesh must hit every sub-block
Block Level Mesh (.18u)
Max 600u stride
1u m5 ribs every 20 - 30 u (4 to 6 rows)
Shielded input and output m6 shorting straps
Clumps of 1-6 clock buffers, surrounded by capacitor pads
Pre-clock connects to input shorting straps
Problems with Meshes
Burn more power at low frequencies
Blocks more routing resources (solution, integrated power distribution with ribs can provide shielding for ‘free’)
Difficult for ‘spare’ clock domains that will not tolerate regioning
Post placement (and routing) tuning required
No ‘beneficial skew’ (shudder) possible
Problems with Meshes (#2)
Clock gating only easy at root
Fighting tools to do analysis:
Clumped buffers a problem in Static Timing Analysis tools
Large shorted meshes a problem for STA tools
Need Full extractions and Spice-Like simulation (e.g. Avant! Star-Sim) to determine skew
Benefits of Meshes (#3)
Deterministic since shielded all the way down to rib distribution
No ecoplacement required: all buffers preplaced before block placement
Low latency since uses shorted drivers, therefore lower skew
Ecoplacements of FFs later do not require rebalance of tree
“Idealized” clocking environment for concurrent RTL design and timing convergence dance.
Mesh Example
~ 100k flops
6 blocks
Clock Skew Thermal Map
Pre-tuning
Clock Skew Thermal Map #2
50ps block/ 100ps global skew, post tuning
Alternative Clock Network Strategy
Globally – Tree
Power requirements reduced relative to global grid
Smaller routing requirements, frees up global tracks
Trees balanced easily at global level
Keeps global skew low (with minimal process variation)
Sylvester / Shepard, 2001
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Skew Reduction Using Package
• Most clock network latency occurs at global level (largest distances spanned)
• Latency ∝ Skew
• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming
Sylvester / Shepard, 2001
Skew Reduction Using Package
[Figure: system clock routed in the package substrate and delivered to the µP/ASIC through solder bumps]
⇒ Incorporate global clock distribution into the package
⇒ Flip-chip packaging allows for high density, low parasitic access from substrate to IC
• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring
• Global skew reduced
• Lower capacitance ⇒ lower power
• Opens up global routing tracks
• Results not yet conclusive
Sylvester / Shepard, 2001
Useful Skew (= cycle-stealing)
Zero skew
• Global skew constraint
• All skew is bad
Useful skew
• Local skew constraints
• Shift slack to critical paths
[Figure: three FFs with a fast path followed by a slow path; timing slacks (hold, setup) for each path are shown under zero skew and under useful skew]
W. Dai, UC Santa Cruz
Skew = Local Constraint
D: longest path, d: shortest path (between two FFs)
-d + thold < Skew < Tperiod - D - tsetup
[Figure: skew axis showing the permissible range (safe), with race condition below the lower bound and cycle time violation above the upper bound]
Timing is correct as long as the signal arrives in the permissible skew range
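The permissible range can be computed per FF pair. In this sketch the function names and numbers are hypothetical; deriving the bounds from the setup and hold conditions suggests that skew here is the launching FF's clock arrival minus the capturing FF's clock arrival, which is stated as an assumption rather than taken from the slide.

```python
def permissible_skew_range(d_short, d_long, t_period, t_setup, t_hold):
    """Open interval (lower, upper) from -d + thold < skew < Tperiod - D - tsetup."""
    lower = -d_short + t_hold
    upper = t_period - d_long - t_setup
    return lower, upper

def classify_skew(skew, lower, upper):
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"

# Hypothetical FF pair, all values in ns.
lower, upper = permissible_skew_range(d_short=2, d_long=5, t_period=8,
                                      t_setup=1, t_hold=1)
```

The range is local to one pair of FFs, which is why a single global skew bound is more restrictive than necessary.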
W. Dai, UC Santa Cruz
Skew Scheduling for Design Robustness
[Figure: three FFs in series with path delays of 2 ns and 6 ns, T = 6 ns, annotated with timing margins for each clock schedule]
“0 0 0”: at verge of violation
“2 0 2”: more safety margin
Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on the edge
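A small sketch of why a schedule such as “2 0 2” is more robust, under stated assumptions: the three numbers are read as clock arrival times (ns) at the three FFs, setup time is taken as zero, and hold constraints are ignored because the shortest-path delays are not given. The function name is hypothetical.

```python
def setup_slacks(clock_arrivals, path_delays, t_period, t_setup=0):
    """Setup slack of each FF-to-FF stage in a chain.

    Stage i launches at clock_arrivals[i] and is captured one period later
    at clock_arrivals[i + 1].
    """
    return [t_period + clock_arrivals[i + 1] - clock_arrivals[i]
            - path_delays[i] - t_setup
            for i in range(len(path_delays))]

zero_skew = setup_slacks([0, 0, 0], [2, 6], t_period=6)
useful_skew = setup_slacks([2, 0, 2], [2, 6], t_period=6)
```

With zero skew the 6 ns path has no margin at all, while the skewed schedule moves 2 ns of slack from the fast stage to the slow one.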
W. Dai, UC Santa Cruz
Potential Advantages of Useful Skew
[Figure: CLK waveforms for 0-skew and U-skew clocking]
Reduce peak current consumption by distributing the FF switch point in the range of permissible skew
Can exploit extra margin to increase clock frequency or reduce sizing (= power)
W. Dai, UC Santa Cruz
Conventional Zero-Skew Flow
[Flow diagram with boxes: Placement; Synthesis; Extraction & Delay Calculation; Static Timing Analysis; 0-Skew Clock Synthesis; Clock Routing; Signal Routing]
W. Dai, UC Santa Cruz
Useful-Skew Flow
[Flow diagram with boxes: Existing Placement; Extraction & Delay Calculation; Static Timing Analysis; U-Skew Clock Synthesis; Clock Routing; Signal Routing]
[Detail boxes: Permissible range generation; Initial skew scheduling; Clock tree topology synthesis; Clock net routing; Clock timing verification]
W. Dai, UC Santa Cruz
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Clock Power
Power consumption in clocks due to:
Clock drivers
Long interconnections
Large clock loads – all clocked elements (latches, FF’s) are driven
Different components dominate
Depending on type of clock network used
Ex. Grid – huge pre-drivers & wire cap. drown out load cap.
Sylvester / Shepard, 2001
Clock Power Is LARGE
Not only is the clock capacitance large, it switches every cycle!
P = α C Vdd² f
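To get a feel for the formula, the sketch below compares a clock net (activity factor 1, since it switches every cycle) with logic of the same capacitance at a lower activity factor. All numbers are hypothetical and chosen only for round results.

```python
import math

def dynamic_power(alpha, c_farads, vdd_volts, f_hz):
    """P = alpha * C * Vdd^2 * f, in watts."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz

# Hypothetical: 1 nF of switched capacitance, Vdd = 1.5 V, f = 1 GHz.
clock_w = dynamic_power(alpha=1.0, c_farads=1e-9, vdd_volts=1.5, f_hz=1e9)
logic_w = dynamic_power(alpha=0.1, c_farads=1e-9, vdd_volts=1.5, f_hz=1e9)
```

For equal capacitance, the clock dissipates ten times the power of logic that switches on 10% of the cycles.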
Sylvester / Shepard, 2001
Low-Power Clocking
Gated clocks
Prevent switching in areas of chip not being used
Easier in static designs
Edge-triggered flops in ARM rather than transparent latches in Alpha
Reduced load on clock for each latch/flop
Eliminated spurious power-consuming transitions during latch flow-through
Sylvester / Shepard, 2001
Clock Area
Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area
Routing area is most vital
Top-level metals are used to reduce RC delays
These levels are precious resources (unscaled)
Power routing, clock routing, key global signals
Reducing area also reduces wiring capacitance and power
Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing
Sylvester / Shepard, 2001
Clock Slew Rates
To maintain signal integrity and latch performance, minimum slew rates are required
Too slow – clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew(ps)], more short-circuit power for large clock drivers
Too fast – burns too much power, overdesigned network, enhanced ground bounce
Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target)
1 GHz clock; Trise = Tfall = 100-200ps
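The rule of thumb and the example setup-time dependence quoted above can be evaluated together. The function names are hypothetical, and the setup formula is only the illustrative one given in brackets on this slide, not a general model.

```python
import math

def edge_time_budget_ps(f_hz, low=0.10, high=0.20):
    """Rise/fall time window as a fraction of the clock period, in ps."""
    period_ps = 1e12 / f_hz
    return low * period_ps, high * period_ps

def setup_time_ps(t_slew_ps):
    """Example dependence from the slide: Tsetup = 200 + 0.33 * Tslew (ps)."""
    return 200 + 0.33 * t_slew_ps

fast_ps, slow_ps = edge_time_budget_ps(1e9)     # 1 GHz clock
setup_fast = setup_time_ps(fast_ps)
setup_slow = setup_time_ps(slow_ps)
```

Going from the aggressive to the relaxed end of the window adds about 33 ps of setup time under this example formula.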
Sylvester / Shepard, 2001
Example: Alpha 21264
Grid + H-tree approach
Power = 32% of total
Wire usage = 3% of metals 3 & 4
4 major clock quadrants, each with a large driver connected to local grid structures
Sylvester / Shepard, 2001
Alpha 21264 Skew Map
Ref: Compaq, ASP-DAC00
Sylvester / Shepard, 2001
Power vs. Skew
Fundamental design decision
Meeting skew requirements is easy with unlimited power budget
Wide wires reduce RC product but increase total C
Driver upsizing reduces latency (⇒ reduces skew as well) but increases buffer cap
SOC context: plastic package power limit is 2-3 W
Sylvester / Shepard, 2001
Clock Distribution Trends
Timing
Clock period dropping fast, skew must follow
Slew rates must also scale with cycle time
Jitter – PLL’s get better with CMOS scaling but other sources of noise increase
- Power supply noise more important
- Switching-dependent temperature gradients
Materials
Cu reduces RC slew degradation, potential skew
Low-k decreases power, improves latency, skew, slews
Power
Complexity, dynamic logic, pipelining ⇒ more clock sinks
Larger chips ⇒ bigger clock networks
Sylvester / Shepard, 2001
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Gate Timing Characterization
“Extract” exact transistor characteristics from layout
Transistor width, length, junction area and perimeter
Local wire length and inter-wire distance
Compute all transistor and wire capacitances
[Figure: gate with inputs A, B and output F driving load capacitance CL]
Cell Timing Characterization
Delay tables generated using a detailed transistor-level circuit simulator, SPICE (differential-equations solver)
For a number of different input slews and load capacitances, simulate the circuit of the cell
Propagation time (50% Vdd at input to 50% at output)
Output slew (10% Vdd at output to 90% Vdd at output)
[Figure: input and output voltage waveforms vs. time (up to Vdd) with tpd and tslew marked]
Non-linear effects reflected in tables
[Figure: two lookup tables indexed by input slew and output capacitance, one giving the delay at the gate (including intrinsic delay) and one giving the output slew, together with the resulting waveform]
DG = f (CL, Sin) and Sout = f (CL, Sin)
Non-linear
Interpolate between table entries
Interpolation error is usually below 10% of SPICE
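One common way to interpolate between table entries is bilinear interpolation over the (load, input slew) grid. The sketch below assumes that scheme; the function name, table values, and units are hypothetical and not taken from any real library, and points outside the grid are extrapolated linearly from the nearest cell.

```python
from bisect import bisect_right

def table_lookup(table, loads, slews, c_load, s_in):
    """Bilinear interpolation; table[i][j] is the value at (loads[i], slews[j])."""
    def bracket(axis, x):
        # Index pair around x, clamped to the first/last grid interval.
        hi = min(max(bisect_right(axis, x), 1), len(axis) - 1)
        lo = hi - 1
        return lo, hi, (x - axis[lo]) / (axis[hi] - axis[lo])

    i0, i1, u = bracket(loads, c_load)
    j0, j1, v = bracket(slews, s_in)
    return ((1 - u) * (1 - v) * table[i0][j0] + u * (1 - v) * table[i1][j0]
            + (1 - u) * v * table[i0][j1] + u * v * table[i1][j1])

# Hypothetical 2x2 delay table: rows = load, columns = input slew.
loads = [0.0, 1.0]
slews = [0.0, 2.0]
delay_table = [[10.0, 14.0],
               [20.0, 24.0]]
mid_delay = table_lookup(delay_table, loads, slews, c_load=0.5, s_in=1.0)
corner_delay = table_lookup(delay_table, loads, slews, c_load=1.0, s_in=2.0)
```

The same routine serves for the output-slew table, since both are functions of the same two indices.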
Conservatism of Gate Delay Modeling
True gate delay depends on input arrival time patterns
STA will assume that only 1 input is switching
Will use worst slope among several inputs
[Figure: gate with inputs A, B and output F driving load CL; waveforms vs. time (up to Vdd) compare tpd when A and B switch together with tpd when only A switches]
CL