CSE241 VLSI Digital Circuits Winter 2003 Lecture 06:...

CSE241 L3 ASICs.1 Kahng & Cichy, UCSD ©2003

CSE241 VLSI Digital Circuits

Winter 2003

Lecture 06: Timing


This Class + Logistics

Timing

Flip-flop timing

Clock distribution

Clock tree synthesis

Reading: White papers on static timing analysis, papers on clock tree synthesis

Lab #2 due date: Monday January 27th

Slide courtesy of S. P. Levitan, U. Pittsburgh


Review

Static timing analysis (Lecture 4)
- Pin-based timing graph
- Directed acyclic graph (DAG) of timing arcs
- Longest path in a DAG is found in time linear in #arcs (edges)
- Slack = required arrival time – actual arrival time (long path analysis)
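
A minimal sketch of the longest-path pass reviewed above, assuming a small hypothetical timing graph (the pin names, delays, and RAT are illustrative, not taken from the lecture). Arrival times are propagated in topological order, so each arc is visited exactly once.

```python
from collections import defaultdict, deque

def arrival_times(arcs, primary_inputs):
    """Late-mode arrival times: one topological pass, linear in #arcs."""
    fanout = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set(primary_inputs)
    for src, dst, delay in arcs:
        fanout[src].append((dst, delay))
        indeg[dst] += 1
        nodes.update((src, dst))
    at = {n: primary_inputs.get(n, float("-inf")) for n in nodes}
    ready = deque(n for n in nodes if indeg[n] == 0)
    while ready:
        n = ready.popleft()
        for dst, delay in fanout[n]:
            at[dst] = max(at[dst], at[n] + delay)  # keep the latest arrival
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)
    return at

# Hypothetical graph: (source pin, sink pin, arc delay)
arcs = [("a", "g1", 2), ("b", "g1", 1), ("g1", "z", 3), ("c", "z", 5)]
at = arrival_times(arcs, {"a": 0, "b": 0, "c": 0})
rat_z = 6                      # assumed required arrival time at output z
slack_z = rat_z - at["z"]      # Slack = RAT - AAT
```

With these assumed numbers both paths reach z at time 5, so the slack at z is 1.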

Logic synthesis (Lecture 5)

Slide courtesy of S. P. Levitan, U. Pittsburgh


Static Analysis vs. Dynamic Analysis

[Figure: gate with inputs a, b, c and output z]

Delay from a to z for each state of inputs b and c:

        c=0          c=1
b=0     a-z delay1   a-z delay2
b=1     a-z delay3   a-z delay4

Why static analysis when dynamic simulation is more accurate?

Drawbacks of simulation
- Requires input vectors (stimuli for circuit)
- Long runtimes

Example: calculate worst-case rising delay from a to z
- Exponential explosion with number of possible design input states


[Figure: voltage vs. time waveform with the 10%, 50%, and 90% points of Vdd marked]

STA Terminology

(Actual) arrival time (AAT, or AT) = time at which a pin switches state
- Usually 50% point on voltage curve, i.e., AT = t50

Slew time = time over which signal switches
- Usually difference between 10% and 90% on voltage curve, i.e., tslew = t90 – t10

Required arrival time (RAT) = time at which a signal must arrive in order to avoid a chip fail

Slack = RAT – AAT
- Positive slack good (= margin), negative slack bad
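
The definitions above can be sketched on a sampled rising waveform. This is an illustration under assumptions: the sample points, Vdd, and the RAT are made up, and crossings are found by linear interpolation between samples.

```python
def crossing_time(times, volts, level):
    """First time a rising waveform crosses `level` (linear interpolation)."""
    points = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if v1 > v0 and v0 <= level <= v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")

vdd = 1.0                                   # assumed supply voltage
times = [0.0, 1.0, 2.0, 3.0, 4.0]           # hypothetical sample times
volts = [0.0, 0.0, 0.5, 1.0, 1.0]           # hypothetical rising edge

t10 = crossing_time(times, volts, 0.1 * vdd)
t50 = crossing_time(times, volts, 0.5 * vdd)
t90 = crossing_time(times, volts, 0.9 * vdd)

arrival = t50                # AT = t50
slew = t90 - t10             # tslew = t90 - t10
rat = 2.5                    # assumed required arrival time
slack = rat - arrival        # positive slack is margin
```

For this waveform the arrival time is 2.0, the slew is about 1.6, and the slack against the assumed RAT is 0.5.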


Example: What is slack at PO?

[Figure: timing graph annotated with gate delays (d) and arrival times (at); primary inputs have at=0; the primary output has at=11 against rat=10, so Slack = -1]


Example: Incremental Timing Analysis

[Figure: the same timing graph after a local change; the affected arrival times are updated (at=3, at=7, at=10) and, with rat=10, the slack at the primary output becomes 0]

Amount of work is bounded by sizes of fanin, fanout cones of logic


Early-Mode Analysis

Definitions change as follows
- RAT = lower bound on arrival time
- Propagate shortest possible instead of longest possible delays
- Slack = Arrival – Required

[Figure: small network with pins a, b, c, x, y and two delays of 1]

AT_a = 0, AT_b = 1, AT_c = 0, AT_x = 1, AT_y = 1, RAT_x = 2

SL_a = 0 – 0 = 0

SL_b = 1 – 0 = 1

SL_c = 0 – 1 = –1

SL_x = 1 – 2 = –1

SL_y = 1 – 1 = 0

Example: negative slack because AT_c is too small (early)
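
A minimal sketch of early-mode propagation. The topology is an assumption (the figure is not recoverable): gate y has inputs a and b, gate x has inputs y and c, and each gate has delay 1. With input arrivals 0, 1, 0 for a, b, c and a lower-bound RAT of 2 at x, this assumed network gives negative early-mode slack at c and x.

```python
def early_mode(arcs, order, at_inputs, rat_outputs):
    """Early-mode STA: min arrival forward, max required time backward."""
    at = dict(at_inputs)
    for n in order:                           # forward pass, topological order
        incoming = [at[s] + d for s, t, d in arcs if t == n]
        if incoming:
            at[n] = min(incoming)             # shortest possible delay
    rat = dict(rat_outputs)
    for n in reversed(order):                 # backward pass
        outgoing = [rat[t] - d for s, t, d in arcs if s == n and t in rat]
        if outgoing:
            rat[n] = max(outgoing)            # tightest lower bound
    slack = {n: at[n] - rat[n] for n in order if n in rat}  # Arrival - Required
    return at, rat, slack

# Assumed topology: y = gate(a, b), x = gate(y, c), each gate delay 1
arcs = [("a", "y", 1), ("b", "y", 1), ("y", "x", 1), ("c", "x", 1)]
order = ["a", "b", "c", "y", "x"]             # a topological order of the pins
at, rat, slack = early_mode(arcs, order, {"a": 0, "b": 1, "c": 0}, {"x": 2})
```

Pin c arrives at time 0 but must not arrive before time 1, which is the source of the negative slack on the c-to-x path.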


Enhancements of STA

Incremental timing analysis

Nanometer-scale process effects – variation (probabilistic timing analysis)

Interference – crosstalk

Multiple inputs switching

Conservatism of delay propagation

HW #8: Suppose you change the size of one (combinational) gate in your design, thus invalidating the previous timing analysis. How much work must be done to regain a correct timing analysis?

Courtesy K. Keutzer et al. UCB


Timing Correction

Driven by STA
- “Incremental performance analysis backplane”

Fix electrical violations
- Resize cells
- Buffer nets
- Copy (clone) cells

Fix timing problems
- Local transforms (bag of tricks)
- Path-based transforms

DAC-2002, Physical Chip Implementation


Local Synthesis Transforms

Resize cells

Buffer or clone to reduce load on critical nets

Decompose large cells

Swap connections on commutative pins or among equivalent nets

Move critical signals forward

Pad early paths

Area recovery



Transform Example

[Figure: double inverter removal; path delay drops from 4 to 2]



Resizing

[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C; a gate with inputs a, b drives loads d, e, f of 0.2, 0.2, 0.3; the delay is 0.035 with size A and 0.026 with size C]



Cloning

[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C; a gate with inputs a, b drives loads d, e, f, g, h of 0.2 each; after cloning, the loads are split between cells A and B]



Buffering

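[Figure: delay d (0 to 0.05) vs. load (0 to 1) for cell sizes A, B, C; a gate with inputs a, b drives loads d, e, f, g, h of 0.2 each; after buffering, part of the load is moved behind an inserted buffer whose input load is 0.1]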



Redesign Fan-in Tree

[Figure: fan-in tree of unit-delay gates with input arrival times Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0; the original tree gives Arr(e)=6, the redesigned tree gives Arr(e)=5]
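
One way to arrive at such a restructured tree is a Huffman-style greedy: repeatedly combine the two earliest-arriving signals. This is a sketch under assumptions (two-input gates of unit delay, arrival times 4, 3, 1, 0 as on the slide); it is not necessarily the method used in the lecture.

```python
import heapq

def greedy_fanin_arrival(arrivals, gate_delay=1):
    """Combine the two earliest signals first; return output arrival time."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + gate_delay)
    return heap[0]

def balanced_arrival(arrivals, gate_delay=1):
    """Pair neighbors level by level (a balanced tree in the given order)."""
    level = list(arrivals)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            nxt.append(max(pair) + gate_delay if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]

arrivals = [4, 3, 1, 0]                    # Arr(a), Arr(b), Arr(c), Arr(d)
original = balanced_arrival(arrivals)      # (a,b) and (c,d), then combine
redesigned = greedy_fanin_arrival(arrivals)
```

The balanced tree yields an output arrival of 6 and the greedy tree yields 5, because the late signal a passes through only one gate.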



Redesign Fan-out Tree

[Figure: fan-out tree with annotated delays, before and after redesign; longest path drops from 5 to 4]

Slowdown of buffer due to load



Decomposition



Swap Commutative Pins

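[Figure: gate with commutative input pins driven by signals a, b, c, annotated with arrival times and pin delays, before and after swapping connections]

Simple sorting on arrival times and delay works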
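
The sorting idea can be sketched as: connect the latest-arriving signal to the fastest pin. The arrival times and pin delays below are hypothetical, and the brute-force check is only there to confirm the result on this small case.

```python
from itertools import permutations

def assign_by_sorting(arrivals, pin_delays):
    """Latest-arriving signal gets the pin with the smallest delay."""
    late_first = sorted(arrivals, key=arrivals.get, reverse=True)
    fast_first = sorted(pin_delays, key=pin_delays.get)
    return dict(zip(late_first, fast_first))

def output_arrival(assignment, arrivals, pin_delays):
    """Output arrival = latest (signal arrival + delay of its pin)."""
    return max(arrivals[s] + pin_delays[p] for s, p in assignment.items())

arrivals = {"a": 2, "b": 1, "c": 0}            # hypothetical arrival times
pin_delays = {"p1": 1, "p2": 1, "p3": 2}       # hypothetical pin-to-output delays

assignment = assign_by_sorting(arrivals, pin_delays)
sorted_result = output_arrival(assignment, arrivals, pin_delays)

# Exhaustive check over all pin assignments for this small example
best = min(
    output_arrival(dict(zip(arrivals, perm)), arrivals, pin_delays)
    for perm in permutations(pin_delays)
)
```

Here the sorted assignment sends the early signal c to the slow pin p3 and matches the exhaustive optimum.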



Outline

Clocking

Storage elements

Clocking metrics and methodology

Clock distribution

Package and useful-skew degrees of freedom

Clock power issues

Gate timing models


Why Clocks?

Clocks provide the means to synchronize

By allowing events to happen at known timing boundaries, we can sequence these events

Greatly simplifies building of state machines

No need to worry about variable delay through combinational logic (CL)

All signals delayed until clock edge (clock imposes the worst case delay)

[Figure: dataflow (registers separated by combinational logic) and FSM (combinational logic with register feedback)]


Clock Cycle Time

Cycle time is determined by the delay through the CL
- Signal must arrive before the latching edge
- If too late, it waits until the next cycle
  - Synchronization and sequential order become incorrect

tcycle > tprop_delay + toverhead

Can change circuit architecture to obtain smaller tcycle
- Pipelining
- Parallelism


Pipelining

For dataflow:
- Instead of a long critical path, split the critical path into chunks
- Insert registers to store intermediate results
- This allows 2 waves of data to coexist within the CL

Can we extend this ad infinitum?
- Overhead eventually limits the pipelining
  - E.g., 1.5 to 2 gate delays for latch or FF
- Granularity limits as well
  - Minimum time quantum: delay of a gate

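[Figure: combinational logic A+B between two registers (delay tpd), and the same logic split into stage A (tpd1) and stage B (tpd2) with a register inserted between them]

Unpipelined: tcycle > tpd + toverhead

Pipelined: tcycle > max(tpd1, tpd2) + toverhead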
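
A short sketch of the cycle-time bound for a pipelined datapath, tcycle > max(stage delays) + toverhead. The delays are hypothetical and expressed in gate delays; the overhead of 2 follows the "1.5 to 2 gate delays" figure quoted for a latch or FF.

```python
def min_cycle_time(stage_delays, overhead):
    """Lower bound on cycle time: slowest stage plus register overhead."""
    return max(stage_delays) + overhead

overhead = 2                                          # gate delays, assumed
unpipelined = min_cycle_time([10], overhead)          # one 10-gate-delay block
balanced = min_cycle_time([5, 5], overhead)           # split evenly in two
unbalanced = min_cycle_time([7, 3], overhead)         # uneven split
deep = min_cycle_time([1] * 10, overhead)             # one gate per stage
```

The even split gives the best two-stage result, and at one gate per stage the overhead is twice the useful work in each cycle, which is the limit the slide describes.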


Parallelism

For FSMs:
- Same functionality and performance can be achieved at half the clock rate
- However, the input and output signals must be doubled (to account for the outputs for each original cycle)
- Instead of doubling the delay, the optimized logic is often logarithmically related to the degree of parallelism

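[Figure: original FSM (M-bit register, one CL block of delay tpd); unrolled FSM with cascaded CL blocks; unrolled FSM with optimized CL and 2*M-bit inputs and outputs]

Original: tcycle1 > tpd + tov

Unrolled: tcycle2 > N tpd + tov

Unrolled and optimized: tcycle3 > log(N tpd) + tov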




Storage Elements

Latches
- Level sensitive – transparent when H, hold when L

Flip-flops
- Edge-triggered – data is sampled at the clock edge

[Figure: latch and flip-flop symbols and circuits with pins d, ck, ckb, q]
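
The difference between the two storage elements can be sketched with an idealized, zero-delay behavioral model. The clock and data sequences are hypothetical, and the flip-flop is assumed to sample the data value present in the same time step as the rising edge.

```python
def simulate_latch(clock, data, q0=0):
    """Active-high latch: transparent while clock is 1, holds while 0."""
    q, out = q0, []
    for ck, d in zip(clock, data):
        if ck:
            q = d
        out.append(q)
    return out

def simulate_flip_flop(clock, data, q0=0):
    """Rising-edge flip-flop: samples data only on a 0 -> 1 clock transition."""
    q, prev_ck, out = q0, 0, []
    for ck, d in zip(clock, data):
        if ck and not prev_ck:
            q = d
        prev_ck = ck
        out.append(q)
    return out

clock = [0, 1, 1, 0, 0, 1, 1, 0]
data = [1, 1, 0, 0, 1, 0, 1, 1]
latch_q = simulate_latch(clock, data)         # follows data while clock is high
ff_q = simulate_flip_flop(clock, data)        # changes only at rising edges
```

In the second step of each high phase the latch output follows the new data value, while the flip-flop keeps the value captured at the edge.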


Latch and Flip-Flop Gates

[Figure: gate-level schematics of an active high latch (in, enable, out) and a rising edge flip-flop (D, clock, Q, QN)]

Latch and flip-flop schematics from TSMC 0.13um LV Artisan Sage-X Standard Cell Library.


Latch and Flip-Flop Behavior

[Figure: signal paths through an active high latch and a rising edge flip-flop (pins D, Q, QN) when clock is high and when clock is low]

Active high latch: tDQ, 2 inverter delays

Rising edge flip-flop: tCQ, 4 inverter delays


Clock Skew and Jitter

[Figure, panels (a) to (c): clock waveforms at points A and B]

Cycle-to-cycle edge jitter: edge uncertainty of tj/2 on each side, shortest period T – tj

Duty cycle jitter: shortest high time Thigh – tduty

Clock skew: tsk,AB between the clock at A and the clock at B


Flip-Flop Timing Characteristics

[Figure: rising edge flip-flops A and B separated by combinational logic, driven by a non-ideal clock of period T]

Setup time constraint: tCQ,max + tcomb,max + tsu + tsk + tj must fit within T

Hold time constraint: tCQ,min + tcomb,min must cover tsk + th


Latch Setup Time and Transparency

[Figure: active high latches A and B driven by a non-ideal clock; timing diagrams show tCQ, tDQ, tcomb (tcomb,max), tsu, tsk + tj, and tduty]

Setup time constraint

No penalty to clock period for setup time constraint!




Setup Time

Important characteristics of storage elements
- Setup time, hold time, clock-to-q delay

Setup time, tsu
- Time before the clock edge that the data must arrive in order for the new data to be stored
- The setup time for a F/F occurs before the latching edge
- The setup time for a latch occurs before the transition from transparent to hold

[Figure: timing diagram of ck, d, q showing tsetup]


Hold Time

A second important characteristic is the hold time, th
- Time after the clock edge that the data must remain stable in order for the data to be properly held
- Note that hold time (and setup time) can be negative

Why isn’t hold time just the negative of setup time?
- Storage elements typically have some data dependence
  - Capacitances, and devices may be faster for one data value versus another
- Specify the worst case for process technology and operating condition variations

[Figure: timing diagram of ck, d, q showing thold]


Clocking Overhead

Inherent delay in any storage element

The delay is measured from
- Clock transition to output data transition, tc2q
- Input data transition to output data transition, td2q

Flip-flop is edge triggered
- The overhead is tc2q + tsu

Latch is level-sensitive
- The overhead is td2q

[Figure: timing diagram of ck, d, q showing tc2q and td2q]


Clock Skew

Most “high-profile” of clock network metrics

Maximum difference in arrival times of clock signal to any 2 latches/FF’s fed by the network

Skew = max | t1 – t2 |

[Figure: clock source (ex. PLL) driving CLK1 and CLK2; waveforms vs. time show the latency from the source and the skew between arrival times t1 and t2]

Fig. from Zarkesh-Ha

Sylvester / Shepard, 2001


Clock Skew Causes

Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths

Process variation – process spread across die yielding different Leff, Tox, etc. values

Temperature gradients – changes MOSFET performance across die

IR voltage drop in power supply – changes MOSFET performance across die

Note: Delay from clock generator to fan-out points (clock latency) is not important by itself

BUT: increased latency leads to larger skew for same amount of relative variation

Sylvester / Shepard, 2001


Clock Jitter

Clock network delay uncertainty
- From one clock cycle to the next, the period is not exactly the same each time
- Maximum difference in phase of clock between any two periods is jitter
- Must be considered in max path (setup) timing; typically O(50ps) for high-end designs



Clock Jitter Causes

PLL oscillation frequency

Various noise sources affecting clock generation and distribution

E.g., power supply noise dynamically alters drive strength of intermediate buffer stages
- Jitter reduced by minimizing IR and L*(di/dt) noise

Courtesy Cypress Semi



Clocking Methodology (Edge-Triggered)

Max(tpd) < tper – tsu – tc2q – tskew
- Otherwise, delay is too long for data to be captured

Min(tpd) > th – tc2q + tskew
- Otherwise, delay is too short and data can race through, skipping a state

[Figure: two flip-flops clocked with period tper, with combinational logic between them]
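
The two edge-triggered constraints can be checked numerically. This is a sketch with hypothetical values in picoseconds; the function names are illustrative. A negative result means the corresponding constraint is violated.

```python
def setup_slack(t_per, t_su, t_c2q, t_skew, tpd_max):
    """Margin on Max(tpd) < tper - tsu - tc2q - tskew."""
    return (t_per - t_su - t_c2q - t_skew) - tpd_max

def hold_slack(t_h, t_c2q, t_skew, tpd_min):
    """Margin on Min(tpd) > th - tc2q + tskew (independent of tper)."""
    return tpd_min - (t_h - t_c2q + t_skew)

# Hypothetical numbers, all in ps
setup_ok = setup_slack(t_per=1000, t_su=50, t_c2q=100, t_skew=50, tpd_max=750)
hold_ok = hold_slack(t_h=30, t_c2q=100, t_skew=50, tpd_min=20)
hold_bad = hold_slack(t_h=30, t_c2q=100, t_skew=150, tpd_min=20)
```

The hold check does not take the clock period as an input, so a hold violation caused by larger skew (the last case) cannot be repaired by slowing the clock.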


Example of tpdmax Violation

Suppose there is skew between the registers in a dataflow (regA after regB)

“i” gets its input values from regA at transition in Ck’

CL output “o” arrives after Ck transition due to skew

To correct this problem, can increase cycle time

[Figure: regA and regB with combinational logic between them; waveforms of Ck’, Ck, i, and o show that, with skew tskew, o settles tpdmax after i changes and arrives too late]


Example of tpdmin Violation: Race Through

Suppose clock skew causes regA to be clocked before regB

“i” passes through the CL with little delay (tpdmin)

“o” arrives before the rising edge of Ck’, so the new data is latched one cycle too early

This problem cannot be fixed by changing frequency: you have a rock instead of a chip

[Figure: regA and regB with combinational logic between them; waveforms of Ck, Ck’, i, and o show that, with skew tskew, o changes tpdmin after i and arrives too early]


Time Borrowing (Cycle Stealing)

Cycle steal with flip-flops using delayed clocks

Time borrowing with latches

[Figure: flip-flop and latch pipelines with combinational logic between stages; annotations: intentional delay = skew; tpd < tper + tw; tpd > tper, give it back in later stages; tpd is safely > tpdmin]




Clock Distribution

General goal of clock distribution
- Deliver clock to all memory elements with acceptable skew
- Deliver clock edges with acceptable sharpness

Clocking network design is one of the greatest challenges in the design of a large chip

Clocks generally distributed via wiring trees (and meshes)

Low-resistance interconnect to minimize delay

Multiple drivers to distribute driver requirements
- Use optimal sizing principles to design buffers
- Clock lines can create significant crosstalk


Clock Distribution Problem Statement

Objective
- Minimum skew (performance and hold time issues)
- Minimum cell area and metal use
- (sometimes) minimal latency
- (sometimes) particular latency
- (sometimes) intermixed gating for power reduction
- (sometimes) hold to particular duty cycle: e.g., 50:50 ± 1 percent

Subject to:
- Process variation from lot to lot
- Process variation across the die
- Radically different loading (ff density) around the die
- Metal variation across the die
- Power variation across the die (both static IR and dynamic)
- Coupling (same and other layers)


Issues in Clock Distribution Network Design

Skew
- Process, voltage, and temperature
- Data dependence
- Noise coupling
- Load balancing

Power, CV²f (no ½ or α)
- Clock gating

Flexibility/Tunability
- Compactness – fit into existing layout/design

Reliability
- Electromigration


Skew: Clock Delay Varies With Position


Clock Distribution Methods

RC-Tree
- Less capacitance
- More accuracy
- Flexible wiring

Grids
- Reliable
- Less data dependency
- Tunable (late in design)

Shown here for final stage drivers driving F/F loads


RC-Trees

H-Tree X-Tree Binary-Tree

Asymmetric trees can be and are used due to uneven sink distribution, hard macros in the floorplan (hierarchical clock distribution), etc.; the basic goal is to have even RC delays


Grids

Gridded clock distribution common on earlier DEC Alpha microprocessors

Advantages:
- Skew determined by grid density, not too sensitive to load position
- Clock signals available everywhere
- Tolerant to process variations
- Usually yields extremely low skew values

Disadvantages:
- Huge amount of wiring and power
- To minimize such penalties, need to make grid pitch coarser ⇒ lose the grid advantage

[Figure: pre-drivers driving a global grid]



Trees

H-tree (Bakoglu)
- One large central driver, recursive structure to match wirelengths
- Halve wire width at branching points to reduce reflections

Disadvantages
- Slew degradation along long RC paths
- Unrealistically large central driver
  - Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C)
- Non-uniform load distribution
- Inherently non-scalable (wire R growth)
- Partial solution: intermediate buffers at branching points

courtesy of P. Zarkesh-Ha



Buffered Tree

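[Figure: buffered clock tree; the PLL feeds level-2 (L2) global buffers WGBuf, EGBuf, NGBuf, and SGBuf, which feed level-3 (L3) buffers; each L3 buffer drives all clock loads within its region, with other branches going to other regions of the chip]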



Buffered H-tree

Advantages
- Ideally zero-skew
- Can be low power (depending on skew requirements)
- Low area (silicon and wiring)
- CAD tool friendly (regular)

Disadvantages
- Sensitive to process variations
- Local clocking loads inherently non-uniform



Tree Balancing

Some techniques:

a) Introduce dummy loads

b) Snaking of wirelength to match delays

Con: Routing area often more valuable than silicon



Examples of Distribution

H-Tree, Asymmetric RC-Tree (IBM)

Grids: DEC [Alphas]

Serpentines: Intel x86 [Young ISSCC97]


Examples From Processor Chips

DEC-Alpha 21064 clock spines

DEC-Alpha 21064 RC delays

DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)

DEC-Alpha 21164 RC local delays


ReShape Clocks Example

Balanced, shielded H-tree for pre-clock distribution

Mesh for Block level distribution


output mesh

Pre-clock 2 Level H-tree

All routes 5-6u M6/5, shielded with 1u grounds

~10 buffers per node

output mesh must hit every sub-block


Block Level Mesh (.18u)

Max 600u stride

1u m5 ribs every 20 - 30 u (4 to 6 rows)

Shielded input and output m6 shorting straps

Clumps of 1-6 clock buffers, surrounded by capacitor pads

Pre-clock connects to input shorting straps


Problems with Meshes

Burn more power at low frequencies

Blocks more routing resources (solution, integrated power distribution with ribs can provide shielding for ‘free’)

Difficult for ‘spare’ clock domains that will not tolerate regioning

Post placement (and routing) tuning required

No ‘beneficial skew’ (shudder) possible


Problems with Meshes (#2)

Clock gating only easy at root

Fighting tools to do analysis:Clumped buffers a problem in Static Timing Analysis toolsLarge shorted meshes a problem for STA tools

Need Full extractions and Spice-Like simulation (e.g. Avant! Star-Sim) to determine skew


Benefits of Meshes (#3)

Deterministic since shielded all the way down to rib distribution

No ecoplacement required: all buffers preplaced before block placement

Low latency since uses shorted drivers, therefore lower skew

Ecoplacements of FFs later do not require rebalance of tree

“Idealized” clocking environment for concurrent RTL design and timing convergence dance.


Mesh Example

~ 100k flops

6 blocks


Clock Skew Thermal Map

Pre-tuning


Clock Skew Thermal Map #2

50ps block/ 100ps global skew, post tuning


Alternative Clock Network Strategy

Globally – Tree

Power requirements reduced relative to global grid

Smaller routing requirements, frees up global tracks

Trees balanced easily at global level

Keeps global skew low (with minimal process variation)





Skew Reduction Using Package

• Most clock network latency occurs at global level (largest distances spanned)

• Latency ∝ Skew

• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming



[Figure: system clock routed in the package substrate and delivered to the µP/ASIC through solder bumps]

⇒ Incorporate global clock distribution into the package

⇒ Flip-chip packaging allows for high density, low parasitic access from substrate to IC

• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring

• Global skew reduced

• Lower capacitance ⇒ lower power

• Opens up global routing tracks

• Results not yet conclusive



Useful Skew (= cycle-stealing)

[Figure: three flip-flops with a fast path followed by a slow path, with hold and setup timing slacks for each path, shown for zero skew and for useful skew]

Zero skew
• Global skew constraint
• All skew is bad

Useful skew
• Local skew constraints
• Shift slack to critical paths

W. Dai, UC Santa Cruz


Skew = Local Constraint

D : longest path

d : shortest path

[Figure: two flip-flops; the skew axis is divided into race condition, safe (permissible range), and cycle time violation]

-d + thold < Skew < Tperiod - D - tsetup

Timing is correct as long as the signal arrives in the permissible skew range
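
A small sketch of the permissible range, using the bounds -d + thold < Skew < Tperiod - D - tsetup. The numeric values (in ns) are hypothetical; the function name is illustrative.

```python
def permissible_skew_range(t_period, d_long, d_short, t_setup, t_hold):
    """Return (lower, upper) bounds on skew between two flip-flops."""
    lower = -d_short + t_hold                  # below this: race condition
    upper = t_period - d_long - t_setup        # above this: cycle time violation
    return lower, upper

def is_safe(skew, bounds):
    """True if skew lies strictly inside the permissible range."""
    lower, upper = bounds
    return lower < skew < upper

# Hypothetical values in ns
bounds = permissible_skew_range(t_period=6.0, d_long=5.0, d_short=2.0,
                                t_setup=0.5, t_hold=0.5)
midpoint = (bounds[0] + bounds[1]) / 2         # most margin on both sides
zero_skew_safe = is_safe(0.0, bounds)
```

In this example zero skew is legal but sits closer to the setup bound than to the hold bound; the midpoint of the range is at -0.5 ns.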



Skew Scheduling for Design Robustness

[Figure: three flip-flops in series with path delays of 2 ns and 6 ns; T = 6 ns]

Clock arrival schedule “0 0 0”: at verge of violation

Clock arrival schedule “2 0 2”: more safety margin

Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on the edge



Potential Advantages of Useful Skew

[Figure: CLK waveforms for 0-skew and U-skew (useful-skew) clocking]

Reduce peak current consumption by distributing the FF switch point in the range of permissible skew

Can exploit extra margin to increase clock frequency or reduce sizing (= power)



Conventional Zero-Skew Flow

[Flow diagram boxes:]

Placement

Synthesis

Extraction & Delay Calculation

Static Timing Analysis

0-Skew Clock Synthesis

Clock Routing

Signal Routing



Useful-Skew Flow

[Flow diagram boxes:]

Existing Placement

Extraction & Delay Calculation

Static Timing Analysis

U-Skew Clock Synthesis

Clock Routing

Signal Routing

Permissible range generation

Initial skew scheduling

Clock tree topology synthesis

Clock net routing

Clock timing verification





Clock Power

Power consumption in clocks due to:
- Clock drivers
- Long interconnections
- Large clock loads – all clocked elements (latches, FF’s) are driven

Different components dominate
- Depending on type of clock network used
- Ex. Grid – huge pre-drivers & wire cap. drown out load cap.



Clock Power Is LARGE

Not only is the clock capacitance large, it switches every cycle!

P = α C Vdd² f
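
A quick numeric illustration of the power formula. The capacitance, supply, and frequency are hypothetical round numbers, and the activity factor of 1 for the clock reflects that it switches every cycle; the 0.1 used for a data net is an assumption for comparison only.

```python
def dynamic_power(alpha, cap_farads, vdd, freq_hz):
    """P = alpha * C * Vdd^2 * f, in watts."""
    return alpha * cap_farads * vdd ** 2 * freq_hz

cap = 1e-9          # 1 nF of switched capacitance (assumed)
vdd = 1.5           # volts (assumed)
freq = 1e9          # 1 GHz (assumed)

clock_power = dynamic_power(1.0, cap, vdd, freq)    # clock toggles every cycle
data_power = dynamic_power(0.1, cap, vdd, freq)     # same C at 10% activity
```

With these numbers the clock dissipates about 2.25 W, ten times what the same capacitance would dissipate at 10% activity.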



Low-Power Clocking

Gated clocks
- Prevent switching in areas of chip not being used
- Easier in static designs

Edge-triggered flops in ARM rather than transparent latches in Alpha
- Reduced load on clock for each latch/flop
- Eliminated spurious power-consuming transitions during latch flow-through



Clock Area

Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area

Routing area is most vital

Top-level metals are used to reduce RC delays
- These levels are precious resources (unscaled)
- Power routing, clock routing, key global signals

Reducing area also reduces wiring capacitance and power

Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing



Clock Slew Rates

To maintain signal integrity and latch performance, minimum slew rates are required

Too slow
- Clock is more susceptible to noise
- Latches are slowed down
- Setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew (ps)]
- More short-circuit power for large clock drivers

Too fast
- Burns too much power
- Overdesigned network
- Enhanced ground bounce

Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target)

1 GHz clock; Trise = Tfall = 100-200ps
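
The rule of thumb and the example setup-time relation from this slide, Tsetup = 200 + 0.33 * Tslew (ps), can be evaluated as follows. This is a sketch; the function names are illustrative.

```python
def clock_edge_budget_ps(freq_hz, low=0.10, high=0.20):
    """Rise/fall time window as 10% to 20% of the clock period, in ps."""
    period_ps = 1e12 / freq_hz
    return low * period_ps, high * period_ps

def setup_time_ps(t_slew_ps):
    """Example slew-dependent setup time quoted on the slide, in ps."""
    return 200 + 0.33 * t_slew_ps

fastest_edge, slowest_edge = clock_edge_budget_ps(1e9)   # 1 GHz clock
setup_fast = setup_time_ps(fastest_edge)
setup_slow = setup_time_ps(slowest_edge)
```

For a 1 GHz clock the edge window is 100 to 200 ps, and under the quoted relation the slower edge costs about 33 ps of extra setup time.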



Example: Alpha 21264

Grid + H-tree approach

Power = 32% of total

Wire usage = 3% of metals 3 & 4

4 major clock quadrants, each with a large driver connected to local grid structures



Alpha 21264 Skew Map

Ref: Compaq, ASP-DAC00



Power vs. Skew

Fundamental design decision
- Meeting skew requirements is easy with unlimited power budget

Wide wires reduce RC product but increase total C

Driver upsizing reduces latency (⇒ reduces skew as well) but increases buffer cap

SOC context: plastic package power limit is 2-3 W



Clock Distribution Trends

Timing
- Clock period dropping fast, skew must follow
- Slew rates must also scale with cycle time
- Jitter – PLL’s get better with CMOS scaling but other sources of noise increase
  - Power supply noise more important
  - Switching-dependent temperature gradients

Materials
- Cu reduces RC slew degradation, potential skew
- Low-k decreases power, improves latency, skew, slews

Power
- Complexity, dynamic logic, pipelining ⇒ more clock sinks
- Larger chips ⇒ bigger clock networks





Gate Timing Characterization

“Extract” exact transistor characteristics from layout
- Transistor width, length, junction area and perimeter
- Local wire length and inter-wire distance

Compute all transistor and wire capacitances

[Figure: gate with inputs A, B and output F driving load capacitance CL]


Cell Timing Characterization

Delay tables generated using a detailed transistor-level circuit simulator SPICE (differential-equations solver)

For a number of different input slews and load capacitances simulate the circuit of the cell

- Propagation time (50% Vdd at input to 50% at output)
- Output slew (10% Vdd at output to 90% Vdd at output)

[Figure: input and output voltage waveforms vs. time, marking tpd and tslew relative to Vdd]


Non-linear effects reflected in tables

[Figure: delay at the gate and output slew as functions of input slew and output capacitance; the resulting waveform shows the intrinsic delay and output slew]

DG = f (CL, Sin) and Sout = f (CL, Sin)
- Non-linear

Interpolate between table entries

Interpolation error is usually below 10% of SPICE
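
Interpolating between table entries is commonly done bilinearly over the two axes (input slew and load capacitance). The sketch below assumes that scheme; the axis values and delay table are hypothetical, not taken from any real library.

```python
from bisect import bisect_right

def interpolate(table, slews, loads, slew, load):
    """Bilinear interpolation of table[i][j] indexed by (slews[i], loads[j])."""
    def bracket(axis, x):
        # Index of the lower grid point, clamped so i+1 is always valid
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        frac = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, frac

    i, u = bracket(slews, slew)
    j, v = bracket(loads, load)
    return ((1 - u) * (1 - v) * table[i][j]
            + (1 - u) * v * table[i][j + 1]
            + u * (1 - v) * table[i + 1][j]
            + u * v * table[i + 1][j + 1])

# Hypothetical characterization data
slews = [0.0, 2.0, 4.0]                 # input slew axis
loads = [0.0, 10.0, 20.0]               # output capacitance axis
delay = [[1.0, 2.0, 4.0],               # delay[i][j] at (slews[i], loads[j])
         [2.0, 3.0, 5.0],
         [4.0, 5.0, 8.0]]

d_mid = interpolate(delay, slews, loads, 1.0, 5.0)      # inside a cell
d_grid = interpolate(delay, slews, loads, 2.0, 10.0)    # exactly on a grid point
d_corner = interpolate(delay, slews, loads, 4.0, 20.0)  # last table corner
```

Queries on grid points return the table entry itself; queries outside the table range are extrapolated linearly from the nearest cell, which a production flow would normally flag.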


Conservatism of Gate Delay Modeling

True gate delay depends on input arrival time patterns

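STA will assume that only 1 input is switching
- Will use worst slope among several inputs

[Figure: two-input gate (inputs A, B, output F, load CL) with waveforms vs. time; tpd when A and B switch together compared with tpd when only A switches]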
