Analysis and Characterization of Random Skew
and Jitter in a Novel Clock Network
by
Vadim Gutnik
Bachelor of Science, Electrical Engineering and Computer Science, and Materials Science and Metals Engineering,
University of California at Berkeley (1994)
Master of Science, Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1996)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2000
© Massachusetts Institute of Technology 2000. All rights reserved.
Author: Department of Electrical Engineering and Computer Science
March 3, 2000
Certified by: Anantha Chandrakasan
Associate Professor of Electrical Engineering
Thesis Supervisor
Accepted by: Arthur C. Smith
Chairman, Departmental Committee on Graduate Students
Analysis and Characterization of Random Skew and Jitter in
a Novel Clock Network
by
Vadim Gutnik
Submitted to the Department of Electrical Engineering and Computer Science
on March 3, 2000, in partial fulfillment of the
requirements for the degree of
Doctor of Science in Electrical Engineering
Abstract
System clock uncertainty, in the form of random skew and jitter, is beginning to affect performance of large microprocessors significantly. Process and environmental variations and inter-signal coupling on a chip contribute significant delay variations in long clock lines, and these variations are predicted to make the now widely-used clock tree distribution untenable. Distributed clock generation may allow clock networks to continue scaling with advances in semiconductor processing technology.
A novel clock network composed of multiple synchronized phase-locked loops is analyzed, implemented, and tested. Undesirable large-signal stable (modelocked) states dictate the transfer characteristic of the phase detectors; a matrix formulation of the linearized system allows direct calculation of system poles for any desired oscillator configuration. The circuits were fabricated in CMOS, and two implementations of the system are presented: a 4-oscillator proof-of-concept 400 MHz network, and a 16-oscillator, 1.3 GHz network.
A flash time-to-digital converter is presented that exploits parallelism to get precise time measurements with resolution much smaller than a single gate delay. Unfortunately, an unrelated failure precluded measurements on the 16-oscillator chip where the measurement system was integrated, but the principle is shown to be valid on an independent test chip.
Thesis Supervisor: Anantha Chandrakasan
Title: Associate Professor of Electrical Engineering
Acknowledgments
I would like to thank my thesis advisor, Professor Chandrakasan for innumerable
technical discussions, for always being available and approachable, and for making
sure I could concentrate on thesis work. Thanks also to my thesis readers Professors
Boning and Verghese for their help in organizing the thesis.
Thanks go to my research group as well; my research would have been much less
enjoyable and much less successful were it not for their advice, help, and camaraderie.
And of course, thanks to my family for putting up with me through an awful lot
of years of school.
Contents
1 Clocks in Digital Systems
  1.1 Definitions
  1.2 Thesis Scope
2 Models of Clock Network Timing Variations
  2.1 Previous Work: Clocks
    2.1.1 Equipotential Clocking
    2.1.2 H-Trees and Generalized Trees
    2.1.3 Active Skew Management
  2.2 Previous Work: Variations
    2.2.1 Layout-Dependent Processing Variations
    2.2.2 Wafer-Scale and Random Physical Variations
    2.2.3 Circuit Implications of Mismatch
    2.2.4 Abstract Variation Models
  2.3 Categories of Mismatch
  2.4 Clock Architecture Comparison
    2.4.1 Clock metric
    2.4.2 Tree
    2.4.3 Grid
    2.4.4 Active Feedback
3 Synchronization and Stability
  3.1 Previous Work: Synchronization
    3.1.1 Local Data Synchronization
    3.1.2 Local Clock Synchronization
  3.2 Proposed Clock Architecture
  3.3 Small Signal
    3.3.1 General Derivation
    3.3.2 Examples
  3.4 Large Signal: Mode Locking
4 Implementation and Testing Distributed Clocks
  4.1 4 Oscillator Chip
    4.1.1 Oscillator
    4.1.2 Phase Detector
    4.1.3 Loop Filter
  4.2 16 Oscillator Chip
    4.2.1 Oscillator
    4.2.2 Phase Detector
    4.2.3 Loop Filter
5 On-Chip Measurement of Clock Performance
  5.1 Introduction and Motivation
  5.2 Time-to-Digital Converter Fundamentals
  5.3 SOTDC Yield
  5.4 Calibration of a SOTDC
  5.5 Circuit and Results
6 Conclusions
  6.1 Summary and Contributions
  6.2 Future Work
    6.2.1 Testing and measurement
    6.2.2 Unconventional Clocks
A Full Schematics
  A.1 4 oscillator chip
  A.2 16 oscillator chip
List of Figures
1-1 2 bit synchronous counter
1-2 Timing diagram for 3-counter
1-4 Relationship of clock offset, skew, and jitter
1-3 Two paths in a clock network
2-1 Alpha clock grid evolution
2-2 Four-level H-tree
2-3 Zero-skew balanced tree
2-4 Digital active deskewing
2-5 Skew caused by finite rise time
2-6 Independent balancing of NFETs and PFETs
2-7 Example H-tree
2-8 Schematic model of capacitive coupling
2-9 Clock tree tradeoffs
2-10 Grid distribution block schematic
2-11 Model circuit for shorted grid drivers
2-12 Power vs. skew for a grid
2-13 Simulated edge in a grid with skew to the drivers
2-14 Short circuit power in a grid vs. input tree skew
2-15 Low-skew wire with DLL
2-16 Matching tree leaves with a DLL
2-17 Matching tree leaves with two DLLs
2-18 Matching tree leaves with a two DLLs which requires delay cell matching
2-19 DLL architecture
2-20 Multi-input delay cell DLL architecture
2-21 Tile number optimization
2-22 A variable delay element and phase comparator can be configured into a DLL or a PLL
3-1 Mode-locking example
3-2 Distributed clocking network
3-3 Standard phase-locked loop
3-4 Linear system model of a standard phase-locked loop
3-5 Multi-oscillator phase-locked loop
3-6 Linear system model of a multi-oscillator phase-locked loop
3-7 PLL loop gain Bode plots
3-8 Root locus for single-oscillator PLL with gain error
3-9 Asymmetrical one-dimensional PLL array
3-10 Symmetrical one-dimensional PLL array
3-11 Root locus for a one-dimensional array of PLLs
3-12 Comparison of noise responses for symmetrical and asymmetrical networks
3-13 Root locus for a two-dimensional array of PLLs
3-14 Mode-locking example
4-1 Micrograph of the 4 oscillator, 350 MHz chip
4-3 Relaxation oscillator layout
4-2 Relaxation oscillator schematic
4-4 Phase detector schematic
4-5 Phase detector timing waveforms
4-6 Sampled phase detector half-circuit transfer function
4-7 Sampled phase detector full transfer function
4-8 Loop filter schematic
4-9 Micrograph of the 16 oscillator, 1.3 GHz chip
4-10 Ring oscillator schematic
4-11 Phase detector
4-12 Simulated phase transfer curve
4-13 Locking behavior of the PLL array
4-14 Loop filter schematic
5-1 Time to voltage converter operation
5-2 Phase vernier
5-3 Arbiter definitions
5-4 TDC structure. "D" marks delay elements, and "A" the arbiters
5-5 X(i) vs. i
5-6 SOTDC yield
5-7 Symmetric CMOS arbiter
5-8 Measured xi, with expected curve for 18ps standard deviation of t
5-9 Measured xi vs. xi derived via Eq. 5.9, for σ = 0.35ps
5-10 Measurement chip micrograph
A1.1 Top-level (chip core)
A1.2 Node
A1.3 Relaxation oscillator
A1.4 Compensation amplifier and summer
A1.5 Differential to single-ended amplifier
A1.6 Sampled phase comparator
A1.7 Phase comparator core
A2.1 Top-level (chip core)
A2.2 Individual tile
A2.3 Node
A2.4 Compensation amplifier
A2.5 Ring oscillator
A2.6 Differential inverter for the ring oscillator
A2.7 Clock divider
A2.8 Jitter measurement block
A2.9 Pulse generator
A2.10 DRAM block
A2.11 DRAM write token
A2.12 DRAM bitslice
A2.13 Phase measurement arbiter
A2.14 DRAM data 3-state driver
A2.15 DRAM output data serializer
Chapter 1
Clocks in Digital Systems
The vast majority of integrated circuits manufactured today are synchronous digital
systems. The performance of these systems, measured in terms of computation per
time, is readily increased by increasing the clock rate. The bulk of the effort in design
of high speed systems is expended on the design of systems that operate correctly
when synchronized by ever faster clocks. An increasing amount of effort has been
made in designing the clocks themselves so that imperfections in the clock do not
unnecessarily limit system performance. This chapter introduces terminology and
constraints relevant to clock performance in digital systems.
1.1 Definitions
Digital devices can be modeled as finite state machines: a set of registers holds the
current state, combinational logic computes the next state, and at specific instants
the registers are loaded with the newly computed state. In the majority of digital
systems, where the registers are designed to be loaded at the same time, a periodic
synchronization signal, or clock, must be distributed throughout the system [1]. The
clock distribution network of a modern microprocessor uses a significant fraction of
the total chip power and has substantial impact on the overall performance of the
system. For example, the 72 watt, 600 MHz Alpha processor [2] dissipates 16 watts
in the global clock distribution, and another 23 watts in the local clocks: more than
15
D Q D Q
RO Ri QClockO QO Clock1
Figure 1-1: 2 bit synchronous counter
QO/D1
Q1
DO
<QIQO> 0 000 01 00 01 10 00
ClockO
Clocki
1 2 3 4 5 6 7 8 Time
Figure 1-2: Timing diagram for 3-counter
half the power goes to driving the clock net!
While clock design issues can be subtle, the main performance criteria for the
system clock are straightforward. Consider a simple example. Fig. 1-1 shows a
simple digital circuit: a synchronous counter that counts to 3. The associated timing
waveforms are shown in Fig. 1-2. For the first several cycles shown, the circuit works
correctly, and counts 00, 01, 10, 00. However, for a number of reasons described
below, actual clock signals are neither perfectly periodic nor perfectly simultaneous.
This timing imperfection can lead to two types of timing errors.
The first type of timing error occurs when clock0 arrives early at cycle 4: in this
case, the data from Q1 does not have time to propagate through the NOR gate, so the
wrong value is latched into R0. Formally, this may be called a "setup time violation,"
because the correct value was not present at the input to a latch sufficiently before a
clock edge. A setup violation occurs if
$T_{i,n} + t_{CQ} + t_{logic} > T_{j,n+1} - t_{setup}$ (1.1)
where $T_{i,n}$ is the time of arrival of the $n$th edge at the $i$th flip flop, $t_{CQ}$ is the clock-to-Q
time for the $i$th flip flop, $t_{logic}$ is the worst case (longest) logic delay between the $i$th
and $j$th flip flops, and $t_{setup}$ is the setup time for the $j$th flip flop. Note that $i$ could
equal $j$.
The second type of timing failure happens when clock1 arrives too late at cycle 6:
the 0 that R0 latches on this cycle propagates to the input of R1 and is latched instead
of the correct value, formally because of a hold time violation on R1. Colloquially,
the value is said to have "raced through" latch R1. A hold violation occurs if
$T_{i,n} + t_{CQ} + \underline{t}_{logic} < T_{j,n} + t_{hold}$ (1.2)
where $t_{hold}$ is the hold time for the $j$th register, and $\underline{t}_{logic}$ is the worst case (shortest)
logic delay.
Setup and hold violations are different in a number of ways. Setup violations occur
because some instantaneous clock period is too short, and can be averted by lowering
the nominal clock frequency. Because setup violations involve successive clock edges,
possibly at the same register, they are typically considered to be a result of temporal
clock variation. Hold violations, on the other hand, involve arrivals of the same edge
at multiple registers; they result from spatial clock variation. Slowing down the clock
does nothing to avert hold violations; instead, the effective hold time of the offending
registers must be increased, often by adding pairs of inverters after the register.
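The two conditions can be checked numerically. The following is a minimal sketch, assuming hypothetical timing values in nanoseconds; the function names and numbers are illustrative and do not come from the counter of Fig. 1-1. It evaluates the setup inequality of Eq. 1.1 and the hold inequality of Eq. 1.2 directly.

```python
def setup_violation(t_launch, t_capture_next, t_cq, t_logic_max, t_setup):
    """Eq. 1.1: data launched at edge n arrives too late for edge n+1."""
    return t_launch + t_cq + t_logic_max > t_capture_next - t_setup


def hold_violation(t_launch, t_capture, t_cq, t_logic_min, t_hold):
    """Eq. 1.2: data launched at edge n races through on the same edge n."""
    return t_launch + t_cq + t_logic_min < t_capture + t_hold


# Hypothetical values (ns), chosen only for illustration.
T_PERIOD = 2.0
T_CQ, T_SETUP, T_HOLD = 0.2, 0.1, 0.1
T_LOGIC_MAX, T_LOGIC_MIN = 1.5, 0.05

# Ideal clock: the capturing edge arrives exactly one period later.
nominal_setup = setup_violation(0.0, T_PERIOD, T_CQ, T_LOGIC_MAX, T_SETUP)

# The capturing edge arrives 0.3 ns early, shortening the effective period.
early_setup = setup_violation(0.0, T_PERIOD - 0.3, T_CQ, T_LOGIC_MAX, T_SETUP)

# Same edge at both registers: no hold problem when arrivals coincide.
nominal_hold = hold_violation(0.0, 0.0, T_CQ, T_LOGIC_MIN, T_HOLD)

# The capturing register sees the same edge 0.2 ns late.
late_hold = hold_violation(0.0, 0.2, T_CQ, T_LOGIC_MIN, T_HOLD)

print(nominal_setup, early_setup, nominal_hold, late_hold)
```

Note that the clock period appears only in the setup check (through the arrival time of edge n+1), which is why slowing the clock cannot fix the hold case.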
Traditionally, clock networks have been characterized in terms of skew, the spatial
variations in arrival times, or $T_s(i,j) = T_i - T_j$; and jitter, the temporal variation in
clock period at a node, $T_J(n) = T_{i,n+1} - T_{i,n} - T_{period}$. Rewriting Eq. 1.1 and Eq. 1.2
in terms of skew and jitter gives
$T_s(i,j) - T_J(n) > T_{period} - t_{setup} - t_{CQ} - t_{logic}$ (1.3)
$T_s(j,i) > t_{CQ} + \underline{t}_{logic} - t_{hold}$ (1.4)
Figure 1-3: Two paths in a clock network
Figure 1-4: Relationship of clock offset, skew, and jitter. (a) Definition of clock time offset. (b) Time offset plot for a single clock. (c) Conventional view of skew and jitter. (d) Skew and jitter in modern clocks are comingled.
In older clock networks, the clock source was the source
for the majority of jitter, so jitter was the same for all
the clock nodes. Referring to Fig. 1-3, the assumption
was that the delay to each of paths A and B is a constant,
and the only source of time-dependent noise is the clock
source. Hence, if the clock arrives at node A one nanosecond late, it would also arrive
at node B one nanosecond too late. Dually, skew was
caused by static path-length mismatches to the clock loads, so skew was constant
from cycle to cycle. If on one clock cycle the clock at B lagged the clock at A by one
nanosecond, it would lag by one nanosecond at the next clock cycle as well. If we
plot the time offset from an ideal clock, defined in Fig. 1-4(a), vs. time for a single
clock, we'd expect to see something like Fig. 1-4(b). The traditional model suggests
that two on-chip clocks behave as shown in Fig. 1-4(c). In modern clock systems,
however, delay from the clock source to the loads dominates both static and dynamic
mismatches, so arrival times at different nodes are not necessarily correlated. If the
clock arrival time at node A is not correlated with the arrival time at node B, the
jitter at B need not match the jitter at A, and the skew between A and B becomes
time-varying, as shown in Fig. 1-4(d). This means that the skew and jitter terms
in Eq. 1.3 and Eq. 1.4 would have to be fully indexed for sample time and location.
In short, there is little reason to treat skew and jitter separately in modern clock
networks.
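The contrast between the two models can be illustrated with a few made-up arrival times. This is a sketch under stated assumptions: the arrival-time tables, node names, and noise values below are hypothetical, and the helper functions simply apply the skew and jitter definitions given earlier in this section.

```python
def skew(arrivals, i, j, n):
    """Skew between nodes i and j at edge n: T_i - T_j."""
    return arrivals[i][n] - arrivals[j][n]


def jitter(arrivals, i, n, t_period):
    """Jitter at node i: deviation of the n-th period from nominal."""
    return arrivals[i][n + 1] - arrivals[i][n] - t_period


T = 1.0  # nominal period, arbitrary units (hypothetical)
EDGES = range(4)

# Traditional model: all jitter comes from the source and is common to
# both nodes; path B is longer than path A by a fixed 0.05.
source_noise = [0.00, 0.03, -0.02, 0.01]
old = {
    "A": [n * T + source_noise[n] for n in EDGES],
    "B": [n * T + source_noise[n] + 0.05 for n in EDGES],
}

# Modern model: the delay of path B itself fluctuates from edge to edge,
# uncorrelated with path A.
path_noise_b = [0.02, -0.04, 0.03, 0.00]
modern = {
    "A": [n * T for n in EDGES],
    "B": [n * T + 0.05 + path_noise_b[n] for n in EDGES],
}

old_skews = [skew(old, "B", "A", n) for n in EDGES]
modern_skews = [skew(modern, "B", "A", n) for n in EDGES]

print("traditional skew per edge:", old_skews)
print("modern skew per edge:     ", modern_skews)
```

In the traditional table the skew is the same at every edge and both nodes share the same jitter; in the modern table the skew changes with the edge index, so it can no longer be separated from jitter.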
For this reason, this thesis uses "clock skew" and "clock uncertainty" interchange-
ably to mean the difference between the actual clock arrival time and the nominal
arrival time, whether the reference is established by a spatially or temporally distinct
clock edge. Aside from avoiding semantic distinction between skew and jitter, this
usage allows us to consider skew and jitter contributions of individual clock paths,
rather than pairs of paths. (This is an exact clock network analog of analyzing half-
circuits in amplifier design.)
Just as there are distinctions between types of timing errors (hold vs. setup
violations), and between types of clock uncertainty (skew vs. jitter), there are sev-
eral divisions in the sources of clock uncertainty. First, errors can be divided into
systematic or random. Systematic errors are due to layout-dependent parameter
variations, length variations in the lines, load capacitance mismatches, etc. That is,
any variations that are the same from chip to chip. In principle, such errors could
be modeled and corrected at design time given sufficiently good simulators. Failing
that, systematic errors can be deduced from measurements over a set of chips, and the
design adjusted to compensate. Random errors are due to manufacturing variations,
inter-signal coupling (which is predictable but often too hard to model correctly),
thermal- and slow supply voltage-gradients, power-supply-noise-induced delay varia-
tions in buffers, and to some extent, thermal noise. It is impossible to eliminate some
sources of random clock uncertainty, but it is possible to model some of the skew and
jitter sources, and to design in a way that minimizes their effects.
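Deducing systematic errors from measurements over a set of chips can be sketched as a per-node average. The example below is only an illustration under assumptions: the measurement table is invented, and taking the mean across chips as the systematic component is one simple estimator, not necessarily the procedure used in this work.

```python
from statistics import mean


def split_systematic_random(measured):
    """Split measured[chip][node] offsets into a per-node systematic part
    (the mean over chips) and a per-chip random residual."""
    n_nodes = len(measured[0])
    systematic = [mean(chip[k] for chip in measured) for k in range(n_nodes)]
    residual = [
        [chip[k] - systematic[k] for k in range(n_nodes)] for chip in measured
    ]
    return systematic, residual


# Hypothetical clock arrival offsets in picoseconds: 3 chips, 3 clock nodes.
measured = [
    [10.0, -5.0, 2.0],
    [12.0, -7.0, 1.0],
    [8.0, -3.0, 3.0],
]

systematic, residual = split_systematic_random(measured)
print("systematic offset per node:", systematic)
print("random residual per chip:  ", residual)
```

The systematic part could in principle be compensated in the design; the residual is what remains as random skew from chip to chip.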
Mismatch may also be characterized as static or time-varying. In practice, there
is a continuum between changes that are slower than the time constant of interest
and those that are faster. For example, temperature variations on a chip vary on a
millisecond time scale. A clock network tuned by a one-time calibration or trimming
would be vulnerable to time-varying mismatch due to varying thermal gradients. On
the other hand, to a feedback network with a bandwidth of several megahertz, thermal
changes appear essentially static. Note the caveat that time-varying signals can cause
static errors as long as they are periodic with the clock. For example, the clock net is
usually by far the largest single net on the chip, and simultaneous transitions on the
clock drivers induce noise on the power supply. However, this high speed effect does
not contribute to time-varying mismatch because it is the same on every clock cycle,
and hence affects each rising clock edge the same way. Of course, this power supply
glitch may still cause static mismatch if it is not the same throughout the chip.
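The remark about feedback bandwidth can be made concrete with a rough estimate. The sketch below assumes a first-order tracking loop, whose error response to a disturbance at frequency f has magnitude f / sqrt(f^2 + f_c^2) for a loop bandwidth f_c; this loop model and the example frequencies are assumptions made for illustration and are not taken from the networks analyzed later.

```python
import math


def residual_fraction(f_disturbance_hz, loop_bandwidth_hz):
    """Fraction of a sinusoidal disturbance left uncorrected by a
    first-order tracking loop (magnitude of s / (s + w_c))."""
    return f_disturbance_hz / math.hypot(f_disturbance_hz, loop_bandwidth_hz)


LOOP_BANDWIDTH_HZ = 5e6  # "several megahertz" (hypothetical value)

# Thermal drift on a millisecond time scale, roughly 1 kHz.
thermal = residual_fraction(1e3, LOOP_BANDWIDTH_HZ)

# A disturbance well above the loop bandwidth, e.g. 100 MHz supply noise.
fast_noise = residual_fraction(100e6, LOOP_BANDWIDTH_HZ)

print(f"thermal residual:    {thermal:.1e}")
print(f"fast-noise residual: {fast_noise:.3f}")
```

Under these assumptions the loop removes all but a small fraction of the thermal variation, so it looks static to the loop, while noise far above the loop bandwidth passes through almost unchanged. A one-time calibration corresponds to no tracking at all, leaving the full thermal variation as mismatch.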
Finally, random skew can be subdivided into spatially correlated and spatially
uncorrelated mismatch. (Note the similarity to static and time-varying mismatch,
which could be restated as temporally correlated and uncorrelated). Again, the dis-
tinction is not absolute. Different physical parameters will have different correlation
distances; hence it is possible for a single pair of wires to be correlated in one respect
but not in the other. Table 1.1 shows the categories and several examples of the
sources of each type of random mismatch.
              correlated                                uncorrelated
static        wafer-scale etching, polishing            MOSFET channel doping
              and lithography gradients
time-varying  temperature and power-supply              value-dependent load capacitance,
              gradients                                 inter-signal coupling
Table 1.1: Categorization and example sources of non-systematic mismatch
1.2 Thesis Scope
As argued in Chapter 2, signal delay across a microprocessor chip measured in clock
cycles has been increasing as technology scales to smaller feature sizes, and is now
comparable to one clock cycle. Because clock uncertainty scales with path delay,
relatively longer delays increase the fraction of clock uncertainty per clock cycle; this
trend could severely limit performance if not corrected. The overall goal of this thesis
was to examine clock performance at both the circuit and the architectural level to
find ways to design clocks in an environment where performance is limited by random
physical mismatches and noise.
This thesis is split into three parts. The first part, Chapter 2, analyzes how
sources of skew and jitter affect different clock architectures. The nonintuitive result
is that a tree architecture is not well suited to systems where cycle time is shorter
than cross-chip path delay, and that distributed clock networks become increasingly
attractive.
This analysis leads into the second part, which proposes a novel clock network
composed of multiple synchronized phase-locked loops. Chapter 3 covers large- and
small-signal stability of the system. Undesirable large-signal stable (modelocked)
states dictate the transfer characteristic of the phase detectors; a matrix formula-
tion of the linearized system allows direct calculation of system poles for any desired
oscillator configuration. Chapter 4 deals with circuit implementation in CMOS, presenting
two implementations of the system: a 4-oscillator proof-of-concept 400 MHz
network, and a 16-oscillator, 1.3 GHz network.
The last part of the thesis, Chapter 5, examines ways to measure performance
of a high-speed clock. As clock performance is optimized for fast operation, it be-
comes increasingly difficult to measure clock jitter. A flash time-to-digital converter
is presented that exploits parallelism to get precise time measurements with reso-
lution much smaller than a single gate delay. Unfortunately, an unrelated failure
precluded measurements on the 16-oscillator chip where the measurement system
was integrated, but the principle is shown to be valid on an independent test chip.
Chapter 2
Models of Clock Network Timing
Variations
Unpredictable parameter variations and noise are becoming dominant concerns for
clocks. Clock networks have traditionally been optimized for minimum design time
(gridded clocks) or power and wireability (trees). Process variations, on the other
hand, have been studied extensively in terms of matching limitations on analog cir-
cuits, and to some extent in individual clock architectures. This chapter considers
how clock uncertainty depends on both architecture and imposed mismatch.
2.1 Previous Work: Clocks
Consider first the taxonomy and evolution of clock networks. Note that a great deal
of work nominally about "clocking" has gone into finding the exact sequence of timing
signals needed to clock a microprocessor at the fastest possible speed [3, 4, 5, 6, 7, 8, 9],
and a number of CAD tools have been developed to find and verify such timing
schedules [10, 11, 12]. However, the analysis of what timing signals are needed is
independent of how the signals are distributed. Unpredictable variations are no more
tolerated in scheduled-skew designs than in ideally zero-skew designs. The remaining
discussion will assume that the optimal clocking schedule has already been determined
and that what remains is implementation.
2.1.1 Equipotential Clocking
Conceptually the simplest clocking strategy is to distribute a global clock to the
chip as a regular, though heavily loaded, signal line. This is known as equipotential
clocking because the implicit assumption is that resistance in the wires is negligible
and the entire net is always at a uniform voltage. For small nets with relatively
few clock loads and a slow clock, this works well. For large chips and fast clocks,
equipotential clocking has the advantage that most of the clock distribution network
can be designed independently of the logic.
In fact, there is some RC time constant (T) associated with the wires of such
a clock net. When T is small compared to the clock period, the RC delays are
unimportant. As feature sizes scale down, however, T increases and clock rates go up,
so the net no longer appears as a lumped capacitance and acts instead as a lossy delay
line. Propagation delays along the clock net cause skew. Because T scales with the
size of the net, equipotential clocking can still be used for subsections of a chip [13],
and implicitly at the lowest level in hierarchical [14] and distributed [15, 16] designs.
The tour de force of equipotential clocking was the first DEC Alpha chip [17]
(Fig. 2-1(a)). In that design, a single, segmented buffer placed lengthwise in the
center of the die drives a grid made using two upper metal layers (i.e., the thickest
metal available, to lower T). The worst-case time difference between clock arrivals
was 200 picoseconds, and this was sufficient for a 200 MHz clock.
The next two versions, the 300 MHz Alpha and its strikingly similar 433 MHz
cousin, [18, 19] both used two drivers for the entire grid (Fig. 2-1(b)). Why? With
higher clock speeds, the RC delay from the center of the chip to the edges becomes
significant; the two drivers effectively both drive halves of the chip, so the delays are
shorter. The 600 MHz Alpha [2] (Fig. 2-1(c)) followed this trend: it has four top-level
buffers, because with the higher clock speeds and wire delays, ever smaller sections
of the chip can be modeled as equipotentials.
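The benefit of driving smaller sections follows from the quadratic length dependence of RC wire delay. The sketch below uses the Elmore delay of a uniform distributed RC line, r c L^2 / 2, as a stand-in for the grid; treating the grid as a single line, and the per-millimeter resistance and capacitance values, are simplifying assumptions made only for illustration.

```python
def distributed_rc_delay(r_per_mm, c_per_mm, length_mm):
    """Elmore delay (seconds) to the far end of a uniform distributed RC line."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2


# Hypothetical wire parameters for a thick upper-level metal line.
R_PER_MM = 20.0      # ohms per millimeter
C_PER_MM = 0.2e-12   # farads per millimeter

full_span = distributed_rc_delay(R_PER_MM, C_PER_MM, 15.0)  # one driver
half_span = distributed_rc_delay(R_PER_MM, C_PER_MM, 7.5)   # two drivers

print(f"one driver:  {full_span * 1e12:.1f} ps")
print(f"two drivers: {half_span * 1e12:.1f} ps")
```

Halving the distance that each driver must cover cuts the wire delay by a factor of four in this model, which is consistent with the move from one driver to two and then four as clock rates rose.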
Figure 2-1: Evolution of Alpha's grid based clock network: (a) one-driver grid, (b) two-driver grid, (c) windowpane grid. In all cases, large buffers drive a regular mesh of metal2 and metal3 wires.
2.1.2 H-Trees and Generalized Trees
If it were possible to lay out the clock net so that all points where the clock is used
are equidistant from the clock driver, the wire delay would not cause skew. This idea
led to H-trees (Fig. 2-2) [20, 21, 14].
By symmetry, the distance from the center of the net (the root of the tree), to each
of the ends (leaves), is the same. Therefore, regardless of T, signals should arrive at
the leaves at the same time. The clock can then be distributed to a smaller
(approximately equipotential) net around each leaf. The size of this equipotential
region around each leaf shrinks as the depth of the tree increases, so deeper trees are
needed for faster clock speeds.
Figure 2-2: Four level H-Tree. Paths from the center to the leaves are geometrically the same.
The maximum clock frequency is limited by dispersion of pulses on the RC wires, so the basic
H-tree can be improved immediately by symmetrically inserting buffers along the
branches to regenerate the signal [21, 22, 15, 14]. Clock trees are insensitive to global
process and environmental variations; skew is still zero if the resistance of the wires
is higher than expected, say, or if the input threshold to all the buffers changes. Of
course, H-trees are affected by intra-die variations [23, 24]. Anything that causes
similar paths on the different parts of the chip to have different delays (e.g., local
line width variations, temperature gradients, varying threshold voltages, etc.) causes
skew.
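The symmetry argument for H-trees can be checked by construction. The following sketch builds an idealized H-tree recursively and confirms that every root-to-leaf wire length is identical; the function name, the convention of counting each binary split as one level, and the unit dimensions are assumptions made for this illustration.

```python
def h_tree_leaves(x, y, half_span, levels, horizontal=True, path_length=0.0):
    """Return [((x, y), wire_length_from_root), ...] for an idealized H-tree.

    Each level splits the wire in two, alternating between horizontal and
    vertical branches; the branch length halves after every vertical split.
    """
    if levels == 0:
        return [((x, y), path_length)]
    dx, dy = (half_span, 0.0) if horizontal else (0.0, half_span)
    next_span = half_span if horizontal else half_span / 2.0
    leaves = []
    for sign in (-1.0, 1.0):
        leaves += h_tree_leaves(
            x + sign * dx,
            y + sign * dy,
            next_span,
            levels - 1,
            not horizontal,
            path_length + half_span,
        )
    return leaves


leaves = h_tree_leaves(0.0, 0.0, 1.0, levels=4)
positions = {position for position, _ in leaves}
path_lengths = {length for _, length in leaves}

print("number of leaves:", len(leaves))
print("distinct root-to-leaf wire lengths:", path_lengths)
```

With nominally identical wires the leaves therefore see identical delays; the intra-die variations discussed above break this equality without changing the geometry.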
H-trees are most useful when clocking regular arrays, because the leaves form a
regular grid. What can be done if the clock loading is not so geometrically regular?
The vital feature of H-trees is that the distance from the root to all the leaves is the
same. Finding a balanced tree for an arbitrary set of points is known as the zero-
skew tree problem. In general, finding a zero-skew tree with minimum total length
is exceptionally hard; however, a number of heuristic algorithms have been proposed
[25, 26, 27, 28, 29]. Closely related to the zero-skew problem is the bounded skew tree
problem, where a small amount of path difference is allowed to help minimize the
total wire length, and therefore minimize area and power dissipation [30].
Figure 2-3: Zero-skew balanced tree
All of these tree approaches are bottom-up algorithms that start by connecting
groups of nodes into a tree and then merging trees until only one net remains. They
are distinguished by exactly how they merge trees, behavior in pathological cases,
how the number of computations scales with the number of clock loads, how they
route around obstructions, etc. The result is essentially the same, however: they all
produce an irregular clock tree that ties together a specified set of clock loads such
that the distance from the root to the leaves is approximately equal (Fig. 2-3). Most
modern processors use some version of such trees to distribute the clock [31, 32, 33, 34].
Those that do not use explicit trees still simulate and balance path delays from the
clock source to all the loads, so they act essentially as generalized clock trees. There the
matching is generally less precise, because the delay to the leaves, while nominally
identical, is composed of the delays of a variable number of gates and length of wire,
so even global variations in a particular parameter may cause skew.
Figure 2-4: Digital active deskewing
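The merge step shared by the bottom-up tree algorithms can be sketched under a simple path-length delay model, in which delay is taken to be proportional to wire length. This is an illustrative assumption and not any one of the cited algorithms; the function name and return convention are hypothetical. Given two subtrees with root-to-leaf delays `delay_a` and `delay_b` whose roots are `distance` apart, the tap point is placed so that both sides see the same total delay, and the wire is lengthened when no such point exists on the direct connection.

```python
def balanced_merge(delay_a, delay_b, distance):
    """Merge two balanced subtrees under a path-length delay model.

    Returns (tap, wire_length, merged_delay), where tap is the distance of
    the new root from the root of subtree A along the connecting wire.
    """
    tap = (distance + delay_b - delay_a) / 2.0
    if 0.0 <= tap <= distance:
        # A balance point exists on the direct wire between the two roots.
        return tap, distance, delay_a + tap
    if tap < 0.0:
        # A is slower than B even across the whole wire: tap at A's root
        # and lengthen the wire to B until the delays match.
        return 0.0, delay_a - delay_b, delay_a
    # Symmetric case: B is slower, so tap at B's root and lengthen toward A.
    return distance, delay_b - delay_a, delay_b


# Two bare sinks 4 units apart: the tap lands at the midpoint.
print(balanced_merge(0.0, 0.0, 4.0))

# Subtree B is already 2 units slower, so the tap moves toward B.
print(balanced_merge(1.0, 3.0, 4.0))

# Subtree A is much slower than B: extra wire must be added on B's side.
print(balanced_merge(6.0, 1.0, 2.0))
```

Applying this step repeatedly, pairing subtrees until one root remains, gives equal path lengths to every leaf; the third case shows why exact balance can cost extra wire, which is the motivation for allowing a small bounded skew.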
2.1.3 Active Skew Management
One approach to measure and cancel out static skew involves splitting the H-tree
into two halves, measuring the relative offset between the two, and applying the
appropriate delay, as shown in Fig. 2-4 [35]. In this structure, the delays and control
signals are digital; this adds a measure of noise immunity, but increases the overhead
power and area. Further, the model does not scale well - there is explicit digital
control to guarantee that the delays do not both continue to increase. Splitting the
tree into more sections allows finer adjustment, but the control overhead increases
rapidly as well.
2.2 Previous Work: Variations
Because the goal of a clock network is to distribute an identical signal to multiple
locations, device and interconnect matching is important. Environmental variables,
such as supply voltage, switching activity and temperature depend on the design of
the chip, and hence are under the control of the designer. Conversely, processing
variables, including film thickness, lateral lengths, resistivity, etc., are defined by the
manufacturing process, and can be treated as imposed constraints [43]. This section
describes some of the approaches to modeling the constraints and their effects on
circuits.
2.2.1 Layout-Dependent Processing Variations
Some manufacturing process steps, most notably etching, chemical-mechanical polishing
(CMP) and lithography, are influenced by topography on a chip. This layout-dependent
processing causes systematic device and interconnect variations [43, 44, 45].
Modeling this variation falls into the realm of statistical metrology; see [46] for a re-
view. This systematic variation need not limit clock performance, however. Design
rules are evolving to ensure layout pattern uniformity. For some effects, it may be
feasible to add a spatially-varying fabrication mask offset, just as masks are made
by adjusting the drawn layout to compensate for lithography and etching biases.
As a last resort, clock performance can be measured and systematic offsets can be
compensated in the design.
2.2.2 Wafer-Scale and Random Physical Variations
Unlike systematic skew, skew caused by random physical variations is unavoidable.
For example, a dominant source of device mismatch over small areas is threshold voltage ($V_T$) variation
due to stochastic distribution of dopants; variation depends only on channel area
[47, 45, 48, 49]. Wafer-scale non-uniformity, while not truly random, varies from chip
to chip. For example, deposited thin films often have a radially-symmetric thickness
profile across a wafer. This results in slants in parameter properties across chips that
depend on position of the chip within a wafer, and hence cannot be compensated on
chip [43].
Figure 2-5: Clock skew caused by finite signal rise time (voltage vs. time, with thresholds Vth min and Vth max and crossing times t0 to t3). t1 - t0 and t3 - t2 is skew due to variable buffer threshold voltages. t3 - t1 and t2 - t0 is due to variable rise time. t3 - t0 shows the worst case combined effect.
2.2.3 Circuit Implications of Mismatch
Processing mismatch translates directly into loss of clock performance. For example,
variations in saturation current or buffer thresholds can both lead to variable clock
arrival times, as shown in Fig. 2-5 [21, 20]. Exact numbers are not easily available,
but one may assume that there could be 10% dynamic variation in VDD across a chip
(which affects the threshold and drive current) and another 5% variation in IDSS
between two distant, though nominally matched, buffers. That leads to an expected
clock skew of 2.5% of the total clock cycle from a single pair of gates! In the current
regime, where the clock skew budget is approximately 10% of the clock period, this
is quite substantial [22, 50, 51]. Attempts to increase the maximum clock speed by
increasing pipelining along an H-tree exacerbate this effect [52].
Because random variations cause substantial skew, there have been a number of
attempts to minimize mismatches at the circuit level. For example, it was noticed that
due to poor matching between nfets and pfets, signal paths which do not match the
nfets and pfets separately may add skew unnecessarily [53]. The canonical example is
shown in Fig. 2-6. On a rising input clock edge, gates N1, P2 and N3 are turned on
in the top chain and N4 and P5 in the bottom chain. Because nfets may be expected
to track nfets better than pfets, and vice versa, the lowest skew is achieved by sizing
the transistors so that $d_{N1} + d_{N3} = d_{N4}$ and $d_{P2} = d_{P5}$, where $d_{N1}$ is the delay
due to transistor N1, etc. The general observation is that matching is best between
similar components. One cannot expect wire delays to match gate delays over all
process corners, for example.
Figure 2-6: Independent balancing of NFETs and PFETs (top chain P1/N1, P2/N2, P3/N3 driving Clock 1; bottom chain P4/N4, P5/N5 driving Clock 2)
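The balancing rule for Fig. 2-6 can be checked numerically. The sketch below is illustrative only: the delay values, the process multipliers, and the function name `chain_skew` are assumptions, not values from the figure. It scales all nfet delays by one factor and all pfet delays by another, and shows that the skew between the two chains vanishes at a skewed process corner only when nfet and pfet delays are balanced separately.

```python
def chain_skew(top_n, top_p, bot_n, bot_p, n_scale=1.0, p_scale=1.0):
    """Rising-edge skew (top minus bottom) between two buffer chains.

    top_n, bot_n: total delay through the nfets that switch in each chain.
    top_p, bot_p: total delay through the pfets that switch in each chain.
    n_scale, p_scale: common process multipliers for all nfet / pfet delays.
    """
    top = n_scale * top_n + p_scale * top_p
    bottom = n_scale * bot_n + p_scale * bot_p
    return top - bottom

# Hypothetical delays in ps. Balanced separately: dN1 + dN3 = dN4, dP2 = dP5.
balanced = dict(top_n=20.0 + 20.0, top_p=30.0, bot_n=40.0, bot_p=30.0)
# Same nominal total (70 ps per chain), but nfets and pfets are not matched separately.
unbalanced = dict(top_n=15.0 + 15.0, top_p=40.0, bot_n=40.0, bot_p=30.0)

# Assumed process corner: nfets 10% slow, pfets 5% fast.
corner = dict(n_scale=1.10, p_scale=0.95)

balanced_skew = chain_skew(**balanced, **corner)      # stays at zero
unbalanced_skew = chain_skew(**unbalanced, **corner)  # nonzero despite equal nominal delay
```

Both configurations have zero nominal skew; only the separately balanced one keeps it when the two device types drift in opposite directions.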
Clock designers have also started to pay attention to wisdom from analog design:
matching is best between similar elements, and matching between identical elements
is improved by making them larger. For example, matching wire delays to gate delays
is likely to lead to random skew. And when matching delays through a clock tree, fast
paths sometimes need to be slowed down. There are two straightforward ways to
accomplish this: make the wires longer or make them wider. Which is better? Wider
wires are preferable because of the diminished influence of edge effects [50, 54, 55].
Consideration of random variations is becoming increasingly important in clock
designs. The solutions tend to be ad hoc, and there has been little work on how well
physically separated components may be expected to match. And most clock trees
are still designed to achieve minimal nominal skew without consideration for how
random variations will affect performance.
2.2.4 Abstract Variation Models
At the opposite extreme from the ad hoc physical models are the abstract
models for skew [15, 56, 42, 57]. The assumption in these models is that skew is caused
by uncorrelated, random variations in the clock distribution network. Unfortunately,
because they are so far removed from implementation, generic statistical models give
somewhat misleading results, for several reasons.
The first is that they are too optimistic about statistical independence of vari-
ations. For example, gates that are near each other are likely to match each other
more so than gates that are physically separated. This means that the sum of the
skews caused by gates in any signal path will have higher variance than would the
sum of skews caused by the same number of gates randomly selected from the chip.
Also, as has been pointed out, not all variations have the same weight in the final
skew: clock trees, for example, are much more sensitive to differences at the root of
the tree than at the leaves [56].
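The effect of correlation on a summed path delay can be illustrated with the standard variance identity for equally correlated variables. This is a sketch under stated assumptions: the per-gate sigma, the correlation value, and the name `path_delay_sigma` are hypothetical and are not taken from the cited models.

```python
import math

def path_delay_sigma(n_gates, sigma_gate, rho):
    """Standard deviation of the summed delay error of n_gates gates.

    Each gate's delay error has standard deviation sigma_gate, and every
    pair of gates has correlation coefficient rho (equicorrelated model).
    """
    variance = n_gates * sigma_gate ** 2 * (1.0 + (n_gates - 1) * rho)
    return math.sqrt(variance)

# Ten gates with 5 ps sigma each (hypothetical numbers).
independent = path_delay_sigma(10, 5.0, 0.0)  # gates picked at random across the chip
correlated = path_delay_sigma(10, 5.0, 0.5)   # neighbouring gates along one signal path
```

With zero correlation the result is the familiar square-root growth; any positive correlation between neighbouring gates raises the variance of the path total above that value.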
Ironically, the second weakness is that general statistical models can be too pes-
simistic as well. For example, an analysis of pulse width down a long line of buffers
suggests that the pulse-width follows a random walk [57]. Thus, it is argued, the
pulse might disappear entirely unless the clock period is sufficiently long. In fact, it
is not particularly hard to add feedback to ensure a 50% duty cycle, which effectively
limits the random walk. In this case and some others, circuit tricks can overcome
apparent stochastic barriers [15].
Fundamentally, the very generality that makes sweeping statistical statements
interesting is their weakness because such bounds do not take into account circuit
or architectural changes that affect network performance. Although they may place
bounds on clock performance, they are necessarily qualitative, and can neither suggest
circuit improvements nor take them into account.
2.3 Categories of Mismatch
All on-chip clock networks rely on device parameter matching. This is a crucial
difference between logic critical paths and clock networks: variation in critical path
delay can be overcome by speeding up the critical path so that the worst-case delay
meets timing constraints [58]. Time-dependent logic delay can be included directly
in the worst-case timing estimates: maximum delay is constrained by Eq. 1.3 and
minimum delay by Eq. 1.4. In contrast, because the clock network itself establishes
the timing, both too-slow and too-fast clocks must be avoided. Physical variations
are often separated into local and global contributions [59]. For the
purposes of clock distribution, time-varying mismatch must be considered explicitly
as jitter (and, if uncorrelated spatially, as contributing to skew). 1
Integrated circuit fabrication processes generally result in wafer-scale gradients
in line width (both metal and polysilicon), thin film thickness (metal wires, gate
oxide, interlayer dielectric) and doping concentration [43]. Manufacturing gradients
have been cited to explain distance-dependent mismatch in transistors [60]. These
variations significantly affect device and interconnect performance. In minimum-size
inverters, for example, Leff variation can lead to 9% delay mismatch [61] between
chips; in a different process 37% variation of ring oscillator speed was reported within
single dies [62]. Clocks depend on matching rather than absolute delays, and are
therefore insensitive to truly global parameter variations. We also make the optimistic
assumption that all systematic variations are compensated. This could be achieved
via modeling (i.e., statistical metrology), or simply testing finished chips if multiple
silicon revisions are to be made.
However, because clock networks span an entire chip, wafer-scale gradients are
noticeable. It is generally accepted that global effects can be ignored for distances
smaller than 100µm, but are noticeable for distances larger than 1mm [47, 60]. Global
environmental variations, specifically in temperature and DC supply voltage variation,
are imposed by design rather than fabrication, but are otherwise similar in effect.
Temperature affects resistivity of the metal, channel mobility, and threshold voltages,
and supply voltage affects saturation currents and hence gate delay [63].
Footnote 1: There is a subtle asymmetry between temporal variation in logic and clock. Slack in Eq. 1.4 cannot be exploited to decrease clock cycle time, while any decrease in clock uncertainty directly lowers the minimum clock period. For this reason, temporal variations of the clock are analyzed explicitly.
Figure 2-7: Example H-tree (distances between matched segments labeled x1 to x7)
Table 2.1: Contributions to skew for an H-tree
Segment   1     2     3     4     5     6     7     Average
x_i       0.1   0.3   0.5   0.5   0.5   0.4   0.25  0.36
The distance between most nominally matched components of a clock distribution
network is comparable to chip size, which is typically 1cm or larger. Fig. 2-7 shows
an example H-tree, and the distances xi, normalized to chip size, between nominally
matched wire segments are tabulated in Table 2.1. Most of the distances are com-
parable to the size of a chip; hence, we may expect that the wafer-scale variations
are dominant and consider inter-chip mismatch data. Still, this brings up a messy
modeling issue.
Delay along a clock wire is a sum of small delays. The delay of each buffer-wire-buffer
segment contributes a small random component. If the segments are
strictly independent (e.g., uncorrelated threshold voltage variations), the variance
along the wire is the sum of individual variances, so the standard deviation of the
resulting offset increases as the square root of the length of the wire. Another model
is that the mismatch is due to a gradient of delays across a chip (perhaps from thin-
film deposition). Because the linear gradient is summed, the mismatch rises with the
square of the wire length. Finally, if the perturbations are each fixed-size or uniformly
distributed (e.g., a higher supply voltage for a section of the chip) , the worst-case
offset increases linearly with wire length.
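The three growth laws described above can be compared side by side. The following is a minimal sketch: the function names and unit-sized per-segment errors are assumptions chosen for illustration, and the gradient case assumes the k-th segment is off by k times a fixed step.

```python
import math

def offset_independent(n_segments, sigma_seg):
    """Std of the summed offset when segment errors are uncorrelated."""
    return sigma_seg * math.sqrt(n_segments)

def offset_gradient(n_segments, step):
    """Summed offset when the k-th segment is off by k*step (linear gradient)."""
    return step * n_segments * (n_segments + 1) / 2

def offset_fixed(n_segments, delta):
    """Worst-case offset when every segment is off by the same delta."""
    return delta * n_segments

# Growth with wire length (number of segments), unit error per segment.
table = [(n, offset_independent(n, 1.0), offset_gradient(n, 1.0), offset_fixed(n, 1.0))
         for n in (1, 4, 16)]
```

For long wires the gradient sum n(n+1)/2 approaches quadratic growth, the fixed-offset case grows linearly, and the independent case grows only as the square root.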
Because gradients dominate over relatively long distances, it would probably be
most accurate to model short nearby wires with independent segments, long distant
wires in terms of gradients, and intermediate wires linearly. However, that obfuscates
the analysis unnecessarily; the key point is that short near wires match better than
long distant wires. For the sake of analysis, we will assume that uncertainty scales
linearly with delay with a mismatch coefficient $a$, as $p(x) - p(0) \approx a\,p(0)$.
This argument can be extended to say that the variability in delay along a path
scales linearly with the delay along the path; that is, that there is a fixed percentage
error in on-chip path delay. We will use this assumption, although there is an important
caveat: the coefficient a depends on the construction of the path. A 1ns delay with a = 0.11
gives more skew (110ps) than a 1.1ns delay with a = 0.09 (99ps). For this reason the
classic line-driver optimization may give suboptimal results if wire mismatch is not
the same as buffer mismatch. However, for the optimal combination, delay variability
will scale linearly with delay.
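The caveat about path construction is simple arithmetic, reproduced here as a small sketch. The helper name `path_skew_ps` is an assumption; the delays and coefficients are the ones quoted in the text.

```python
def path_skew_ps(delay_ns, a):
    """Skew in ps for a path with the given delay (ns) and mismatch coefficient a."""
    return delay_ns * 1000.0 * a

shorter_worse_matched = path_skew_ps(1.0, 0.11)  # about 110 ps
longer_better_matched = path_skew_ps(1.1, 0.09)  # about 99 ps
```

The slower path ends up with less skew, which is why minimizing nominal delay alone does not necessarily minimize skew.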
Of course, matching is not perfect for adjacent wires or devices either. Strong
sensitivity of threshold voltage and saturation current on L at short channels also
limits matching for minimum-size devices; typically saturation current has a 3% mis-
match for minimum devices, and matching down to 1% is straightforward in larger
devices. Local mismatch is an important limit for phase detector offset in PLL and
DLL systems.
Time-varying effects include capacitive and inductive coupling between signal and
clock lines and signal-dependent capacitance. Careful layout can minimize the ca-
pacitance between signal lines likely to switch near clock edges and clock wires, but
signal coupling is still important because it can be a significant source of jitter. We
will assume that up to 5% of the capacitance of any wire may transition during the
time a clock edge propagates.
Temperature changes on a chip are generally many orders of magnitude slower
than the clock speed, and are therefore reasonably treated as static gradients. On
the other hand, supply voltage can change within a single clock cycle in response
to changing load current. For this reason, temporal correlation is important when
matching elements that depend on supply voltage. An example where this is signifi-
cant is described in Section 2.4.4.
2.4 Clock Architecture Comparison
While a number of authors have considered the impact of variations on clock perfor-
mance, most assume tree distribution [52, 41, 63]. This section establishes a common
metric and compares several clock architectures.
2.4.1 Clock metric
The three categories of mismatches listed above cover what is needed for a first-order
comparison of clock networks. For normalization, each is scaled to distribute a 1 GHz
clock to a total of 200pF load capacitance over a 2cm chip in a standard 0.25µm
CMOS process. A clock wire in a TSMC 0.25µm CMOS process would be 1µm wide,
have a resistance of about 0.07Ω/µm, and a capacitance of 0.1fF/µm.
It would be convenient to choose a single parameter to characterize clock networks.
As discussed earlier, skew and jitter are in general functions of both position and
time. It is appropriate to consider the worst case clock uncertainty over time, but
meaningless to look at worst case across a chip: in all practical cases a signal that
takes longer than a clock cycle to propagate would be pipelined, and hence re-clocked.
Hence, clock uncertainty between points on a chip further apart than one clock cycle is
irrelevant. For this reason, the metric for clock quality will be taken to be worst-case
clock mismatch over a distance corresponding to signal propagation distance during
one half of a clock cycle.
Figure 2-8: Schematic model of capacitive coupling (coupling capacitance 0.05C)
2.4.2 Tree
Propagation delay along an H-tree can be split into delay from the root to the leaves,
and delay from the leaves to a sub-block or tile. Delays to loads from a leaf are
generally not matched, so the entire delay in a sub-block adds directly to total skew;
this is sometimes called internal clock skew [14, 63]. The point of an H-tree, however,
is to match delays from the root to the leaves, so those delays are nominally matched,
and only variations contribute to skew. Consider an 8-level H-tree (i.e., one with
$2^8 = 256$ leaves). Assuming equal-sized buffers along the tree, these buffers would be
placed at intervals of perhaps 2mm, for a total of 10 segments.
Delay along the tree in this example is simulated to be 0.86ns. Assuming a = 0.1,
skew caused by gradient mismatch is 0.86ns × 0.1 = 86ps. Internal skew ($S_i$) is no
larger than 0.07Ω/µm × 625µm × 0.2pF ≈ 9ps.
Capacitive coupling adds a time-varying offset. Fig. 2-8 shows the schematic
model used to test the effect of capacitive coupling. The effect may be estimated by
adjusting the effective line capacitance for the Miller-multiplied coupling capacitance.
In the current example, the line capacitance is 200fF, the output capacitance of the
driving buffer is 34fF, and the input capacitance to the receiving buffer is 77fF. A
signal making a transition in the same direction as the clock lowers the effective wire
capacitance by 5% (given the assumptions above), so the delay should decrease by
$\frac{0.05 \times 200}{200 + 111} \approx 3\%$. Conversely, a signal transitioning in the opposite direction will slow
down the clock by the same 3%, so the total would be up to 6% variation. (Simulation
indicates the total variation is 5%). This component of uncertainty - skew if the
interference recurs on every clock cycle, jitter if it is inconsistent - also scales with
the total delay along the tree, and so adds a worst-case 45ps to clock uncertainty.
To sum up, a clock distributed by a tree as described above will have skew of 140
picoseconds, or 14% of the clock cycle; this is in line with industrial results given the
speed and assumptions about the process.
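The skew budget for the tree example can be re-derived from the numbers quoted above. This is a sketch only: the variable names are assumptions, and the coupling term uses the simulated 5% total variation applied to the full tree delay, which gives 43ps where the text rounds to 45ps.

```python
# Values quoted in the text for the 8-level H-tree example.
TREE_DELAY_PS = 860.0    # simulated root-to-leaf delay
A_MISMATCH = 0.1         # mismatch coefficient a
R_PER_UM = 0.07          # wire resistance, ohm/um
PATCH_WIRE_UM = 625.0    # wire length inside a leaf patch
PATCH_CAP_PF = 0.2       # capacitance used in the internal-skew bound

mismatch_ps = TREE_DELAY_PS * A_MISMATCH               # 86 ps
internal_ps = R_PER_UM * PATCH_WIRE_UM * PATCH_CAP_PF  # ohm * pF = ps, about 9 ps

# Coupling: 5% of the 200 fF line, relative to line plus buffer capacitances.
coupling_fraction = 0.05 * 200.0 / (200.0 + 34.0 + 77.0)  # about 3% per direction
coupling_ps = 0.05 * TREE_DELAY_PS                        # simulated 5% total variation

total_ps = mismatch_ps + internal_ps + coupling_ps        # close to the quoted 140 ps
```

The unrounded total comes to roughly 138ps, consistent with the 140ps figure once each component is rounded as in the text.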
Generalization
We can generalize from this example to other trees. Fig. 2-9(a) shows how the two
components of skew change with the depth of the tree, n. (The tree of this example
had n = 8.) As argued above, both mismatch and coupling cause skew proportional
to wire length L from root to leaves of the tree; in units of chip size, $L = 1 - (1/2)^{n/2}$.
Internal skew scales with the area of the resulting patch (see footnote 2), so $S_i \propto 2^{-n}$.
The other key parameter is power. Power scales linearly with switched capacitance,
so the clock distribution power (excluding the load) scales as $2^{n/2}$. Fig. 2-9(b)
combines the results into a plot of the fundamental clock network tradeoff between
power and performance.
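The scaling relations of this generalization can be tabulated for different tree depths. The sketch below makes several assumptions: it normalizes to the n = 8 example (131ps of length-scaled skew, from 86ps mismatch plus 45ps coupling, and 9ps of internal skew), it takes internal skew to be proportional to patch area, i.e. to 2 to the power -n, and the function name `tree_tradeoff` is hypothetical. It is not the calculation behind Fig. 2-9.

```python
def tree_tradeoff(n, length_skew_ref_ps=131.0, internal_ref_ps=9.0, n_ref=8):
    """Scale the n_ref-level example to an n-level H-tree.

    Returns (length-scaled skew in ps, internal skew in ps, relative power).
    """
    def wire_len(k):
        # Root-to-leaf wire length in units of chip size.
        return 1.0 - 0.5 ** (k / 2)

    length_skew = length_skew_ref_ps * wire_len(n) / wire_len(n_ref)
    internal_skew = internal_ref_ps * 2.0 ** (n_ref - n)  # proportional to 2^-n
    power = 2.0 ** ((n - n_ref) / 2)                      # proportional to 2^(n/2)
    return length_skew, internal_skew, power

# Total skew and relative distribution power for a few depths.
rows = [(depth, sum(tree_tradeoff(depth)[:2]), tree_tradeoff(depth)[2])
        for depth in (4, 6, 8, 10)]
```

Shallow trees are dominated by internal skew and deep trees by length-scaled skew, while distribution power rises steadily with depth.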
Footnote 2: Strictly speaking, it scales with length squared, but that is equivalent to area for non-pathological patches.
Figure 2-9: Clock tree tradeoffs. (a) Skew components in a tree vs. tree depth (area-scaled skew, length-scaled skew, and total). (b) Power vs. skew for a clock tree.
Scaling
Note, however, that a clock tree does not scale well with process technology. As
chip dimensions shrink, wire delay (T) is, at best, constant. Total chip size is also
nearly constant. However, clock speeds increase as the gate delay decreases. Delay
along the clock net also speeds up, but not by the same factor. Along an optimally
buffered line, the ratio of gate delay (d) to T is constant, so as d falls, the distance
between buffers decreases. Wire delay is proportional to the square of the wire length
between buffers ($l$). Hence $l \propto \sqrt{d}$. The total number of segments is proportional to
$1/l$, so the total delay along a tree is proportional to $d/\sqrt{d} = \sqrt{d}$. Since the clock
period is directly proportional to d, skew as a fraction of the clock period will grow
as $1/\sqrt{d}$ as gate delay falls. In other words, without a dramatic redesign or process
improvements, a 4GHz clock tree would have unpredictable clock skew of 30% of a
clock period, and a 16GHz clock would have to budget over half of the clock period
for skew and jitter margin.
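The 4GHz and 16GHz projections follow from the square-root scaling argument. A minimal sketch, assuming the 1GHz tree example (skew of 14% of the clock period) as the reference and clock frequency inversely proportional to gate delay d; the function name `skew_fraction` is hypothetical.

```python
import math

def skew_fraction(clock_ghz, ref_fraction=0.14, ref_ghz=1.0):
    """Tree skew as a fraction of the clock period, scaled from a reference design.

    Assumes clock frequency is proportional to 1/d and that the skew
    fraction grows as 1/sqrt(d), i.e. as the square root of frequency.
    """
    return ref_fraction * math.sqrt(clock_ghz / ref_ghz)

at_4ghz = skew_fraction(4.0)    # about 0.28, i.e. roughly 30% of the period
at_16ghz = skew_fraction(16.0)  # about 0.56, i.e. over half of the period
```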
Note that as clock speed increases, signal delay across a chip exceeds a single
clock cycle. In the example above, a 2cm-long wire has a delay of 0.86ns with 1GHz
clocks. Scaling to 4GHz, the same wire (with optimal buffering) will have a delay of
approximately 0.43ns, compared to a clock period of 0.25ns. Given the metric defined
in Section 2.4.1, therefore, there is no reason to minimize global skew at all. In a tree,
however, the worst-case skew occurs between nearest neighbors, so tree distribution
cannot take advantage of the relaxed global constraints. This is the fundamental
reason why trees become less attractive at high clock speeds.
Figure 2-10: Grid distribution block schematic
2.4.3 Grid
A pure grid network would have a single, central driver for the entire chip and a mesh
of clock wires. Skew would be simply the wire delay across the chip, just as it is the
wire delay in a patch for each leaf of a tree. In the limiting case, a clock plane with a
central driver would give skew of $0.07\,\Omega/\mu\mathrm{m} \times 0.1\,\mathrm{fF}/\mu\mathrm{m} \times (10^4\,\mu\mathrm{m})^2 = 0.7\,\mathrm{ns}$ (see footnote 3). Clearly,
a single driver will not give adequate performance, so modern grids are H-tree-grid
hybrids: a short H-tree distributes clock to a few (4 or 16, for example) buffers around
a chip, and those buffers drive a clock grid in parallel, as shown in Fig. 2-10. The
final patches are larger than those typical of trees, but the grid helps eliminate skew
caused by the tree distribution by shorting together outputs of multiple buffers.
Footnote 3: Scaling this value down to the size of the first Alpha gives skew ~ 200ps, which was reported for that chip.
Figure 2-11: Model circuit for shorted grid drivers.
Take as an example system a 4-level ($2^4 = 16$ node) clock tree where the final
buffers drive a global grid. Following the example of the previous section, such a tree
would have seven 2mm-long segments and an expected clock uncertainty of 70ps. Delay
across each region, assuming a lumped model with minimum-width wires, would give
a skew of 2.5mm × 70Ω/mm × 6.25pF ≈ 1ns. Because this skew is dominated by
wire resistance and load capacitance, it can be reduced by increasing the width of the
wires at the cost of increased power. At the point where the capacitance of the wires
equals the load capacitance there is one clock wire every 200µm, and the expected
wire skew is 89ps (85ps simulated).
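The two wire-delay estimates used for the grid can be reproduced directly from the per-unit wire parameters. This is a sketch of the arithmetic only, not of the simulation; the function names are assumptions, and the estimates are the simple products quoted in the text.

```python
R_PER_UM = 0.07      # ohm per um of minimum-width clock wire
C_PER_UM = 0.1e-15   # farad per um

def plane_skew_s(length_um):
    """Wire delay estimate r*c*l^2 across a span of the given length."""
    return R_PER_UM * C_PER_UM * length_um ** 2

def lumped_region_skew_s(length_mm, r_per_mm, load_f):
    """Lumped RC estimate: total wire resistance times load capacitance."""
    return length_mm * r_per_mm * load_f

central_driver = plane_skew_s(1e4)                  # 1 cm from a central driver
region = lumped_region_skew_s(2.5, 70.0, 6.25e-12)  # one region of the 16-buffer grid
```

The first estimate evaluates to 0.7ns and the second to about 1.1ns, matching the rounded values in the text.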
Furthermore, shorting the buffers together helps drive down some of the uncer-
tainty at the cost of increased short-circuit power during switching and somewhat
slower edge rates. A simple circuit model for a grid driven from multiple points is
shown in Fig. 2-11. Simulations with a 70 picosecond skew on buffer inputs show
a total skew of 145ps, of which 55ps is due to the input skew. It is possible to keep
driving this lower by increasing wire width; however, the benefits of wider wires get
incrementally smaller as the wire capacitance comes to dominate the total. Doubling
the wire width again, for example, lowers total skew to 110ps, of which 34ps is due
to the input.
The drawback, of course, is the power dissipation. The extra wiring needed to get
skew down to 110ps added 25pF of capacitance per buffer, while the clock load per buffer
is only 12.5pF. Still, grid distribution is used because much of the skew is predictable
and, unlike with H-trees, the clock design is largely independent of floorplanning.
Figure 2-12: Power vs. skew for a grid (horizontal axis: skew, ps).
Generalization
The primary parameter for a gridded clock is the capacitance of the grid (C); that
sets both the power dissipation ($P \propto C$) and the wire skew. $S_i$ is proportional to
$1 + C_L/C$ where $C_L$ is the load capacitance and C the grid capacitance. Mismatch-induced
skew is shorted out by lower-resistance wires, so that component of skew falls
as $1/C_L$. A plot of simulated power dissipation vs. skew, corresponding to Fig. 2-9(b),
is shown in Fig. 2-12.
Scaling
Grid distributions depend only on wire delays. As mentioned above, wire delays tend
not to improve with process technology scaling. As the skew budget decreases with
rising clock speed, a grid clock must either increase capacitance or subdivide the chip
further with a deeper initial clock tree. In the example above, the initial tree itself
does not add significant power, so an obvious scaling strategy would be to simply
make larger trees to minimize Si.
As long as delay variations in the initial tree are comparable to rise time, deeper
trees and smaller Si will improve performance. However, rise time scales linearly
with d, so by the same reasoning as applied to the tree scaling arguments, skew
as a fraction of rise time will increase with $1/\sqrt{d}$ as gate delay falls. When the tree
skew exceeds rise time, short circuit power dissipation increases rapidly, and the clock
edges begin to show an unacceptable kink. Fig. 2-13 shows simulated edge shapes
with increasing input skew for a grid driven from a 4-level tree with skews from 0 to
200ps, and Fig. 2-14 shows the corresponding short circuit power dissipation.
Figure 2-13: Simulated edge in a grid with skew to the drivers (voltage vs. time).
2.4.4 Active Feedback
As is evident from the sections above, an increasing share of skew comes from the
initial long-distance distribution of a clock to relatively small loads. A delay-locked
loop (DLL) could be adapted to measure and cancel out wire variations. One possible
implementation is shown in Fig. 2-15, where a DLL is used to implement a single wire
with low effective delay. The intuition is that the delays are adjusted symmetrically
until the round trip time from the source to the load and back is a known multiple
of a clock period (in line with the examples so far, assume the round trip time is
2ns, which is 2 clock periods). Then by symmetry, the signal arrives at the load
with a 1 period clock delay, which means it has effectively 0 delay for clock signals.
Unfortunately, this intuition is misleading.
Figure 2-14: Short circuit power in a grid vs. input tree skew (horizontal axis: input skew, ps).
Figure 2-15: Low-skew wire with DLL
Despite the apparent symmetry, there is little reason for the forward path to
match the reverse path in this connection for two main reasons. First, the nominally
matched buffers are physically separated. In Fig. 2-15, b1 should match b7, although
it would be physically near b8. b1 isn't as far away from its matched pair as it might be
in a tree, but it will still typically be millimeters away. Second, there is no temporal
correlation. The clock signal passes w1 at a different time than it passes w7, so
any time-dependent variations, including those due to power supply and capacitive
coupling, do not match. Taking the results from Section 2.4.2, the effective skew for a
1cm-long DLL wire would be ~ 90ps, which is only a 30% improvement over a simple
wire, and that does not count offset in the comparison of the two edges or mismatches
in the delay cells.
Figure 2-16: Matching tree leaves with a DLL
Another approach, more like a traditional DLL, is shown in Fig. 2-16. The global
clock is distributed to two half H-trees, a phase comparison is done at the leaves, and
a variable delay is adjusted to align the clocks. The technique is meant to balance
delays along path 1 ($d_1$) and path 4 ($d_4$) in this example. Note, however, that while
nodes A and B may be matched, nodes C and D are not; the mismatch between
nodes C and D ($m_{CD}$) is $(d_1 + d_3) - (d_4 + d_6)$. The loop drives $d_1 + d_2 = d_4 + d_5$, so
$m_{CD} = (d_3 - d_2) - (d_6 - d_5)$, which is somewhat smaller than it would be without the
DLL (in which case $m_{CD} = (d_1 - d_4) + (d_3 - d_6)$) because $w_2$ and $w_5$ are both closer
together, and shorter, than paths 1 and 4.
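The residual mismatch between nodes C and D can be checked with concrete numbers. The sketch below uses hypothetical delays and assumes, purely for illustration, that the loop applies its correction to path 1; the function name `mismatch_cd` is also an assumption.

```python
def mismatch_cd(d, use_dll):
    """Mismatch between nodes C and D for path delays d[1]..d[6] (ps).

    With the DLL, a correction is added to path 1 so that
    d1 + d2 = d4 + d5 (nodes A and B aligned).
    """
    d1 = d[1]
    if use_dll:
        d1 += (d[4] + d[5]) - (d[1] + d[2])  # correction chosen by the loop
    return (d1 + d[3]) - (d[4] + d[6])

# Hypothetical delays: long top-level paths 1 and 4, short local paths 2, 3, 5, 6.
delays = {1: 430.0, 2: 52.0, 3: 50.0, 4: 400.0, 5: 49.0, 6: 51.0}

open_loop = mismatch_cd(delays, use_dll=False)   # (d1 - d4) + (d3 - d6)
closed_loop = mismatch_cd(delays, use_dll=True)  # (d3 - d2) - (d6 - d5)
```

With these values the open-loop mismatch is 29ps and the closed-loop mismatch is -4ps: the large difference between the long paths is removed, and only the mismatch among the short local paths remains.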
An immediate generalization would be to break up the trees further, have two
more comparators, and variable delay elements, as in Fig. 2-17. (Note the difference
between Fig. 2-17 and Fig. 2-18. The latter generalization requires matching between
delay elements D2 and D5, and between D3 and D6; the former does not require that
the delay elements match at all.) Because delays to the leaves are controlled by DLLs,
the top-level tree structure is no longer necessary; Fig. 2-19 shows a DLL distribution
where each DLL drives a local tree. Static delay variations of nearest neighbors are
cancelled out by the DLL to within the precision of the matching of the comparators.
Figure 2-17: Matching tree leaves with two DLLs
Figure 2-18: Matching tree leaves with two DLLs, which requires delay cell matching
Figure 2-19: DLL architecture
Dynamic variations, due to supply noise or signal coupling, however, persist; two
1cm-long paths with active DLL matching will have a relative jitter of approximately
50ps (all of it time-varying), and skew from mismatch in the phase detectors, and
some mismatch from distribution along local trees. A typical phase detector has a
delay equal to 2 inverters, and its two halves are physically close together, so skew
is expected to be approximately 2 x 5% x d ~ 10ps. As drawn, the maximum skew
in the network is not between two paths connected with a DLL; rather, the skew
between A and B is the sum of the skews through three DLLs (10ps each) and four
local trees (25ps each). Total clock uncertainty between A and B, then, is 180ps and
the scaling is even worse because the effective distance between two nearby points
grows rapidly as the number of DLLs increases. A much better result can be obtained
by using DLLs that take multiple reference inputs, and adjust output phase to be
aligned exactly between the two inputs. The network can then be redrawn somewhat
more symmetrically, as Fig. 2-20. (For clarity, the local tree was not drawn, and the
connections to the comparators are abstracted.)
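One decomposition consistent with the 180ps total quoted above adds the skew of the chained DLLs and local trees to the 50ps of time-varying jitter. This is a sketch of that bookkeeping; treating the three contributions as a simple worst-case sum is an assumption, as are the names used.

```python
DLL_SKEW_PS = 10.0         # phase-detector mismatch per DLL
LOCAL_TREE_SKEW_PS = 25.0  # skew of each local tree
JITTER_PS = 50.0           # time-varying part for 1 cm paths

def worst_case_uncertainty_ps(n_dlls, n_trees):
    """Sum the skew of chained DLLs and local trees, plus path jitter."""
    return n_dlls * DLL_SKEW_PS + n_trees * LOCAL_TREE_SKEW_PS + JITTER_PS

a_to_b = worst_case_uncertainty_ps(n_dlls=3, n_trees=4)  # 30 + 100 + 50
```

Because the DLL and tree counts between two nearby points grow with the size of the chain, the total grows quickly as DLLs are added, which is the scaling problem noted in the text.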
Figure 2-20: Multi-input delay cell DLL architecture
Optimization of the number of tiles is straightforward. As argued
previously, internal skew scales with tile area, so as the number of tiles increases,
internal skew falls. However, every boundary between tiles introduces some skew
because of mismatch in the phase detector. Hence, as the number of tiles increases, the
number of boundaries increases. Fig. 2-21 shows the optimization curves calculated
for this clock metric.
Figure 2-21: Tile number optimization (area-scaled skew, boundary skew, and total vs. number of tiles)
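The shape of this tradeoff can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration: the 100ps single-tile skew, the 10ps per-boundary skew, the choice to charge one boundary per tile crossed along a row of a square array, and the function name `tile_skew_ps`. It is not the calculation behind Fig. 2-21.

```python
import math

def tile_skew_ps(n_tiles, single_tile_skew_ps=100.0, boundary_skew_ps=10.0):
    """Illustrative total skew for a square array of n_tiles tiles.

    Internal skew falls as 1/n_tiles (it scales with tile area); boundary
    skew is charged once per tile boundary crossed along one row.
    """
    internal = single_tile_skew_ps / n_tiles
    boundary = boundary_skew_ps * (math.sqrt(n_tiles) - 1)
    return internal + boundary

counts = [k * k for k in range(1, 9)]  # 1, 4, 9, ..., 64 tiles
best = min(counts, key=tile_skew_ps)   # tile count with the lowest total
```

Under these assumed parameters the total is minimized at an intermediate tile count (9 tiles here), since internal skew dominates for few tiles and boundary skew dominates for many.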
One inherent weakness of DLL networks is that DLLs are inherently sensitive
to input jitter. A phase-locked loop, (PLL), though somewhat more complicated in
implementation, filters out noise on the inputs. PLLs and DLLs are nearly identical
structures in isolation. Each has a variable delay element as a core, represented in
Fig. 2-22(a). An input signal with phase 0 is delayed by some time A and output with
phase q. In both the DLL and PLL cases (Fig. 2-22(b) and Fig. 2-22(c)), A = - 0.
The only difference is where the input signal comes from. If the input to the block is
Figure 2-21 legend: area-scaled skew, boundary skew, total.
Figure 2-22: A variable delay element and phase comparator can be configured into a DLL or a PLL. Panels: (a) variable delay block, (b) delay-locked loop, (c) phase-locked loop.
φ, the system acts as a PLL; if it is θ, a DLL. The noise and stability implications of
the feedback will be considered in the next chapter.
Scaling
As in other clock networks, faster clocks require a more finely-grained architecture.
Jitter in a DLL network will rise in exactly the same way as it increases in clock
trees, and for the same reasons. Skew scales linearly with d because it is comprised
of comparator mismatches and delays across each leaf-patch. Note, however, that
in a PLL the noise can be expected to scale with d; a PLL network like the one in
Fig. 2-20 would have total clock uncertainty that is a constant fraction of the clock
period.
Chapter 3
Synchronization and Stability
The purpose of an on-chip clock is to synchronize computation. Distributed networks
make explicit this synchronization. Chapter 2 argues that the performance of dis-
tributed clock networks scales favorably with clock speed (or at least does not scale
as poorly as do clock trees). This chapter gives some background on synchronization
architectures and then considers the synchronization of multiple oscillators.
3.1 Previous Work: Synchronization
There are two main synchronization schemes. In the first, handshaking guarantees
that computation proceeds in the correct order, although independent processes
are not synchronized in any way. In the second, a global clock is used to synchronize
data, but the generation of the global clock is split among multiple blocks
that must align their respective clocks.
3.1.1 Local Data Synchronization
The earliest distributed networks dealt with synchronization of data explicitly, rather
than of multiple clocks. The archetypical example of this is large processor arrays.
It has been suggested that the computational density available in modern VLSI be
used to build large arrays of simple processors which communicate only with nearest
neighbors [21, 20, 15, 16]. Since skew is only relevant between communicating proces-
sors [7], trees do not seem well suited to the problem: there is no reason to eliminate
global skew as long as the clock skew between neighboring processors is low. This can
be accomplished by having each processor synchronize directly with its peers.
So-called self-timed systems use handshaking between the blocks for synchroniza-
tion [21, 41]. Each communication path between two blocks is accompanied by extra
signals that implement some manner of flow control. For example:
1. The processor sending data puts the data on the wire and asserts a Data Ready
signal.
2. The receiving processor reads the data and then asserts a Data Accepted
signal.
3. Data Ready is unasserted.
4. Data Accepted is unasserted.
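The four steps above can be sketched as a small simulation. This is only an illustration of the signal ordering, not a circuit from the cited self-timed designs: the `Link` class, its field names, and the `transfer` function are hypothetical, and the sender and receiver are collapsed into one sequential function so that each transition is explicit.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Wires shared by one sender/receiver pair (names are illustrative)."""
    data: object = None
    data_ready: bool = False
    data_accepted: bool = False
    trace: list = field(default_factory=list)


def transfer(link, value):
    """Move one value across the link using the four steps listed above."""
    # Step 1: the sender drives the data wires, then asserts Data Ready.
    link.data = value
    link.data_ready = True
    link.trace.append("ready=1")

    # Step 2: the receiver sees Data Ready, reads the data, asserts Data Accepted.
    assert link.data_ready
    received = link.data
    link.data_accepted = True
    link.trace.append("accepted=1")

    # Step 3: the sender sees Data Accepted and unasserts Data Ready.
    assert link.data_accepted
    link.data_ready = False
    link.trace.append("ready=0")

    # Step 4: the receiver sees Data Ready fall and unasserts Data Accepted.
    assert not link.data_ready
    link.data_accepted = False
    link.trace.append("accepted=0")
    return received


link = Link()
words = [transfer(link, w) for w in (0x12, 0x34)]
```

Each transfer costs four signal transitions on top of the data itself, which is one concrete form of the circuit overhead discussed next.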
Because no global synchronization is needed, self-timed systems are an example
of an asynchronous system. Such systems have several advantages over globally syn-
chronized systems: there is no global clock to propagate, and each block can work at
its actual speed rather than the global worst-case clock speed [21]. However, there
are several significant drawbacks: there is circuit overhead in generating the local
synchronization signals; the designs are notoriously hard to analyze and test; and
often the system operates at the worst-case time anyway, because computation is
always limited by the latest input [15, 41, 42]. The approach suggested by El-Amawy
[16] avoids some of these problems by having a system that looks fully synchronous,
albeit with some local clock skew. However, there is still no global synchronization,
and communication is only allowed between neighboring processors. Despite these
drawbacks, asynchronous systems are an alternative to global clocking, and may be-
come more prevalent if the prospects of very high speed clock distribution are not
improved.
Figure 3-1: Mode-locking example (clock signals at Nodes 1 through 4 versus time)
3.1.2 Local Clock Synchronization
The proposed clock distribution architecture is organized as a synchronous array.
That is, clocks are generated at multiple places over the chip and controlled to have
the same phase and frequency. This approach has not been used in integrated clocks,
but it has been proposed for parallel computers, and some of the issues are similar
[40]. Pratt and Nguyen suggest constructing a clock for a parallel computer from
synchronized, voltage-controlled quartz crystal oscillators. Phase detectors and inte-
grators generate phase error signals, and these are used to pull the crystals to the
same phase and frequency.
While the desired, phase-locked configuration can be proven stable, it is possible
that some arrangement of unequal clock phases is also stable on a given network;
this effect is known as mode-locking. In the simplest example, a system consisting of
four nodes is stable although the phases are not equal, as shown in Fig. 3-1. Each
node sees one neighbor leading and one lagging, and therefore doesn't adjust. The
authors show that mode-locking can be avoided in a regular mesh with nonlinear
phase detectors, which they implement as balanced XOR gates.
This architecture is inconvenient for on-chip clock distribution for several rea-
sons. First, modern microprocessors are not organized as regular structures inter-
nally; memory caches and ALUs have vastly different clocking needs. Therefore it
will be necessary to remove the constraint that the clock nodes form a regular array.
Second, this method depends on having relatively noise-free, well-matched crystal os-
cillators, but such oscillators are not available on chip, and what is available has much
worse short-term stability. Therefore, the phase comparators and stabilization net-
work must be completely redesigned to compensate for the noisier oscillators. Third,
they assume that wire delays between nodes are negligible; on an IC, these delays are
the very heart of the problem.
3.2 Proposed Clock Architecture
The proposed distributed clock network is an array of synchronized PLLs. Independent
oscillators generate the clock signal at multiple points ("nodes") across a chip; each
oscillator distributes the clock to only a small section of the chip ("tile") (Fig. 3-2).
Phase detectors (PD) at the boundaries between tiles produce error signals that are
summed by an amplifier in each tile and used to adjust the frequency of the node
oscillator. In general, the network need not be square or regular.
With locally generated clocks, there are no chip-length clock lines to couple in jit-
ter; skew is introduced only by asymmetries in phase detectors instead of mismatches
in physically separated buffers; and the clock is regenerated at each node, so high
frequency jitter does not accumulate with distance from the clock source. Unlike
earlier work on multiple clock domains which suggested the use of multiple indepen-
dent clocks, this approach produces a single fully synchronized clock. The rest of this
chapter examines small and large signal stability of a distributed phase-locked loop.
3.3 Small Signal
In a multiple-oscillator PLL, large- and small-signal behavior are interrelated. In
normal operation, the oscillators are phase-locked, and jitter depends on the network
response to noise. Because startup is expected to take a negligibly small fraction of
time, the connection of the oscillators is optimized for small-signal behavior rather
than to make initial acquisition more efficient. The linearized small signal behavior,
valid when the oscillators are nearly in phase, is analyzed first.
3.3.1 General Derivation
A traditional phase-locked loop (PLL) consists of three components: a voltage con-
trolled oscillator (VCO), a phase detector (PD), and a low-pass loop filter, connected
as shown in Fig. 3-3. In a digital application like clock generation, the output of the
oscillator is a square wave, and the phase detector generates a signal that on average
is related to the difference in phase between two square waves. Clearly, both the
oscillator and the phase detector are nonlinear in a strict sense. However, there is an
approximately linear relationship between the input voltage of the oscillator and the
phase of the output square wave. The relationship between the input phase difference
and averaged output of the phase detector is also linear. Hence, the system can be
modeled as a linear feedback system (Fig. 3-4). The system as drawn in Fig. 3-4 is
described by:
\[ \phi = a H(s) \frac{1}{s} (u - \phi) \tag{3.1} \]
\[ \phi = \frac{a H(s)}{s + a H(s)} \, u \tag{3.2} \]
where u is the input phase and \phi is the output phase. The poles of the system are the solutions of
\[ \frac{a H(s)}{s} + 1 = 0 \tag{3.3} \]
Substituting H(s) = (s + z)/s into Eq. 3.3 gives
\[ a(s + z) + s^2 = 0 \tag{3.4} \]
which is a familiar result for a simple phase locked loop.
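As a quick numerical check of Eq. 3.4, the sketch below solves s^2 + a s + a z = 0 with the quadratic formula. The function names and the normalized values of a and z are illustrative assumptions, not figures from the thesis; the damping ratio expression follows from matching the quadratic to the standard second-order form.

```python
import cmath


def pll_poles(a, z):
    """Roots of s**2 + a*s + a*z = 0, which is Eq. 3.4 rearranged."""
    disc = cmath.sqrt(a * a - 4.0 * a * z)
    return (-a + disc) / 2.0, (-a - disc) / 2.0


def damping_ratio(a, z):
    """Natural frequency is sqrt(a*z), so zeta = a / (2*sqrt(a*z)) = sqrt(a/z) / 2."""
    return 0.5 * (a / z) ** 0.5


# Illustrative, normalized numbers (not taken from the thesis).
critical = pll_poles(4.0, 1.0)      # a = 4z: repeated real pole at -a/2
underdamped = pll_poles(1.0, 1.0)   # a < 4z: complex-conjugate pair
```

In this simple model the poles are real whenever a >= 4z, which is one way to read the later requirement that the zero sit well below the unity-gain frequency.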
Exactly the same analysis applies to a network of coupled oscillators. Consider a
set of interlocked PLLs, as shown in Fig. 3-5.
The network can be modeled as a multivariable linear system; in fact, the block
Figure 3-2: Distributed clocking network (labels: chip boundary, tile boundary, phase detector, loop filter and VCO)
Figure 3-3: Standard phase-locked loop.
Figure 3-4: Linear system model of a standard phase-locked loop.
Figure 3-5: Multi-oscillator phase-locked loop
Figure 3-6: Linear system model of a multi-oscillator phase-locked loop
diagram (Fig. 3-6) is essentially identical to the one for a single oscillator system,
except that the connections between blocks are vectors instead of individual signals,
and the gains and transfer functions are matrices instead of scalars. This means that
the phase detector becomes a matrix A1 of size N(N + 1)/2 x N instead of a single
subtraction, and the loop filter becomes A2, a corresponding N x N(N+ 1)/2 matrix.
G = A2A1 is an intuitively meaningful N x N matrix. The network of oscillators
is similar to a lumped circuit C with a node for each oscillator and a branch for
each connection between pairs of oscillators. Node voltages in C represent oscillator
phase, and branch currents represent the error signals on the output of the phase
detector. G is the conductance matrix for C with unity conductance branches. G for
a 4 oscillator network is shown in Eq. 3.5. Each off-diagonal entry g_ij is -1 if there is
a phase detector between node i and node j, and 0 otherwise; each diagonal entry g_ii is
the number of detectors attached to node i.
\[ G = \begin{pmatrix} 3 & -1 & -1 & 0 \\ -1 & 2 & 0 & -1 \\ -1 & 0 & 2 & -1 \\ 0 & -1 & -1 & 2 \end{pmatrix} \tag{3.5} \]
DC gain in the loop can be lumped into a3.
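The relation G = A2 A1 can be checked numerically for the 4 oscillator network. The sketch below is a minimal illustration under two assumptions: the detector list is inferred from the pattern of Eq. 3.5 (one detector on the external reference plus four between neighboring oscillators), and A2 is taken as the transpose of A1 with unity weights. Only rows for detectors that actually exist are kept, and the helper names are hypothetical.

```python
def incidence_matrix(n_nodes, detectors):
    """One row per phase detector; None marks the external reference input."""
    rows = []
    for src, dst in detectors:
        row = [0] * n_nodes
        if src is not None:
            row[src] = 1       # detector output is phase[src] - phase[dst]
        row[dst] = -1
        rows.append(row)
    return rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Detector list assumed from the pattern of Eq. 3.5 (0-based node indices).
detectors = [(None, 0), (0, 1), (0, 2), (1, 3), (2, 3)]

A1 = incidence_matrix(4, detectors)
A2 = transpose(A1)      # assumption: unity-weight summing of detector outputs
G = matmul(A2, A1)      # conductance matrix of the equivalent lumped circuit
```

The diagonal of `G` counts the detectors on each node, and every -1 marks a pair of nodes joined by a detector, which matches the verbal description of G.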
Recasting Eq. 3.1 in matrix form gives Eq. 3.6,
\[ \phi = [sI + a_3 A_2 A_1 h(s)]^{-1} h(s) a_3 A_2 u \tag{3.6} \]
where u is now the phase error input to each phase comparator. In other words, u(1)
is the reference phase, and u(2) ... u(n) are the noise contributions from interconnect
and phase detector mismatch.
3.3.2 Examples
Matrix A1 is determined by the geometry of the tiles, and hence will be constrained by
the placement of clock loads, which for this problem is fixed. Assuming the simplest
possible phase-locked loop, h(s) = (s + z)/s. This leaves A2, a3, and z as design
variables.
There are still far too many choices to find the general optimum, but a few exam-
ples may help guide the search.
Single oscillator
The reference design is a single-oscillator phase-locked loop. Stability constraints of
a single oscillator PLL may be derived directly from Eq. 3.3; however, it is more
common and more intuitive to analyze the loop gain, ah(s)/s. Magnitude and phase
Bode plots of the loop gain are shown in Fig. 3-7. Note that because of sampling at
the phase detector, the continuous time approximation is only valid for frequencies
much lower than the oscillator frequency. The Bode plots below add multiple parasitic
poles at the clock frequency ω_c, to model the phase effects of the sampling. For the
PLL to be stable and sufficiently damped, the phase must be above -135° when the
loop gain is at 0 dB. This means that the unity-gain frequency, ω_0, should be much
lower than ω_c, and that the zero, z, should be much lower than ω_0. The location of
the dominant pole is not critical to the stability.
Figure 3-7: PLL loop gain Bode plots. Panels: (a) loop gain magnitude, (b) loop gain phase, both versus log(ω).
For a typical 1GHz oscillator, a = ω_0 ≈ 330MHz, consistent with the constraint
ω_0 < ω_c. In turn, this puts an upper limit of 50MHz on z. Fig. 3-8 shows the root
locus for this PLL over a gain error from -50% to 100%.
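The -135° criterion can be explored numerically. The sketch below evaluates the loop gain a h(s)/s with h(s) = (s + z)/s and a chosen number of parasitic poles at the clock frequency, finds the unity-gain frequency by bisection, and reports the phase there. It rests on stated assumptions: the text says only that "multiple" parasitic poles are added, so `N_PAR = 2` is a guess, and the quoted 330 MHz, 50 MHz, and 1 GHz are treated as being in one common frequency unit. The function names are hypothetical.

```python
import math


def loop_gain_mag(w, a, z, wc, n_par):
    """|a*h(jw)/(jw)| with h(s) = (s + z)/s and n_par poles at wc."""
    return a * math.hypot(w, z) / (w * w) / (1.0 + (w / wc) ** 2) ** (n_par / 2.0)


def loop_gain_phase_deg(w, z, wc, n_par):
    """Unwrapped phase: -180 from two integrators, lead from the zero, lag from parasitics."""
    lead = math.degrees(math.atan2(w, z))
    lag = n_par * math.degrees(math.atan2(w, wc))
    return -180.0 + lead - lag


def unity_gain_frequency(a, z, wc, n_par):
    """Bisection on a log scale; the magnitude falls monotonically with w."""
    lo, hi = 1e-6 * a, 1e6 * a
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if loop_gain_mag(mid, a, z, wc, n_par) > 1.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Values quoted in the text, all treated as one common frequency unit (assumption).
a, z, wc = 330e6, 50e6, 1e9
N_PAR = 2   # assumption: the text only says "multiple" parasitic poles

w0 = unity_gain_frequency(a, z, wc, N_PAR)
phase_at_w0 = loop_gain_phase_deg(w0, z, wc, N_PAR)
is_damped_enough = phase_at_w0 > -135.0
```

Raising `z` or `N_PAR` in this sketch pulls the phase at crossover down, which is the mechanism behind the upper limit on z.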
One dimensional array
A one-dimensional array of oscillators with phase detectors between neighbors is the
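first generalization of a single PLL. In a perfectly asymmetrical array (call this system
S1), the output of PLL i is the input to PLL i+1, as shown in Fig. 3-9. S1 is described
by
\[ A_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}, \quad A_{2,1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.7} \]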
Figure 3-8: Root locus for single-oscillator PLL with gain error (real axis scale x 10^8, imaginary axis scale x 10^7)
Figure 3-9: Asymmetrical one-dimensional PLL array
This system has multiple poles at the same place where a single-oscillator PLL has
single poles.
On the other hand, in a perfectly symmetrical array (call it S2 ), the input to each
oscillator i is the phase of oscillators i - 1 and i + 1 (Fig. 3-10). The A1 matrix is the
same because the physical arrangement of nodes is identical, but A2 changes:
\[ A_{2,2} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.8} \]
Figure 3-10: Symmetrical one-dimensional PLL array
To achieve the same phase margin in S2 as in S1, it is necessary to lower the gain a3.
This can be shown with a geometrical argument: in S2, when the phase of oscillator
i changes by Δφ, the change is measured at two phase detectors, so oscillator i feels
twice the feedback that it would have felt in S1, and at the same time, oscillators
i - 1 and i + 1 both adjust in the opposite direction, giving 4 times the effective gain.
Hence, the gain must be decreased by a factor of approximately 4. Mathematically,
the largest eigenvalue of A2,1 A1 is 1, but the largest eigenvalue of A2,2 A1 is 3.5.
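The two eigenvalue figures can be reproduced from the matrices of Eq. 3.7 and Eq. 3.8. The sketch below is a minimal check in plain Python. Power iteration is a technique chosen here for brevity, not one the thesis describes, and the helper names are hypothetical. For S1 the product is triangular, so its eigenvalues are read from the diagonal.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def largest_eigenvalue(g, iterations=500):
    """Power iteration; adequate here because G2 is symmetric with a distinct top eigenvalue."""
    v = [1.0] + [0.0] * (len(g) - 1)
    lam = 0.0
    for _ in range(iterations):
        w = [sum(gij * vj for gij, vj in zip(row, v)) for row in g]
        lam = sum(wi * vi for wi, vi in zip(w, v))      # Rayleigh quotient (v has unit norm)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return lam


A1 = [[1, 0, 0, 0], [-1, 1, 0, 0], [0, -1, 1, 0], [0, 0, -1, 1]]      # Eq. 3.7
A2_1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]        # Eq. 3.7
A2_2 = [[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1], [0, 0, 0, 1]]     # Eq. 3.8

G1 = matmul(A2_1, A1)   # lower triangular: eigenvalues are the diagonal entries
G2 = matmul(A2_2, A1)   # symmetric tridiagonal

lam1 = max(G1[i][i] for i in range(4))
lam2 = largest_eigenvalue(G2)
```

Here `lam2` comes out slightly above 3.5, in line with the rounded value quoted above and close to the factor of 4 from the geometrical argument.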
Poles of the symmetrical system, solved via Eq. 3.6¹, are plotted in Fig. 3-11. The
key difference between S1 and S2 is the systems' response to noise. In both cases,
noise at frequencies higher than the unity gain frequency ω_0 is attenuated. For
frequencies much lower than ω_0, the response can be calculated via Eq. 3.6. Fig. 3-12
shows a Bode plot of noise at node P in response to a noise source at node N.
Noise performance of S1 is much worse for intermediate frequencies because there is
no feedback so errors propagate forever. In S2, the feedback limits the influence of
preceding stages, and this in turn attenuates noise. For this reason, networks with
feedback are preferred, despite the more complicated stability calculation.
¹While it is possible to use Eq. 3.6 directly, it is often more convenient to take advantage of the
simple form of h(s), and rewrite the zero-input state equations thus:
\[ s \begin{pmatrix} \phi \\ \phi' \\ \phi'' \end{pmatrix} = \begin{pmatrix} 0 & I & 0 \\ 0 & 0 & I \\ -Gz & -G & -pI \end{pmatrix} \begin{pmatrix} \phi \\ \phi' \\ \phi'' \end{pmatrix} \tag{3.9} \]
Figure 3-11: Root locus for a one-dimensional array of PLLs.
Figure 3-12: Comparison of noise responses for symmetrical and asymmetrical networks (noise, 0 to -40, versus frequency, 0.001 to 1; curves: symmetrical, asymmetrical)
Two dimensional array
A two dimensional array is analyzed exactly the same way as is a one-dimensional
array, except that the gain has to decrease by another factor of two because the center
oscillators see four neighbors rather than two. A 16-element array in a 4 x 4 grid is
implemented in this thesis. Its G matrix and poles are shown below.
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0)
1 -3 1 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 -3 1 0 0 1 0 0 0 0 0 0 0 0 0
0 0 1 -2 0 0 0 1 0 0 0 0 0 0 0 0
1 0 0 0 -3 1 0 0 1 0 0 0 0 0 0 0
0 1 0 0 1 -4 1 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 -4 1 0 0 1 0 0 0 0 0
0 0 0 1 0 0 1 -3 0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0 -3 1 0 0 1 0 0 0
0 0 0 0 0 1 0 0 1 -4 1 0 0 1 0 0
0 0 0 0 0 0 1 0 0 1 -4 1 0 0 1 0
0 0 0 0 0 0 0 1 0 0 1 -3 0 0 0 1
0 0 0 0 0 0 0 0 1 0 0 0 -2 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 -3 1 0
0 0 0 0 0 0 0 0 0 0 1 0 0 1 -3 1
0 0 0 0 0 0 0 0 0 0 1 0 0 1 -2)
(3.10)
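A matrix of this form can be generated for any rectangular mesh. The sketch below builds it with the sign convention of Eq. 3.5 (positive diagonal); Eq. 3.10 as printed has the opposite sign. It then uses the Gershgorin row-sum bound, a standard estimate swapped in here and not a step from the thesis, to illustrate why the gain drops by roughly another factor of two. Placing the reference detector on node 0 and the function names are assumptions.

```python
def grid_conductance(rows, cols, ref_node=0):
    """Conductance matrix (sign convention of Eq. 3.5) for a rows x cols mesh."""
    n = rows * cols
    g = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((0, 1), (1, 0)):          # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    j = rr * cols + cc
                    g[i][j] = g[j][i] = -1
                    g[i][i] += 1
                    g[j][j] += 1
    g[ref_node][ref_node] += 1                       # detector on the external reference
    return g


def gershgorin_bound(g):
    """Upper bound on the largest eigenvalue: max over rows of sum |g_ij|."""
    return max(sum(abs(x) for x in row) for row in g)


G_2d = grid_conductance(4, 4)
G_1d = grid_conductance(1, 4)
bound_2d = gershgorin_bound(G_2d)   # set by the center nodes: 4 + 4
bound_1d = gershgorin_bound(G_1d)   # set by the interior nodes: 2 + 2
```

The bound doubles from 4 to 8 because the center nodes have four neighbors. It is an upper bound and not the eigenvalue itself, so it supports the factor of two only approximately.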
3.4 Large Signal: Mode Locking
The analysis of the previous section indicates that fully-connected networks should
have a better noise response than asymmetrical networks. However, the feedback
allows the possibility of undesirable large-signal modes. Consider the network of
Figure 3-13: Root locus for a two-dimensional array of PLLs.
Figure 3-14: Mode-locking example (clock signals at Nodes 1 through 4 versus time)
Fig. 3-5, and its associated matrices:
\[ A_1 = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \quad A_2 = A_1^T = \begin{pmatrix} -1 & 1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \\ 0 & 0 & 0 & -1 & -1 \end{pmatrix} \tag{3.11} \]
Because phase is periodic with period 2π, the phase measured at the phase detectors
is Δφ = A1 φ mod 2π. For small φ, (A1 φ mod 2π) = A1 φ, so the nonlinearity is
irrelevant. However, consider φ_π = [0, π/2, -π/2, π]^T. Because of the nonlinearity,
\[ A_2 (A_1 \phi_\pi \bmod 2\pi) = A_2 [0, -\pi/2, \pi/2, -\pi/2, \pi/2]^T = 0 \tag{3.12} \]
so φ_π is a stationary point. This is intuitively easy to see, in reference to Fig. 3-14:
each oscillator leads one neighbor, and lags behind another neighbor by exactly the
same amount. The net phase error is zero, so clearly there is no restoring force to drive
the oscillators into phaselock. Furthermore, this equilibrium point is stable, because
the nonlinearity does not change for small deviations from φ_π, so dynamics about φ_π
are the same as those about 0. The locking of a distributed oscillator to non-zero
relative phases has been called mode-locking [40]. At startup, each oscillator in a
distributed PLL starts at a random phase, so there is a nonzero chance of converging
to a mode-locked state. Simulations show that for a network like the one shown here,
the system ends modelocked from approximately 1/3 of random initial states. The probability
goes up rapidly with the size of the system; a 4 x 4 array ends up modelocked
well over 99% of the time.
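Eq. 3.12 and the stability claim can be checked with a short simulation. The sketch below is a simplified model, not the simulation the thesis refers to: it assumes ideal sawtooth detectors that wrap into [-π, π), a purely proportional loop (h(s) = 1, so each phase is driven directly by the summed error), forward-Euler integration, and an arbitrary small perturbation. The function names are hypothetical.

```python
import math

A1 = [[-1, 0, 0, 0], [1, -1, 0, 0], [1, 0, -1, 0], [0, 1, 0, -1], [0, 0, 1, -1]]  # Eq. 3.11
A2 = [list(col) for col in zip(*A1)]                                               # A2 = A1 transposed


def wrap(x):
    """Map a phase difference into [-pi, pi), as a detector would report it."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def detector_outputs(phi):
    return [wrap(sum(a * p for a, p in zip(row, phi))) for row in A1]


def restoring_drive(phi):
    """A2 (A1 phi mod 2 pi): the summed error each oscillator sees."""
    e = detector_outputs(phi)
    return [sum(a * x for a, x in zip(row, e)) for row in A2]


def settle(phi, gain_dt=0.1, steps=3000):
    """Forward-Euler integration of d(phi)/dt = -a * A2 (A1 phi mod 2 pi)."""
    phi = list(phi)
    for _ in range(steps):
        drive = restoring_drive(phi)
        phi = [p - gain_dt * d for p, d in zip(phi, drive)]
    return phi


phi_pi = [0.0, math.pi / 2, -math.pi / 2, math.pi]
drive_at_modelock = restoring_drive(phi_pi)     # zero: a stationary point

nudged = [p + d for p, d in zip(phi_pi, (0.1, -0.05, 0.08, -0.1))]
after_nudge = settle(nudged)                    # relaxes back to phi_pi
in_phase = settle([0.1, -0.05, 0.08, -0.1])     # relaxes back to zero phase
```

Under these assumptions the same small perturbation decays around both the in-phase state and the mode-locked state, so nothing in the linear dynamics distinguishes the two.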
Pratt and Nguyen proved several useful properties about systems in mode-lock.
The lemmas and theorem are repeated here with outlines of proofs, generalized to
include arbitrary (rather than Cartesian) networks.
Consider a system of oscillators to be a circuit, with oscillators at the nodes,
and connections between oscillators to be branches. (This is the same model as was
presented in Section 3.3.1). The phase counterpart to Kirchhoff's Voltage Law is:
Lemma 1 The sum of branch phase differences around any loop must be a multiple of 2π.
The sum is a multiple of 2π rather than 0 because phase differences here are defined
over a range [-π, π), so at any branch 2π might be added or subtracted to bring the
result into the right range. For example, a phase detector will measure the difference
between 5π/6 and -5π/6 as 5π/6 - (-5π/6) - 2π = -π/3, not 5π/3. This is true independent of mode-lock.
The second lemma derives from conditions for mode-lock: that is, the nodes are
in static equilibrium although the phases are not identical.
Lemma 2 If a set of oscillators is mode-locked, there must be at least one loop in
the network for which the sum of phase differences is a nonzero multiple of 2π.
The proof is as follows: in mode-lock, by definition, the nodes are not all at the
same phase. Therefore, there must be at least one node which connects to a branch
with nonzero phase error. Call that Node 1. Because Node 1 is in equilibrium by
definition of mode-lock it must connect to at least one branch with a positive phase
error. That branch connects to some Node 2, and appears as a negative phase error
there. Since Node 2 is also in equilibrium, it must have some other branch with an
offsetting positive phase error. Because there is a finite number of nodes, the loop
will eventually close back on Node 1. By Lemma 1, the sum must be a multiple of
2π. Because by construction, all the branches were positively-oriented, the sum must
be nonzero [40].
There are a number of ways to avoid mode-lock. The most obvious one is to simply
break the feedback: a consequence of Lemma 2 is that if there are no feedback loops,
there can be no modelock. This is not an attractive solution because, as shown in
the example with a one-dimensional array, full feedback helps average and attenuate
noise, so it would be best to avoid modelock without affecting the interconnection
of the system or the operation when correctly phase locked. One possible solution
would be to have a special startup state where there is no feedback between oscillators,
and then an operational state with full feedback. The system might be synchronized
during the startup, and then would remain phase-locked in the operational state. The
biggest drawback of this approach is that the transition from the reset state to the
operational state jolts the system, and could push it into mode-lock. Thus, it would
be preferable to have a solution that does not require changing network topology even
temporarily. Fortunately, there is such a way.
If we define a minimal loop as a loop in the graph that cannot be decomposed
into other loops, we can combine the results succinctly into:
Theorem 1 For a system in mode-lock, there must be a phase difference θ between
two oscillators such that θ ≥ 2π/n, where n is the number of nodes in the largest
minimal loop in the network.
By Lemma 2, there must be at least one loop (L) with a phase difference sum of at
least 2π. If it has more than n nodes, it cannot be a minimal loop. Decompose L into
L1 and L2. By Lemma 1, the loop sum around both L1 and L2 must be an integral
multiple of 2π, so at least one of them must have a loop sum of at least 2π; iterate
if necessary to get a loop of n or fewer nodes. Since the sum of the branch phase
differences must be at least 2π, at least one of the branches must have a phase difference of
at least 2π/n.
Theorem 1 suggests a way to distinguish between mode-locked states and the
desired 0-phase state: in mode-lock, there must be at least some large phase errors
across individual branches. If the gain of the phase detector is designed to be negative
for a phase difference larger than θ, then all mode-locked states are made unstable
without affecting the in-phase equilibrium. Pratt and Nguyen suggest that an XOR
phase detector precludes modelock in a rectangular network of oscillators because the
response decreases for phase errors larger than π/2 [40]. This result follows directly
from Theorem 1: in a rectangular array, the largest minimal loop has 4 nodes, so
θ = 2π/4 = π/2. Two other phase detectors are described in the next chapter, both
with θ < π/2, which would be useful in non-rectangular networks, and where more
gain near 0 phase is desirable.
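The link between Theorem 1 and the detector characteristic can be made concrete with an idealized model. The sketch below is an assumption-laden stand-in, not either of the circuits described in the next chapter: `triangular_detector` is a piecewise-linear response that rises with unit slope up to a corner angle and then falls linearly to zero at ±π, which for a corner at π/2 resembles the averaged XOR response. All names are hypothetical.

```python
import math


def modelock_threshold(n_largest_minimal_loop):
    """Theorem 1: some branch in a mode-locked network has |error| >= 2*pi/n."""
    return 2.0 * math.pi / n_largest_minimal_loop


def triangular_detector(delta, theta_c=math.pi / 2):
    """Idealized response: slope +1 up to theta_c, then falling linearly to 0 at +/-pi."""
    delta = (delta + math.pi) % (2.0 * math.pi) - math.pi     # wrap into [-pi, pi)
    if abs(delta) <= theta_c:
        return delta
    falling = theta_c * (math.pi - abs(delta)) / (math.pi - theta_c)
    return math.copysign(falling, delta)


def detector_gain(delta, theta_c=math.pi / 2, h=1e-6):
    """Numerical slope of the characteristic at delta (central difference)."""
    up = triangular_detector(delta + h, theta_c)
    down = triangular_detector(delta - h, theta_c)
    return (up - down) / (2.0 * h)


theta_rect = modelock_threshold(4)                  # rectangular mesh: pi/2
gain_small = detector_gain(0.1)                     # positive near zero phase error
gain_large = detector_gain(3.0 * math.pi / 4.0)     # negative beyond the corner
```

With the corner at or below `modelock_threshold(n)`, every branch error large enough to sustain mode-lock falls on the negative-gain part of the curve, while small errors near the in-phase state still see positive gain.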
Chapter 4
Implementation and Testing
Distributed Clocks
Two test chips were made to explore implementation issues: how much power do the
oscillators require? How much area is needed for the compensation filters? Can a
real loop, with the buffer and wire delays be stabilized? The first was a 4-oscillator
chip in a 0.6 µm double-poly CMOS process with a clock speed up to 350 MHz, and
the second was a 16-oscillator chip in a 0.35 µm single-poly CMOS process at clock speeds of
1.2-1.4 GHz. The two chips are described in turn below.
4.1 4 Oscillator Chip
The 4 oscillator chip was done as a proof of concept to show correct phase locking in
the simplest system that could possibly be vulnerable to modelock; a plot is shown
in Fig. 4-1. It consists of four nodes (each with an oscillator and loop filter) and
five phase detectors (one between each pair of neighbors, and one connected to an
external input). High-speed probes contact chip pads at the edges of the chip. One
probe drives the input, and the other three are connected to outputs of the oscillators.
(The probes are too large to connect more than one probe on a single chip side, so
all four oscillators could not be measured at the same time.)
4.1.1 Oscillator
The primary metric in the design of oscillators for clock generation is jitter, and
the majority of that is due to power supply noise [64, 65]. Integrated LC oscillators
often have a lower noise floor than other on-chip oscillators, but substrate and supply
noise are dominant on a large digital chip. Ring-type or relaxation oscillators are
usually preferred for on-chip clocks because large chips are usually sorted into different
categories based on measured achievable clock speed, and LC oscillators are more
difficult to tune. For this chip, a differential relaxation oscillator was chosen because
Hspice simulations showed that this relaxation oscillator had better power-supply
rejection than did ring oscillators. The relaxation current-controlled oscillator, or
"CCO," is shown in Fig. 4-2. Transistors M3, M4, M5, and M6, along with capacitor
C, make up a conventional source-coupled multivibrator, with M7 and M8 as active
loads and nbias controlling oscillation frequency through I_d3,4. The drawback is that
the conventional circuit has a feedthrough of -6dB to nodes V+ and V- from VDD, and almost
0dB to the capacitor from ground via C_bs of M3,4, so supply noise rejection is poor.
In the proposed oscillator, M1 and M2 provide shunt-shunt feedback around M3 and
M4 respectively, lowering the output impedance at V+ and V- to 1/g_m. D1 and D2
limit the amplitude of oscillation to avoid saturation of M3 and M4. Frequency can
be adjusted by adding common-mode current into nodes V+ and V-.
Oscillator layout is shown in Fig. 4-3. Layout for both halves of the oscillator
is identical, and the halves are immediately adjacent. Good matching between the
halves corresponds to a 50% duty cycle. Furthermore, all source/drain regions were
shared to minimize layout area and parasitic capacitance.
4.1.2 Phase Detector
As discussed previously, modelock can be avoided in regular arrays by using nonlinear
phase detectors whose response decreases monotonically beyond a phase difference of
π/2 [40]. The phase detector Pratt and Nguyen suggest (a flip-flop delay and an XOR
gate) is not well-suited for integrated PLLs, however. First, it has relatively low gain,
Figure 4-3: Relaxation oscillator layout
so mismatch can lead to large input-referred phase offsets. Second, it generates full-
swing digital signals at half the clock frequency; this digital noise must be attenuated
in the loop filter.
Figure 4-2: Relaxation oscillator schematic
The phase detector proposed here, shown in Fig. 4-4, has the right nonlinearity,
higher gain at small Δφ, and much less high-frequency content than
an XOR. The noise that is generated is at the clock frequency, and is attenuated
an extra 6dB given the same first-order loop filter. (Only half of the circuit is
drawn. The other half is the symmetrical counterpart, with clock1 and clock2
switched.) M1, M2, and M3 comprise an arbiter. The voltage at node A is
buffered, sampled, and converted to a current, so that multiple inputs can be
summed at each oscillator node. Synchronous sampling of the arbiter output
by M6 and M7 demodulates it, removing high frequency content. Timing waveforms
are shown in Fig. 4-5. The phase of the sampling instant affects the transfer
function, shown in Fig. 4-6. Node A is the output of the arbiter. When clock1 and
clock2 are nearly in phase, as is the case at sample periods 1 and 2, A is sampled while
its value is still valid, so the output Y goes from 0 to 1 over the width of the arbitration
window. Hence, the phase detector has a high gain near 0 phase difference. As the
phase difference increases, sampling instant timing becomes relevant. A is sampled
at a fixed delay from the rising edge of clock1. If clock2 falls before A is sampled, the
output Y will also fall, as shown for periods 3 and 4. Therefore, θ_c, the phase angle
at which the output transfer function starts to fall, depends on the relative timing
Figure 4-4: Phase detector schematic
of the falling edge of clock2 and the sample delay. If θ_s is the phase of the sampling
instant and θ_f the phase of the falling edge, θ_c = θ_f - θ_s, so the characteristic angle
could be adjusted easily simply by setting the delay through I1 ... I9. With θ_s ≈ π/2
and a 50% duty cycle (i.e., θ_f = π), θ_c would be π/2, which is the constraint to avoid
modelock. Were smaller θ_c needed to accommodate a different network structure, the
same circuit could be used with a different θ_s. Adding the output from the unshown
half of the circuit gives the other half of the phase response, shown in Fig. 4-7. The
full circuit fits in 80 µm x 40 µm.
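The relation between sampling phase, duty cycle, and characteristic angle is simple enough to tabulate. The sketch below restates θ_c = θ_f - θ_s and inverts it to find the sampling phase that places θ_c exactly at the 2π/n bound of Theorem 1. The function names and the 6-node example are hypothetical, and the falling edge is assumed to sit at 2π times the duty cycle.

```python
import math


def critical_angle(theta_s, duty_cycle=0.5):
    """theta_c = theta_f - theta_s, with the falling edge at theta_f = 2*pi*duty_cycle."""
    theta_f = 2.0 * math.pi * duty_cycle
    return theta_f - theta_s


def sampling_phase_for_loop(n_largest_minimal_loop, duty_cycle=0.5):
    """Sampling phase that places theta_c at the 2*pi/n bound of Theorem 1."""
    theta_c = 2.0 * math.pi / n_largest_minimal_loop
    return 2.0 * math.pi * duty_cycle - theta_c


theta_c_rect = critical_angle(math.pi / 2)     # case worked in the text: pi/2
theta_s_rect = sampling_phase_for_loop(4)      # rectangular mesh: pi/2
theta_s_hex = sampling_phase_for_loop(6)       # hypothetical 6-node minimal loops
```

A later sampling instant gives a smaller θ_c, so a network with larger minimal loops would need a longer delay through the inverter chain under this model.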
4.1.3 Loop Filter
One loop filter is associated with each CCO. Conventional loop filters use a charge
pump with an RC pole-zero pair, and often put the large capacitor and resistor off
Figure 4-5: Phase detector timing waveforms (signals: Clock1, Clock2, A, Sample, Y; sample periods 1 through 5)
Figure 4-6: Sampled phase detector half-circuit transfer function (Iout versus phase)
chip. To avoid inconveniently large resistors and capacitors, a feed-forward compensation
method was used. The loop filter of Fig. 4-8 consists of two differential amplifiers.
(Note that because the frequency control to the oscillator consists of two currents,
both amplifiers have twin outputs.) M3, M4, M5, and M6 make up amplifier A1,
biased by M9, while M1, M2, M7, M8, M11, and M12 make up A2, biased by M10.
The differential output currents from the phase comparators at the edges of each tile
are summed at nodes I_in+ and I_in- and drive both amplifiers. A1 is a single stage
differential pair, so it has relatively low gain but a bandwidth limited by g_m3,4/C_s3,4,
since nodes Iout1 and Iout2 drive a low impedance. A2 has two stages, much like a
prototypical op-amp. The first is biased at very low current to give high gain at DC
and allow the use of a relatively small compensation capacitor, and the second provides
the needed gain and isolates the high impedance pole from the output. In this
Figure 4-7: Sampled phase detector full transfer function (Iout versus phase)
amplifier, the DC gain was simulated at 31dB with a 16kHz pole, a compensating zero
at 7.6MHz, and a high frequency pole well above the PLL target frequency. The use
of feed-forward compensation allowed the use of very small capacitors; the loop filter,
including the poly-poly capacitor, and the CCO with its output buffers together take
up 88 µm x 88 µm.
Figure 4-8: Loop filter schematic
4.2 16 Oscillator Chip
The 16 oscillator chip was a second generation chip with a number of improvements
over the 4 oscillator first generation. First, a larger network provides a more thorough
test of modelock-resistance, because modelock is more likely from initial startup than
in smaller networks. Second, a newer and faster fabrication process, 0.35 µm, was used,
to test the ideas at clock speeds more appropriate for modern microprocessors. Third,
key circuits were redesigned: the oscillator is a ring oscillator instead of a relaxation
oscillator, and no longer requires two levels of polysilicon; the phase detector now
uses a much simpler arbiter-based design that gives phase and frequency feedback as
appropriate.
4.2.1 Oscillator
The second chip used an NMOS-loaded differential ring oscillator as a voltage-controlled
oscillator (VCO) (Fig. 4-10), primarily because only one layer of polysilicon
was used, and diodes were disallowed in an effort to make the circuits more amenable
to implementation in a standard microprocessor process. Transistors M4-M8 make up the
differential inverter. The differential pair is M5,8, the tail current is driven by M6,
and M4,7 act as the NMOS load. The NMOS loads allow fast oscillation and shield
the output signal from VDD noise. Vbias is a low-pass-filtered version of VDD generated by
subthreshold leakage through PFET M1; supply noise coupling in through Cgd of M4,7
is bypassed by M2. The oscillation frequency depends on the supply voltage only
through capacitor nonlinearity and the output conductance of M4,7, and feedback of
the PLL compensates for drift of VDD and Vbias.
4.2.2 Phase Detector
Just like the phase detector for the 4-oscillator chip, the second generation phase
detector, shown in Fig. 4-11, has sufficient nonlinearity, higher gain at small input
phase difference, and less high-frequency content than an XOR phase detector.
Compared to Fig. 4-4, however, it is somewhat simpler in implementation, and has
Figure 4-10: Ring oscillator schematic
a smaller transistor count. It also has less delay from the clock inputs to the phase
detector outputs, which is important because the phase detector time constant helps
set the PLL feedback poles.
The core (M1-M6) is an NMOS-loaded arbiter which acts as a nonlinear phase
detector. For no input phase difference, the output is balanced. As the phase differ-
ence increases from zero, one output will be asserted for the full duration of an input
pulse, while the other output will be asserted for only the remainder of the input pulse
duration after the first input pulse ends, which is equal to the input phase difference.
Thus the detector has very high gain near zero phase error that drops off to zero as
the input phase difference approaches the input pulse width (Fig. 4-12).
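The transfer characteristic described above can be sketched as a simple piecewise function. This is an idealized model read off the text, not the simulated curve: it assumes the useful output is the difference in asserted time between the two arbiter outputs, normalizes nothing, and assumes zero output once the phase difference reaches the pulse width. The function name is hypothetical.

```python
import math

def arbiter_pd_output(phase_diff, pulse_width):
    """Idealized difference in asserted time between the two arbiter outputs.

    The leading input holds its output for the full pulse width, while the
    lagging input is asserted only for a time equal to the phase difference.
    """
    if phase_diff == 0 or abs(phase_diff) >= pulse_width:
        return 0.0  # balanced at zero; assumed zero beyond the pulse width
    magnitude = pulse_width - abs(phase_diff)
    return math.copysign(magnitude, phase_diff)

# Sweep the phase difference for a pulse width of 1.0 (arbitrary time units).
for d in (-1.0, -0.5, -0.01, 0.0, 0.01, 0.5, 1.0):
    print(f"{d:+.2f} -> {arbiter_pd_output(d, 1.0):+.2f}")
```

The jump from nearly -1 to nearly +1 around zero is the very high small-signal gain, and the linear decay toward the pulse width is the drop-off in the error signal.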
The pulse generators P1 and P2 enable this arbiter to give frequency error feedback.
If one input is at a higher frequency than the other, its output will be asserted for
more input pulses than the other. Because the width of the pulses is independent
of input frequency, the average output voltage corresponds to frequency. Unlike a
typical phase-frequency detector, however, the strength of the error signal falls to
zero as frequency difference goes to 0, so there can be no modelock problems, yet
large signal frequency- (and hence, phase-) locking is enhanced. Fig. 4-13 shows
the large signal correction and small signal behavior of the entire array of PLLs as
Figure 4-11: Phase detector
the already internally-locked array approaches and locks to the reference clock. The
detector fits in 30µm x 30µm.
4.2.3 Loop Filter
This loop filter, Fig. 4-14, is conceptually identical to the previous loop filter, Fig. 4-
8, though for biasing reasons, the wide bandwidth amplifier now has p-inputs and a
current mirror, and the high gain amplifier loads are cascoded.
M1-M5 make up amplifier A1, while M9-M17 make up A2. The differential
output currents from the phase detectors at the edges of each tile are summed at
nodes In+ and In-, and drive both amplifiers. A1 is a single-stage differential pair,
so it has relatively low gain but a bandwidth limited by gm/Cgs. A2 has a high gain
cascoded stage driving a common source PFET M17. M16 is a large gate capacitor
which serves to set the dominant pole of A2 such that the PLL network is stable. M15
is biased at very low current to boost gain and enable a low pole frequency (as low
Figure 4-12: Simulated phase transfer curve (horizontal axis: time difference, nanoseconds)
Figure 4-13: Locking behavior of the PLL array (horizontal axis: simulation time, microseconds; annotations: large signal regime, small signal regime, reference clock)
Figure 4-14: Loop filter schematic
as 12kHz) with a 15µm x 15µm gate capacitor. The simple design and feed-forward
compensation allow the loop filter to fit in only 15µm x 45µm. Each clock node,
consisting of an oscillator and a loop filter, takes just 45µm x 45µm.
Chapter 5
On-Chip Measurement of Clock Performance
While increasing resources are devoted to implementing low skew and low jitter clocks
in modern microprocessors, there are few ways to measure jitter. Skew can be mea-
sured by such off-chip methods as e-beam [66] and photonic emission [67, 68], but
because both average thousands of edges, neither method is suitable for resolving
cycle-by-cycle clock jitter. A method to measure clock jitter was developed in this
thesis. A proof-of-concept test chip showed that excellent measurement performance
is possible, and this chapter describes the theory and results from that chip.
5.1 Introduction and Motivation
On-chip measurement necessarily requires tricks. Acceptable clock skew is generally
around 10% of a clock cycle, and a microprocessor clock period is typically 8-12 gate
delays. Hence, the measurement necessarily requires timing resolution smaller than a
single gate delay. Time-to-voltage converters work by integrating a current onto a
capacitor, as in Fig. 5-1 [69, 70, 71].
Figure 5-1: Time to voltage converter operation
Figure 5-2: Phase vernier
The capacitor starts with zero voltage; at the beginning of the interval to be measured,
switch S1 closes, and the capacitor charges for the duration of the interval. Then S1
opens, the voltage is amplified, converted to a digital value and output, and then S2
closes to reset the capacitor. Such converters may have high dynamic range but do
not have enough resolution for clock jitter measurement, essentially because the time
of interest is comparable to the time it takes to open and close switch S1.
Another approach is to sample the signal of interest into registers which are clocked
by closely-spaced sampling phases, as shown in Fig. 5-2. The interpolator takes in
several uniformly-placed phases and generates a larger number of phases with closer
spacing. The newly generated phases are used to clock a string of registers, marked
R[i] in the figure. The timing of a transition on SigIn can be deduced to within
the spacing of the sampling phases. Effectively, the registers compare the transition
instant of the input signal SigIn to a set of fixed times, just as a flash analog-to-digital
converter (ADC) compares an input voltage to a set of voltage thresholds. Because
of the similarity, it is useful to think of this architecture as a flash time-to-digital
converter, or TDC. Because the comparison thresholds are clock phases, this will be
called a sampling phase time-to-digital converter, or SPTDC. Either a delay-locked
loop with phase interpolation (as shown) or an array oscillator can be used to generate
sampling phases with time differences smaller than a single gate delay [72, 73, 74, 75].
However, mismatches between the oscillators in the array or delays in a DLL can be
significant, giving as much as a gate delay offset before calibration [72].
The approach presented here is also a flash TDC, but rather than creating the
time vernier by generating closely-spaced clocks, the vernier arises from input-referred
offset on the samplers. Hence, the proposed converter will be called a sampling offset
time-to-digital converter, or SOTDC. The advantage is that instead of needing to
generate precise clocks, it is necessary only to create some sampling elements and
measure their relative positions. As will be demonstrated, measurement can be much
more precise than any calibration is likely to be. The SOTDC was developed to
measure jitter between clock domains, but it works to measure the timing of any
signal relative to a reference.
5.2 Time-to-Digital Converter Fundamentals
Calibration and operation of the SOTDC depend critically on the operation of the
sampling elements. (In Fig. 5-2, the sampling elements were registers, but they were
acting as arbiters.) An arbiter is a circuit that determines which of two inputs arrived
first. Because only the time difference between rising edges of the two inputs affects
the output, it is conventional to think of the arbiter as having a single input, where
that input is a time interval t between two incoming edges, as shown in Fig. 5-3(a).
Given enough time, the output of an arbiter settles to either a logic '1' or '0', indicating
whether the first or second input arrived first. Unfortunately, device mismatch gives
arbiters an effective time offset, $t_{os}$. Also, because of thermal noise, the output, y,
is not deterministic: y(t) = 1 if and only if $t > t_{os} + t_n$, where $t_n$ is white noise with
standard deviation $\sigma$ [76, 77]. Therefore, the probability that the output y is a '1' is
Figure 5-3: Arbiter definitions. (a) Arbiter input definition. (b) Probability that arbiter output is a 1, plotted against $t/\sigma$.
Figure 5-4: TDC structure. "D" marks delay elements, and "A" the arbiters.
given by the Gaussian cumulative distribution function
$$ P(y = 1) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{t - t_{os}}{\sqrt{2}\,\sigma}\right)\right] \tag{5.1} $$
which is plotted in Fig. 5-3(b). The strong sensitivity of y to t near $t = t_{os}$ makes the
arbiter useful for precise time measurement.
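The arbiter model can be evaluated numerically. The sketch below assumes Eq. 5.1 is the Gaussian cumulative distribution with mean equal to the arbiter offset and standard deviation equal to the noise $\sigma$; the function name and the numeric values (in picoseconds) are illustrative.

```python
import math

def prob_output_one(t, t_os, sigma):
    """P(y = 1) for an arbiter with offset t_os and Gaussian noise sigma."""
    return 0.5 * (1.0 + math.erf((t - t_os) / (math.sqrt(2.0) * sigma)))

# Hypothetical arbiter: 18 ps offset, 0.35 ps rms noise.
t_os, sigma = 18.0, 0.35
for t in (17.0, 17.65, 18.0, 18.35, 19.0):
    print(f"t = {t:6.2f} ps  P(y=1) = {prob_output_one(t, t_os, sigma):.4f}")
```

The output probability moves from nearly 0 to nearly 1 within a few $\sigma$ of the offset, which is the sensitivity the text relies on.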
Fig. 5-4 shows the simplified theory of operation of a flash TDC (cf. a flash ADC).
In any flash converter, the input is compared to a set of thresholds; call the thresholds
x. In a TDC, x is the set of offset times to which the input time t is compared. In
a SPTDC, each threshold $x_i$ is composed of a vernier delay D and an arbiter offset
$t_{os}$. Variation of $t_{os}$ is significant: the standard deviation of $t_{os}$, $\sigma_t$, is about 18ps in
0.35µm CMOS. Fig. 5-5(a) shows a plot of ideal x for an 8-level converter; Fig. 5-5(b)
shows the actual positions of the x with normally distributed $t_{os}$. Because $\sigma_t$
is large, errors in the x are significant. However, the random spread of $t_{os}$ suggests
another approach to generating the x: eliminate the vernier delay entirely, and let
$x_i = t_{os,i}$. Fig. 5-5(c) shows typical x for such a converter.
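The flash conversion itself is easy to sketch: every arbiter compares the same input time against its own threshold, and the number of asserted outputs locates the input among the thresholds. The code below is a minimal illustration; the function names and the threshold values (in picoseconds) are hypothetical, and the noise model is the Gaussian one assumed for the arbiters.

```python
import random

def sample_arbiters(t, thresholds, sigma=0.0, rng=None):
    """Return one bit per arbiter: 1 if t exceeds that arbiter's noisy threshold."""
    rng = rng if rng is not None else random
    return [1 if t > x + rng.gauss(0.0, sigma) else 0 for x in thresholds]

def count_ones(bits):
    """Number of thresholds below the input; works for unsorted thresholds too."""
    return sum(bits)

# Ideal vernier thresholds (x_i proportional to i), as in an ideal SPTDC.
ideal_sptdc = [10.0 * i for i in range(-4, 4)]
# Hypothetical random arbiter offsets, as in an SOTDC (no vernier delay).
sotdc = [-21.0, 3.5, -7.2, 30.1, 12.4, -1.0, 17.8, -33.6]

print(sample_arbiters(5.0, ideal_sptdc), count_ones(sample_arbiters(5.0, ideal_sptdc)))
print(sample_arbiters(5.0, sotdc), count_ones(sample_arbiters(5.0, sotdc)))
```

With random offsets the bits are not in thermometer order by arbiter index, so the decode step only makes sense once the thresholds have been measured and sorted.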
5.3 SOTDC Yield
The random placement of xi in an SOTDC means that measurement precision varies
from chip to chip. Finding a formula for the expected yield given a desired precision
over a fixed range is surprisingly difficult. The problem is quite amenable to Monte
Carlo simulation, however. A simulated plot of expected yield vs. precision is shown
in Fig. 5-6.
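A Monte Carlo estimate of this kind might look like the sketch below. The pass criterion is an assumption made for illustration, not the definition used for Fig. 5-6: a chip passes if every gap between adjacent thresholds bracketing the range from $-\sigma_t/2$ to $+\sigma_t/2$ is no larger than the desired precision. Function names, the arbiter count, and the number of trial chips are likewise hypothetical.

```python
import random

def chip_meets_precision(thresholds, lo, hi, precision):
    """True if adjacent thresholds bracketing [lo, hi] are never farther apart than precision."""
    xs = sorted(thresholds)
    below = [x for x in xs if x <= lo]
    above = [x for x in xs if x >= hi]
    if not below or not above:
        return False  # range is not bracketed at one end
    covering = [below[-1]] + [x for x in xs if lo < x < hi] + [above[0]]
    return all(b - a <= precision for a, b in zip(covering, covering[1:]))

def estimate_yield(n_arbiters, sigma_t, precision, n_chips=2000, seed=0):
    """Fraction of simulated chips meeting the precision over a range of one sigma_t."""
    rng = random.Random(seed)
    lo, hi = -0.5 * sigma_t, 0.5 * sigma_t
    passed = 0
    for _ in range(n_chips):
        thresholds = [rng.gauss(0.0, sigma_t) for _ in range(n_arbiters)]
        if chip_meets_precision(thresholds, lo, hi, precision):
            passed += 1
    return passed / n_chips

# Yield vs. precision for 64 arbiters with 18 ps offset spread.
for p in (1.0, 2.0, 3.0, 4.0, 6.0):
    print(f"precision {p:.0f} ps: yield {estimate_yield(64, 18.0, p):.3f}")
```

Because the same seed is reused for every precision, the same simulated chips are tested each time and the estimated yield can only rise as the precision requirement is relaxed.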
5.4 Calibration of a SOTDC
Of course, a vernier-less, or sampling offset TDC is useless if it cannot be calibrated:
the outputs of the arbiters give information about the input signal in terms of the xi;
if the xi are unknown, the arbiter outputs are useless. Fortunately, it is possible to
find x empirically.
A TDC could be calibrated directly by connecting two signals with precisely-
known t and measuring resulting outputs for t over the range of interest. Fitting the
probabilities of an output '1' vs. t for each arbiter via Eq. 5.1 gives the effective x.
Unfortunately, input jitter adds linearly to the apparent measurement noise in this
case. In cases where it is impossible or inconvenient to input known signals, it is also
possible to calibrate a flash TDC indirectly with uncorrelated signals.
For uniformly distributed t, the probability that t is measured between two sampling
thresholds, $P(x_i + t_n > t > x_j + t_n) \triangleq P_{ij}(01)$, is proportional to $x_i - x_j \triangleq \Delta_{ij}$ for
Figure 5-5: x(i) vs. i. (a) Ideal, $x_i \propto i$. (b) $x_i \propto i + t_{os}$, 18ps std. dev. (c) $x_i = t_{os}$, 18ps std. dev.
Figure 5-6: Expected yield of an SOTDC, for a fixed precision over a range of one standard deviation (horizontal axis: precision, ps).
a single event, as long as the difference is much larger than sampling noise, $\Delta_{ij} \gg t_n$.
For example, if the two input signals are constant-frequency square waves, measurements
with bit i low and bit j high will occur with a frequency of $\Delta_{ij} f_1 f_2$, where $f_1$
and $f_2$ are the frequencies of the two input signals. While x can be fully deduced
from such measurements, the resolution is poor for $\Delta_{ij} \approx t_n$.
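The first indirect method can be illustrated with a small simulation: draw input times uniformly over a known window, count how often bit i is low while bit j is high, and scale the fraction by the window width. This is a sketch under stated assumptions (uniform t over a window that contains both thresholds, Gaussian arbiter noise much smaller than the spacing); the function name and all numeric values are hypothetical.

```python
import random

def estimate_spacing(x_i, x_j, t_min, t_max, n_events=200_000, sigma=0.0, seed=1):
    """Estimate x_i - x_j from the fraction of events with bit i = 0 and bit j = 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_events):
        t = rng.uniform(t_min, t_max)
        y_i = t > x_i + rng.gauss(0.0, sigma)   # arbiter i output
        y_j = t > x_j + rng.gauss(0.0, sigma)   # arbiter j output
        if (not y_i) and y_j:
            hits += 1
    # For uniform t the hit probability is (x_i - x_j) / (t_max - t_min).
    return (hits / n_events) * (t_max - t_min)

# Thresholds 15 ps apart, inputs uniform over a 100 ps window.
print(estimate_spacing(10.0, -5.0, -50.0, 50.0))
```

The estimate is limited by counting statistics rather than by input jitter, which is why the method needs no precisely timed inputs, but it degrades once the spacing is comparable to the arbiter noise.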
A second indirect calibration method resolves small $\Delta_{ij}$ in terms of $\sigma$. When $\Delta_{ij}$
is comparable to $t_n$ there will sometimes be a "bubble" in the output codeword;
that is, it will appear that $x_j + t_n > t > x_i + t_n$ even though $x_i > x_j$. The ratio
$r = P_{ij}(10)/P_{ij}(01)$ should depend only on the ratio $\Delta_{ij}/\sigma$, and in fact, it does.
Consider two arbiters with $t_1 = x_1 + t_{n1}$ and $t_2 = x_2 + t_{n2}$. $t_1$ and $t_2$ are the
instantaneous switching thresholds of the arbiters, so
$$ P(y_1 = 1) = P(t > t_1) \tag{5.2} $$
$$ P(y_2 = 0) = P(t < t_2) \tag{5.3} $$
$$ P(y_1 = 1, y_2 = 0) \triangleq P_{12}(10) = P(t_1 < t < t_2) \tag{5.4} $$
$$ P_{12}(10) = P(t_1 < t_2) \cdot P(t_1 < t < t_2 \mid t_1 < t_2) \tag{5.5} $$
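Let $x = t_2 - t_1$. Then x is Gaussian with mean $x_2 - x_1 = \Delta$ and standard deviation
$\sqrt{2}\,\sigma$. For uniformly distributed t, $P(t_1 < t < t_2 \mid t_1 < t_2) \propto t_2 - t_1$. Substituting into
Eq. 5.5,
$$ P_{12}(10) \propto E[x \mid x > 0] \cdot P(x > 0) \tag{5.6} $$
$$ \propto \int_0^{\infty} x \, e^{-(x - \Delta)^2 / (4\sigma^2)} \, dx \tag{5.7} $$
$$ \propto \frac{\sigma}{\sqrt{\pi}} \, e^{-\Delta^2 / (4\sigma^2)} + \frac{\Delta}{2} \left[ 1 + \operatorname{erf}\left( \frac{\Delta}{2\sigma} \right) \right] \tag{5.8} $$
By symmetry, $P_{12}(01)\big|_{\Delta} = P_{12}(10)\big|_{-\Delta}$. Defining $\delta = \Delta / (2\sigma)$ and $\operatorname{erfcx}(x) = e^{x^2} \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2} \, dt$
gives
$$ r(\delta) = \frac{P_{12}(10)}{P_{12}(01)} = \frac{1 + \sqrt{\pi}\,\delta\,\operatorname{erfcx}(-\delta)}{1 - \sqrt{\pi}\,\delta\,\operatorname{erfcx}(\delta)} \tag{5.9} $$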
In this way an array of arbiters can be calibrated to much higher precision than their
manufacturing tolerances without the use of precise input clocks.
Thus, by measuring r and inverting Eq. 5.9, one can find relative spacings of x
in terms of $\sigma$. Combined with either of the previous two calibration methods, this
measurement gives $\sigma$ and precise measurements of x. Note that both indirect
methods are completely insensitive to input jitter.
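Inverting Eq. 5.9 has no closed form, but because r increases monotonically with the normalized spacing it can be inverted by bisection. The sketch below assumes Eq. 5.9 reads r(δ) = [1 + √π δ erfcx(−δ)] / [1 − √π δ erfcx(δ)] with δ = Δ/(2σ) and erfcx(x) = exp(x²) erfc(x); the function names and the search interval are illustrative choices, and the naive erfcx used here is adequate only for moderate |δ|.

```python
import math

def erfcx(x):
    """Scaled complementary error function, exp(x^2) * erfc(x) (naive form)."""
    return math.exp(x * x) * math.erfc(x)

def bubble_ratio(delta):
    """r(delta) = P12(10) / P12(01) under the assumed reading of Eq. 5.9."""
    root_pi = math.sqrt(math.pi)
    numerator = 1.0 + root_pi * delta * erfcx(-delta)
    denominator = 1.0 - root_pi * delta * erfcx(delta)
    return numerator / denominator

def invert_ratio(r, lo=-4.0, hi=4.0, iterations=80):
    """Find delta with bubble_ratio(delta) = r by bisection on [lo, hi]."""
    if not bubble_ratio(lo) <= r <= bubble_ratio(hi):
        raise ValueError("ratio outside the searchable range")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if bubble_ratio(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: a spacing of 0.7 in units of 2*sigma.
r = bubble_ratio(0.7)
print(r, invert_ratio(r))
```

Multiplying the recovered δ by 2σ gives the threshold spacing in time units once σ is known from one of the other calibration methods.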
5.5 Circuit and Results
The SOTDC circuit consists of a set of nominally identical arbiters and output cir-
cuitry to transfer the bits off-chip. The implemented symmetric CMOS arbiter is
shown in Fig. 5-7. The outputs are precharged when In1 and In2 are low (for clock
systems where jitter is meaningful, there will be substantial overlap between the low
phases of the inputs). The first edge that arrives pulls down the corresponding out-
put, and the positive feedback guarantees that eventually a valid logic value can be
latched from the output. For the test chip, 64 such arbiters were connected in parallel
Figure 5-7: Symmetric CMOS arbiter
to two test inputs, and their outputs individually recorded.
Fig. 5-8 shows x for one test chip measured directly. As expected, process vari-
ations distribute the x over a range of approximately 50 picoseconds. A plot of x
calculated by numerically inverting Eq. 5.9 for measured data vs. x measured directly
is shown in Fig. 5-9. The fit is perfect to within the tolerances of the measurement
equipment; clearly, calibration by random signals is viable. Best fit $\sigma$ is 0.35 picoseconds,
which corresponds to an arbiter aperture of ~1ps, consistent with a previously
reported simulated value of 10ps in a 3µm CMOS process. Nonuniform spacing of the
arbiter thresholds limits resolution of this TDC to 2ps over the range [-15ps,15ps].
The goal of this part of the thesis was to measure jitter in the 16 oscillator chip
described in Chapter 4. A set of arbiters was connected between the clocks of neigh-
boring tiles, and a 128-word DRAM recorded arbiter results. Unfortunately, the
DRAM timing was marginal on that test chip, so direct measurements were unavail-
able.
Figure 5-8: Measured $x_i$, with expected curve for 18ps standard deviation of $t_{os}$ (horizontal axis: threshold x(i), picoseconds).
Figure 5-9: Measured $x_i$ vs. $x_i$ derived via Eq. 5.9, for $\sigma$ = 0.35ps (horizontal axis: directly-measured x(i)).
Chapter 6
Conclusions
6.1 Summary and Contributions
A great deal of work has been done previously on clocks in integrated circuits. As the
ratio of clock period to wire-delay across a chip decreases, more and more attention
is being devoted to clocking. An attempt was made in this thesis to look forward, to
predict the clocks necessary in the near future to continue the trend of faster devices
and faster clocks.
One contribution of this thesis has been the analysis of clock networks in terms
of performance given parameter variations and noise. Although much of the focus
has been on the contrast between different clock networks, the conclusion is that
the different architectures do not replace but rather complement each other. Over
a single tile where signal propagation delay is small compared to the clock period
and all points must be synchronized, tree distribution is effective. For relatively long
distances on a chip, clock regeneration becomes useful to filter out high frequency
noise on the distribution wire. A multiple-oscillator peer network also avoids the
problem of having different paths to nearest neighbors that plagues trees. Gridded
distribution, or more generally shorting together spatially separated buffers, greatly
reduces skew and jitter between tiles as long as the initial offsets are small.
Another contribution is the analysis and implementation of a clock network that
uses distributed generation. Theory about mode-locking was extended to account for
non-orthogonal networks. Inter-oscillator coupling was treated in the context of a
single multivariable system which exposes all possible interactions. The phase detec-
tor and oscillator were modified from standard versions to satisfy the requirements
needed for a distributed clock. Although the details will likely be changed (short-
ing together the tiles and finding another way to measure phase differences between
clocks is an obvious improvement) the main strength of this architecture is that the
clock traverses the same path, peer-to-peer, as does the data. Because the clock can
be measured and corrected over multiple cycles, however, it appears that clock skew
can always be corrected to a fraction of the uncertainty in data delay. In other words,
it should always be possible to distribute a clock using the same technology as is used
for long-distance interconnect.
Verification of clock design will likely become more important as a way to con-
firm predictions about clock performance. The proposed and tested sampling offset
time to digital converter appears to be well-suited to this task, with resolution of a
small fraction of a single gate delay. Because of its extreme hardware simplicity and
generality, the SOTDC may find its way onto many chips as a simple debugging tool.
6.2 Future Work
This thesis was dominated by analysis and implementation of the distributed clock
network, and of how that network compares with conventional clock networks. This
leaves a two-fold opening for future work: more accurate testing and comparison to
conventional clock networks, and the development of clock architectures that are as yet
impractical.
6.2.1 Testing and measurement
The focus of the design and testing of the multiple-oscillator array was on initial
locking and stability. Testing received substantially less attention. Another version of
that chip with a more robust DRAM (so that precise timing data could be obtained),
and controllable, on-chip noise generators (i.e., large transistors between power and
ground) would help calibrate the noise models.
On a similar topic, distributed PLLs make low-speed functional testing difficult.
For distributed clock generation to move to production, stability of the network at
low speeds should be addressed. It's trivial to add a controllable divider for each node
oscillator; however, the extra delay will certainly make the network unstable unless
other changes are made.
6.2.2 Unconventional Clocks
Grids and clock trees have found widespread use in industry already. A number of
other clocking strategies have been proposed that may either find use in niche appli-
cations, or perhaps someday take over as the dominant clock method if technology
evolves to make them more attractive.
Salphasic
Salphasic clocking is conceptually related to equipotential clocking. If the wires are
lossless but the transmission line delay is causing clock skew, it is possible to set up
standing waves in the clock network. Because these standing waves are perfectly syn-
chronous with the signal at the driver, a clock can be distributed over long distances
with no skew. Of course, this depends on having lossless transmission lines for clock
distribution; this constraint can be approximated closely in systems on the scale of
several meters with clocks in the tens of megahertz [36]. On chip, however, resistance
in the wires has made salphasic clocking untenable.
Resonant Clocks
Resonant clocking is a similar approach, intended for a different purpose. A standing
wave is set up in a transmission line with a period equal to the desired period of a
clock. With care, a transmission line can be tuned to resonate a fundamental and
several odd harmonics in phase, despite the capacitive load and small resistive losses
in the wire so that a true square wave appears at the load [37]. A resonant clock in
a low-loss transmission line dissipates a fraction of the $CV^2 f$ power that traditional
clock networks do. The technique is relatively new, and has not been proven to be
practical at high speeds.
Optical Clocking
Because the propagation speed of optical signals is easily controlled, optical clocks
have been suggested as a way to equalize path delay and thus minimize clock skew [38,
39]. Optical signals, transmitted either in a tree, as in the first citation, or in free space
as in the second, also have the advantage that they do not interfere with each other,
and are immune to electrical or magnetic coupling. Unfortunately, the conversion
from optical signals to electrical is a significant stumbling block. First, detectors for optical
signals are not silicon, and hence require a substantial fabrication process change.
Second, the conversion is often relatively slow and error prone because the detected
currents are small. No optical clock has been demonstrated for VLSI, although optical
clocks may become practical in the future.
Bibliography
[1] Neil H. E. Weste and Kamran Eshraghian. Principles of CMOS VLSI design.
Addison Wesley, 2nd edition, 1990.
[2] Daniel W. Bailey and Bradley J. Benschneider. Clocking design and analysis for
a 600 MHz Alpha microprocessor. Journal of Solid State Circuits, 33(11):1627-
1633, November 1998.
[3] Stephen H. Unger and Chung-Jen Tan. Clocking schemes for high-speed digital
systems. IEEE Transactions on Computers, C-35(10):880-895, October 1986.
[4] Arthur F. Champernowne et al. Latch-to-latch timing rules. IEEE Transactions
on Computers, 39(6):798-808, June 1990.
[5] E. G. Friedman. The applications of localized clock distribution design to im-
proving the performance of retimed sequential circuits. In Proceedings of the
IEEE Asia-Pacific Conference on Circuits and Systems, pages 12-17, December
1992.
[6] Karem A. Sakallah et al. Synchronization of pipelines. IEEE Transactions on
Computer-Aided Design, 12(8):1132-1146, August 1993.
[7] Jose Luis Neves and Eby G. Friedman. Topological design of clock distribution
networks based on non-zero clock skew specifications. In Proceedings of the 36th
Midwest Symposium on Circuits and Systems, pages 468-471, August 1993.
[8] Narendra V. Shenoy, Robert K. Brayton, and Alberto L. Sangiovanni-Vincentelli.
Resynthesis of multi-phase pipelines. In Proceedings of the ACM/IEEE Design
Automation Conference, pages 490-496, June 1993.
[9] C. Thomas Gray et al. Timing constraints for wave-pipelined systems. IEEE
Transactions on Computer-Aided Design, 13(8):987-1004, August 1994.
[10] Michel R. Dagenais and Nicholas C. Rumin. On the calculation of optimal
clocking parameters in synchronous circuits with level sensitive latches. IEEE
Transactions on Computer-Aided Design, 8(3):268-278, March 1989.
[11] Karem A. Sakallah, Trevor N. Mudge, and Oyekunle A. Olukotun. Analysis and
design of latch-controlled synchronous digital circuits. IEEE Transactions on
Computer-Aided Design, 11(3):322-333, March 1992.
[12] Tolga Soyata and Eby G. Friedman. Retiming with non-zero clock skew, vari-
able register, and interconnect delay. In Proceedings of the IEEE International
Conference on Computer-Aided Design, pages 234-241, November 1994.
[13] François Anceau. A synchronous approach for clocking VLSI systems. Journal
of Solid State Circuits, SC-17(1):51-56, February 1982.
[14] H. B. Bakoglu, J. T. Walker, and J. D. Meindl. A symmetric clock-distribution
tree and optimized high-speed interconnections for reduced clock skew in ULSI
and WSI circuits. In VLSI in Computers and Processors, pages 118-122, Rye
Brook, NY, October 1986. IEEE International Conference on Computer Design.
[15] Allan L. Fisher and H. T. Kung. Synchronizing large VLSI processor arrays.
IEEE Transactions on Computers, C-34(8):734-740, August 1985.
[16] Ahmed El-Amawy. Clocking arbitrarily large computing structures under con-
stant skew bound. IEEE Transactions on Parallel and Distributed Systems,
4(3):241-255, 1993.
[17] Daniel W. Dobberpuhl et al. A 200-MHz 64-b dual-issue CMOS microprocessor.
Journal of Solid State Circuits, 27(11):1555-1567, November 1992.
[18] Bradley J. Benschneider et al. A 300-MHz 64-b quad-issue CMOS RISC microprocessor.
Journal of Solid State Circuits, 30(11):1203-1214, November 1995.
[19] Paul E. Gronowski et al. A 433-MHz 64-b quad-issue RISC microprocessor.
Journal of Solid State Circuits, 31(11):1687-1696, November 1996.
[20] Donald F. Wann and Mark A. Franklin. Asynchronous and clocked control struc-
tures for VLSI based interconnection networks. IEEE Transactions on Comput-
ers, C-32(3):284-293, March 1983.
[21] S. Y. Kung and R. J. Gal-Ezer. Synchronous versus asynchronous computation
in very large scale integrated (VLSI) array processors. Proceedings of SPIE,
341:53-65, May 1982.
[22] Sanjay Dhar, Mark A. Franklin, and Donald F. Wann. Reduction of clock delays
in VLSI structures. In IEEE International Conference on Computer Design,
pages 778-783, October 1984.
[23] Mehdi Hatamian and Glenn L. Cash. Parallel bit-level pipelined VLSI designs for
high-speed signal processing. Proceedings of the IEEE, 75(9):1192-1202, Septem-
ber 1987.
[24] Eby G. Friedman and Scott Powell. Design and analysis of hierarchical clock
distribution system for synchronous standard cell/macrocell VLSI. Journal of
Solid State Circuits, SC-21(2):240-246, April 1986.
[25] Michael A. B. Jackson, Arvind Srinivasan, and E. S. Kuh. Clock routing for high-
performance ICs. In 27th Proceedings of the ACM/IEEE Design Automation
Conference, pages 573-579, June 1990.
[26] Fumihiro Minami and Midori Takano. Clock tree synthesis based on RC delay
balancing. In Proceedings of the IEEE Custom Integrated Circuits Conference,
pages 28.3.1-28.3.4, May 1992.
[27] Ting-Hai Chao, Yu-Chin Hsu, Jan-Ming Ho, Kenneth D. Boese, and Andrew B.
Kahng. Zero skew clock routing with minimum wirelength. IEEE Transactions
on Circuits and Systems-II: Analog and Digital Signal Processing, 39(11):799-
814, November 1992.
[28] Jason Cong, Andrew B. Kahng, and Gabriel Robins. Matching-based methods for
high-performance clock routing. IEEE Transactions on Computer-Aided Design,
12(8):1157-1169, August 1993.
[29] Ren-Song Tsay. An exact zero-skew clock routing algorithm. IEEE Transactions
on Computer-Aided Design, 12(2):242-249, February 1993.
[30] Andrew B. Kahng and C.-W. Albert Tsao. Practical bounded-skew clock routing.
Journal of VLSI Signal Processing, 16(2/3):87-103, June/July 1997.
[31] Shantanu Ganguly, Daksh Lehther, and Satyamurthy Pullela. Clock distribu-
tion methodology for the PowerPC microprocessors. Journal of VLSI Signal
Processing, 16(2/3):181-189, June/July 1997.
[32] Earl T. Cohen et al. A 533MHz BiCMOS superscalar microprocessor. In ISSCC
Digest of Technical Papers, pages 164-165, February 1997.
[33] Charles F. Webb et al. A 400MHz S/390 microprocessor. In ISSCC Digest of
Technical Papers, pages 168-169, February 1997.
[34] Toyohiko Yoshida et al. A 2V 250MHz multimedia processor. In ISSCC Digest
of Technical Papers, pages 266-267, February 1997.
[35] G. Geannopoulos and X. Dai. An adaptive digital deskewing circuit for clock
distribution networks. In ISSCC Digest of Technical Papers, pages 400-401,
February 1998.
[36] Vernon L. Chi. Salphasic distribution of clock signals for synchronous systems.
IEEE Transactions on Computers, 43(5):597-602, May 1994.
[37] M. E. Becker and T. F. Knight, Jr. Transmission line clock driver. In IEEE
International Conference on Computer Design, pages 489-490, October 1999.
[38] C.-S. Li, F. Tong, K. Liu, and D. G. Messerschmitt. Fanout analysis of multi-
stage optical clock distribution using optical amplifiers. In Globecom, pages
434-438, 1991.
[39] Helmut Zarschizky, Christian Gerndt, Martin Honsberg, and Ekkehard Klement.
Optical clock distribution with a compact free space interconnect system. In
IEEE Lasers and Electro-Optics Society Annual Meeting, pages 590-591, 1992.
[40] Gill A. Pratt and John Nguyen. Distributed synchronous clocking. IEEE Trans-
actions on Parallel and Distributed Systems, February 1995.
[41] David G. Messerschmitt. Synchronization in digital system design. IEEE Journal
Selected Areas in Communications, 8(8):1404-1419, October 1990.
[42] Morteza Afghahi and Christer Svensson. Performance of synchronous and
asynchronous schemes for VLSI systems. IEEE Transactions on Computers,
41(7):858-872, July 1992.
[43] D. Boning and S. Nassif. Models of Process Variations in Device and Intercon-
nect, chapter 6. IEEE Press, 2000.
[44] Brian E. Stine et al. Simulating the impact of poly-CD wafer-level and die-level
variation on circuit performance. In Second International Workshop on Statistical
Metrology, June 1997.
[45] M. Eisele, J. Berthold, R. Thewes, E. Wohlrab, D. Schmitt-Landsiedel, and
W. Weber. Intra-die device parameter variations and their impact on digital
CMOS gates at low supply voltages. In Technical Digest of IEDM, pages 67-70,
1995.
[46] Duane S. Boning and James E. Chung. Statistical metrology - measurement
and modelling of variation for advanced process development and design rule
generation. In Proceedings of the International Conference on Characterization
and Metrology for ULSI Technology, March 1998.
[47] Tomohisa Mizuno, Jun-ichi Okamura, and Akira Toriumi. Experimental study
of threshold voltage fluctuation due to statistical variation of channel dopant
number in MOSFET's. IEEE Transactions on Electron Devices, 41(11):2216-
2221, November 1994.
[48] Martin Eisele, Jörg Berthold, Doris Schmitt-Landsiedel, and Reinhard
Mahnkopf. The impact of intra-die device parameter variations on path delays
and on the design for yield of low voltage digital circuits. IEEE Transactions on
VLSI, 5(4):360-368, December 1997.
[49] Xinghai Tang, Vivek K. De, and James D. Meindl. Intrinsic MOSFET parameter
fluctuations due to random dopant placement. IEEE Transactions on VLSI,
5(4):369-376, December 1997.
[50] D. C. Keezer and V. K. Jain. Design and evaluation of wafer scale clock
distribution. In Proceedings of the IEEE International Conference on Wafer Scale
Integration, pages 168-175, January 1992.
[51] José Luis Neves and Eby G. Friedman. Circuit synthesis of clock distribution
networks based on non-zero clock skew. In Proceedings of the IEEE International
Symposium on Circuits and Systems, pages 4.175-4.178, June 1994.
[52] Mohamed Nekili, Guy Bois, and Yvon Savaria. Pipelined H-trees for high-speed
clocking of large integrated systems in the presence of process variations. IEEE
Transactions on VLSI, 5(2):161-174, June 1997.
[53] Masakazu Shoji. Elimination of process-dependent clock skew in CMOS VLSI.
Journal of Solid State Circuits, SC-21(5):875-880, October 1986.
[54] Satyamurthy Pullela, Noel Menezes, and Lawrence T. Pillage. Reliable non-
zero skew clock trees using wire width optimization. In 30th Proceedings of the
ACM/IEEE Design Automation Conference, pages 165-170, June 1993.
[55] Masato Edahiro. Delay minimization for zero-skew routing. In Proceedings of
the IEEE International Conference on Computer-Aided Design, pages 563-566,
November 1993.
[56] Steven D. Kugelmass and Kenneth Steiglitz. An upper bound of expected clock
skew in synchronous systems. IEEE Transactions on Computers, 39(12):1475-
1477, December 1990.
[57] Marios D. Dikaiakos and Kenneth Steiglitz. Comparison of tree and straight-
line clocking in long systolic arrays. Journal of VLSI Signal Processing, pages
1177-1180, 1991.
[58] Keith A. Bowman, Xinghai Tang, John C. Eble, and James D. Meindl. Impact
of extrinsic and intrinsic parameter variations on CMOS system on a chip perfor-
mance. In Proceedings of the ASIC/SOC Conference, pages 267-271, September
1999.
[59] Marcel J. M. Pelgrom, Aad C. J. Duinmaijer, and Anton P. G. Welbers. Matching
properties of MOS transistors. Journal of Solid State Circuits, 24(5):1433-1440,
October 1989.
[60] Shy-Chyi Wong, Kuo-Hua Pan, Dye-Jyun Ma, M. S. Liang, and P. N. Tseng. On
matching properties and process factors for submicrometer CMOS. In Proceed-
ings of the 1996 IEEE International Conference on Microelectronic Test Struc-
tures, volume 9, pages 43-47, March 1996.
[61] Shih-Wei Sun and Paul G. Y. Tsui. Limitation of CMOS supply-voltage scal-
ing by MOSFET threshold-voltage variation. Journal of Solid State Circuits,
30(8):947-949, August 1995.
[62] M. Nekili, Y. Savaria, and G. Bois. Spatial characterization of process variations
via MOS transistor time constants in VLSI and WSI. Journal of Solid State
Circuits, 34(1):80-84, January 1999.
[63] Payman Zarkesh-Ha, Tony Mule, and James D. Meindl. Characterization and
modeling of clock skew with process variations. In Proceedings of the IEEE 1999
Custom Integrated Circuits Conference, pages 441-444, 1999.
[64] Ian A. Young, Monte F. Mar, and Bharat Bhushan. A 0.35µm CMOS 3-880MHz
PLL N/2 clock multiplier and distribution network with low jitter for micropro-
cessors. In ISSCC Digest of Technical Papers, pages 330-331, February 1997.
[65] Raghunand Bhagwan and Alan Rogers. A 1GHz dual-loop microprocessor PLL
with instant frequency shifting. In ISSCC Digest of Technical Papers, pages
336-337, February 1997.
[66] P. J. Restle, K. A. Jenkins, A. Deutsch, and P. W. Cook. Measurement and mod-
eling of on-chip transmission line effects in a 400 MHz microprocessor. Journal
of Solid State Circuits, 33(4):662-665, April 1998.
[67] Y. Uraoka, T. Maeda, I. Miyanaga, and K. Tsuji. New failure analysis technique
of ULSIs using photon emission method. In Proceedings of the International
Conference on Microelectronic Test Structures, volume 5, pages 100-105, March
1992.
[68] Yukiharu Uraoka, Isao Miyanaga, Kazuhiko Tsuji, and Shigenobu Akiyama.
Failure analysis of ULSI circuits using photon emission. IEEE Transactions on
Semiconductor Manufacturing, 6(4):324-331, November 1993.
[69] Andrew E. Stevens, Richard P. Van Berg, Jan Van Der Spiegel, and Hugh H.
Williams. A time-to-voltage converter and analog memory for colliding beam
detectors. Journal of Solid State Circuits, 24(6):1748-1752, December 1989.
[70] C. Konstadakellis, S. Siskos, and Th. Laopoulos. A fast, versatile, CMOS time-
to-voltage converter. In Proceedings of the 6th Mediterranean Electrotechnical
Conference, pages 282-285, 1991.
[71] Elvi Räisänen-Ruotsalainen, Timo Rahkonen, and Juha Kostamovaara. A time
digitizer with interpolation based on time-to-voltage conversion. In Proceedings
of the 40th Midwest Symposium on Circuits and Systems, pages 197-200, August
1997.
[72] Dan Weinlader, Ron Ho, Chih-Kong Ken Yang, and Mark Horowitz. An eight
channel 36Gsample/s CMOS timing analyzer. In ISSCC Digest of Technical
Papers, pages 170-171, 2000.
[73] Thomas A. Knotts, David Chu, and Jeremy Sommer. A 500MHz time digitizer
IC with 15.625ps resolution. In ISSCC Digest of Technical Papers, pages 58-59,
1994.
[74] Yasuo Arai and Masahiro Ikeno. A time digitizer CMOS gate-array with a 250 ps
time resolution. Journal of Solid State Circuits, 31(2):212-219, February 1996.
[75] J. G. Maneatis and M. A. Horowitz. Precise delay generation using coupled
oscillators. Journal of Solid State Circuits, 28(12):1273-1282, December 1993.
[76] Lindsay Kleeman. The jitter model for metastability and its application to
redundant synchronizers. IEEE Transactions on Computers, 39(7):930-942, July
1990.
[77] W. A. M. Van Noije, W. T. Liu, and S. J. Navarro, Jr. Precise final state
determination in mismatched CMOS latches. Journal of Solid State Circuits,
30(5):607-611, May 1995.
Figure A1.1: Top-level (chip core)
Figure A1.2: Node
Figure A1.3: Relaxation oscillator
Figure A1.4: Compensation amplifier and summer
Figure A1.5: Differential to single-ended amplifier
Figure A1.6: Sampled phase comparator
Figure A1.7: Phase comparator core
Figure A2.1: Top-level (chip core)
Figure A2.2: Individual tile
Figure A2.3: Node
Figure A2.4: Compensation amplifier
Figure A2.5: Ring oscillator
Figure A2.6: Differential inverter for the ring oscillator
Figure A2.7: Clock divider
Figure A2.8: Jitter measurement block
Figure A2.9: Pulse generator
Figure A2.10: DRAM block
Figure A2.12: DRAM bitslice
Figure A2.13: Phase measurement arbiter