446 • 2007 IEEE International Solid-State Circuits Conference
ISSCC 2007 / SESSION 24 / MULTI-GB/s TRANSCEIVERS / 24.6
24.6 A 16Gb/s Source-Series Terminated Transmitter in 65nm CMOS SOI
Christian Menolfi, Thomas Toifl, Peter Buchmann, Marcel Kossel, Thomas Morf, Jonas Weiss, Martin Schmatz
IBM, Rueschlikon, Switzerland
The quest for high data rates at low power consumption and area has renewed interest in the source-series terminated (SST) driver. SST drivers do not necessarily outperform counterparts that use CML output stages. Their advantage lies in their potential for lower-power operation [1] and in their ability to cope with a wide range of termination voltages, which makes them a prime candidate for a multi-standard TX. A further advantage is that achieving acceptable analog performance in advanced digital CMOS technologies is increasingly difficult, whereas the SST driver principle relies entirely on digital switching devices, which are optimized for high-speed operation and continue to scale with technology. This paper presents the architecture of a half-rate SST TX and the design of its key components. The TX implements a versatile, power- and area-efficient equalization and impedance-tuning scheme. A clock duty-cycle cleanup circuit lets it achieve low jitter and negligible duty-cycle distortion (DCD).
Figure 24.6.1 shows a block diagram of the implemented differential half-rate SST TX. All local clocks are derived from a global half-rate CML clock (ck2cml). This clock is converted to a CMOS half-rate clock (ck2) and a CMOS quarter-rate clock (ck4). A first multiplexer stage transforms the 4b quarter-rate input data d[0:3] into half-rate interleaved even and odd data streams. The even and odd streams then each pass through a 4b shift register. Each shift register consists of 4 latches driven on opposite phases of the half-rate clock ck2, and it implements the delayed data taps of a 4-tap FIR pre-emphasis filter (tap[0:3][even, odd]). An XOR gate follows each shift-register latch output and selectively inverts the corresponding filter tap (sign[0:3]), which sets the sign of that pre-emphasis tap. The resulting 4 even and 4 odd (8 in total) half-rate tap data streams are then globally distributed, together with the half-rate CMOS clock ck2, to 44 identical differential half-rate SST driver slices. Each slice can be configured to select one even/odd tap stream pair out of the 4×2 available tap data streams. Consequently, any SST driver slice can be assigned to any of the 4 FIR taps, which gives the TX versatile, power- and area-efficient equalization capabilities.
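The Python sketch below illustrates how a slice allocation translates into FIR tap weights. It treats every enabled slice as an identical unit that superposes linearly. Under that model, the weight of tap k is sign[k] times the number of slices assigned to that tap, divided by the number of enabled slices. The idealized model, the helper names `tap_weights` and `fir_output`, and the example allocation are assumptions for illustration only. The example puts 24 slices on tap 0, treated as the main cursor, and 6 slices on tap 1 with inverted sign. This is not the configuration used in the paper.

```python
from typing import Sequence

N_TOTAL = 44  # identical differential SST slices (from the text)


def tap_weights(slices_per_tap: Sequence[int], signs: Sequence[int]) -> list[float]:
    """Normalized FIR weights for an idealized SST driver (assumed model).

    Every enabled slice is assumed identical and linearly superposing, so
    weight[k] = signs[k] * slices_per_tap[k] / (total enabled slices).
    """
    if len(slices_per_tap) != 4 or len(signs) != 4:
        raise ValueError("expected 4 taps")
    enabled = sum(slices_per_tap)
    if not 0 < enabled <= N_TOTAL:
        raise ValueError("slice allocation must use between 1 and 44 slices")
    return [s * n / enabled for n, s in zip(slices_per_tap, signs)]


def fir_output(bits: Sequence[int], weights: Sequence[float]) -> list[float]:
    """Normalized output level per UI for bipolar symbols (+1 for 1, -1 for 0).

    The XOR sign control is modeled as multiplying a tap's symbol by -1.
    Only UIs with a full tap history are returned.
    """
    symbols = [1 if b else -1 for b in bits]
    ntaps = len(weights)
    return [
        sum(weights[k] * symbols[n - k] for k in range(ntaps))
        for n in range(ntaps - 1, len(symbols))
    ]


if __name__ == "__main__":
    # Hypothetical allocation: 24 slices on tap 0, 6 inverted slices on tap 1.
    w = tap_weights([24, 6, 0, 0], [1, -1, 1, 1])
    print(fir_output([1, 1, 1, 1, 0, 0, 0, 0], w))
```

With this hypothetical allocation, a transition bit swings to the full normalized level of 1.0, while repeated bits settle at 0.6. That corresponds to a post-cursor de-emphasis of about 4.4dB.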
Figure 24.6.2 depicts a schematic of a half-rate differential SST slice. Each slice is composed of 2 single-ended SST drivers that form a pseudo-differential output stage. One single-ended slice contains 2 pull-up/pull-down branches with the corresponding even and odd data and select transistors, along with a common pull-up/pull-down polysilicon linearization resistor. The even and odd data for both single-ended SST slices come from two 4:1 input multiplexers, which select the data streams corresponding to the assigned FIR position. Re-timing latches are introduced to guarantee stable data and appropriate timing at the multiplexing SST output stage. The output impedance of the driver equals the parallel combination of all enabled SST slices, and it does not matter whether a particular slice is pulling up or down. Termination-impedance tuning is therefore obtained with additional logic that disables a selectable number of SST slices and sets them to high impedance (enable=0). A nominal 50Ω impedance is achieved when 30 SST slices are enabled. This leaves a tuning range of ±14 slices to comfortably cope with process tolerances. A MOS-to-poly resistance ratio of 1:3 is chosen in this implementation for the optimum accuracy/area trade-off.
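A few lines of Python can check the impedance-tuning arithmetic. The sketch makes three assumptions. Each single-ended slice path is the on-resistance of the conducting MOS branch in series with the common poly resistor. Enabled slices are identical. Enabled slices appear in parallel. Under these assumptions, 30 slices at 50Ω imply about 1500Ω per slice, which the 1:3 ratio splits into 375Ω MOS and 1125Ω poly. The calibration helper `select_enabled_slices` is hypothetical. It picks the slice count in the 16 to 44 range (30 ± 14) that lands closest to 50Ω.

```python
import math

Z_TARGET = 50.0          # ohms, nominal single-ended output impedance (text)
N_NOMINAL = 30           # slices enabled for 50 ohms at nominal process (text)
N_TOTAL = 44             # slices available (text)
N_MIN = N_NOMINAL - 14   # lower end of the +/-14 tuning range (text)
MOS_FRACTION = 1 / 4     # MOS : poly = 1 : 3 (text), so MOS is 1/4 of the total

# Per-slice resistance implied by 30 identical slices in parallel at 50 ohms.
R_SLICE_NOM = Z_TARGET * N_NOMINAL  # 1500 ohms


def slice_resistance(mos_scale: float = 1.0, poly_scale: float = 1.0) -> float:
    """Series MOS + poly resistance of one slice path (assumed model)."""
    r_mos = R_SLICE_NOM * MOS_FRACTION * mos_scale
    r_poly = R_SLICE_NOM * (1 - MOS_FRACTION) * poly_scale
    return r_mos + r_poly


def output_impedance(r_slice: float, n_enabled: int) -> float:
    """Enabled slices appear in parallel whether they pull up or down."""
    return r_slice / n_enabled


def select_enabled_slices(r_slice: float, z_target: float = Z_TARGET) -> int:
    """Hypothetical calibration: slice count in [N_MIN, N_TOTAL] closest to z_target."""
    ideal = r_slice / z_target
    # r/n is convex in n, so the best integer count is floor or ceil of the ideal.
    candidates = {
        min(max(n, N_MIN), N_TOTAL) for n in (math.floor(ideal), math.ceil(ideal))
    }
    return min(candidates, key=lambda n: abs(output_impedance(r_slice, n) - z_target))


if __name__ == "__main__":
    for r in (1200.0, slice_resistance(), 1800.0):
        n = select_enabled_slices(r)
        print(f"R_slice={r:.0f} ohm -> {n} slices, Z_out={output_impedance(r, n):.1f} ohm")
```

The 1:3 split also shows why the poly resistor linearizes the slice. A 20% change in MOS on-resistance shifts the total slice resistance by only 5%, because the MOS device contributes just a quarter of the total.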
A particular concern in a half-rate SST TX is that it operates on both half-cycles of the half-rate CMOS clock ck2. Any imbalance or DCD therefore has a direct impact on the jitter performance. Special measures are taken in the CML-to-CMOS clock converter, shown in Fig. 24.6.3(a), to clean up DCD. The first CML buffer stage uses DC suppression and acts as a first DCD-cleanup stage [2]. It also provides some gain to maximize the CML output signal swing (out, outb). An AC-coupled inverter with resistive feedback follows the first CML buffer and acts as the CML-to-CMOS conversion stage [3]. AC coupling is a simple and efficient way to remove any DC component and to level-shift the signal to the inverter input, which the feedback resistor DC-biases at its trip point. Three tapered CMOS inverters then provide enough drive strength to globally distribute the half-rate CMOS clock ck2/ck2b to the 44 unit SST slices. The delay in the CMOS clock path is kept small (~43ps) to minimize power-supply-noise-induced jitter. The layout also keeps the fully differential clock nets ck2/ck2b symmetrical.
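Two short calculations illustrate why DCD matters in a half-rate TX and how DC suppression removes it. First, even and odd bits are assumed to be launched on opposite half-cycles of ck2. Under that assumption, a duty-cycle error maps directly onto unequal UI widths. Second, the sketch models the high-frequency CML clock as a sinusoid whose duty-cycle error comes from a DC offset relative to the slicing threshold. It then removes the mean as an idealized stand-in for the DC suppression and AC coupling described above. Both the sinusoidal model and the function names are assumptions for illustration. They are not a model of the actual circuit in Fig. 24.6.3(a).

```python
import math


def ui_widths_ps(bit_rate_gbps: float, duty_cycle: float) -> tuple[float, float]:
    """Even/odd bit widths (ps) in a half-rate TX with the given clock duty cycle.

    The half-rate clock period is 2 UI; even bits are assumed to occupy the
    high phase and odd bits the low phase.
    """
    t_clk_ps = 2.0 / bit_rate_gbps * 1e3  # 125ps at 16Gb/s
    return duty_cycle * t_clk_ps, (1.0 - duty_cycle) * t_clk_ps


def sliced_duty_cycle(offset: float, remove_dc: bool, n: int = 100_000) -> float:
    """Duty cycle after slicing sin(2*pi*t) + offset at zero over one period.

    With remove_dc=True the sample mean is subtracted first, an idealized
    stand-in for DC suppression / AC coupling (assumption, not the circuit).
    """
    samples = [math.sin(2.0 * math.pi * i / n) + offset for i in range(n)]
    if remove_dc:
        mean = sum(samples) / n
        samples = [v - mean for v in samples]
    return sum(1 for v in samples if v > 0.0) / n


if __name__ == "__main__":
    print(ui_widths_ps(16.0, 0.51))
    print(sliced_duty_cycle(0.2, remove_dc=False), sliced_duty_cycle(0.2, remove_dc=True))
```

At 16Gb/s, the ck2 period is 125ps, so a 51% duty cycle already produces 63.75ps and 61.25ps bits. In the sinusoidal model, an offset of 0.2 (relative to unit amplitude) skews the duty cycle to about 56.4%. Removing the DC component brings it back to 50%.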
Fig. 24.6.3(b) shows the layout of the implemented SST TX. The macro occupies an area of 230×56µm², including ESD protection. A wafer-probeable test chip is implemented to characterize the performance of the SST TX (Fig. 24.6.4). The test chip consists of 2 SST TX lanes that share a common external differential half-rate CML clock input. A serial 3-wire interface controls the TX settings and the 2 bit-pattern generators, which generate independently programmable quarter-rate data for the two TX lanes. Supply decoupling capacitors on the order of 50pF per lane are placed close to each TX lane. A chip micrograph is shown in Fig. 24.6.7.
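The text describes the bit-pattern generators only as producing programmable quarter-rate data, and the eye measurements use PRBS15 data. The sketch below is a hypothetical software model of that data path. It uses a 15-bit maximal-length LFSR with feedback from stages 15 and 14, following the common PRBS15 polynomial x^15 + x^14 + 1. It groups the output into 4-bit quarter-rate words d[0:3] and then applies a 4:2 split into even (d0, d2) and odd (d1, d3) streams. Two details are assumptions read from the labels of Fig. 24.6.1: the bit ordering (d0 transmitted first) and the even/odd assignment. The sketch does not describe the on-chip generator.

```python
def prbs15(n_bits: int, seed: int = 0x7FFF) -> list[int]:
    """PRBS15 bits from a 15-bit Fibonacci LFSR: s[n] = s[n-15] XOR s[n-14]."""
    state = seed & 0x7FFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 14) ^ (state >> 13)) & 1  # stages 15 and 14
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7FFF
    return bits


def quarter_rate_words(bits: list[int]) -> list[list[int]]:
    """Group serial bits into 4-bit words [d0, d1, d2, d3]; d0 assumed sent first."""
    usable = len(bits) - len(bits) % 4
    return [bits[i:i + 4] for i in range(0, usable, 4)]


def mux_4to2(words: list[list[int]]) -> tuple[list[int], list[int]]:
    """Split each word into even (d0, d2) and odd (d1, d3) half-rate streams."""
    even, odd = [], []
    for d in words:
        even.extend((d[0], d[2]))
        odd.extend((d[1], d[3]))
    return even, odd


def serialize(even: list[int], odd: list[int]) -> list[int]:
    """Interleave even and odd streams back into the serial bit order."""
    out = []
    for e, o in zip(even, odd):
        out.extend((e, o))
    return out


if __name__ == "__main__":
    words = quarter_rate_words(prbs15(16))
    print(words, mux_4to2(words))
```

A maximal-length 15-bit LFSR repeats every 2^15 - 1 = 32767 bits and contains 16384 ones per period. Interleaving the even and odd streams restores the original serial order.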
Figure 24.6.5 shows 3 measured PRBS15 differential data eyes at a data rate of 16Gb/s and at 3 different termination voltages Vtt, along with the measured jitter numbers. The measured RJ is sub-ps, which is essentially the resolution limit of the measurement equipment. The measured DJ is ~8.5ps and dominated by ISI. In split-wafer-lot measurements, DCD remains below 600fs at data rates between 5.2 and 12.5Gb/s. The jitter shows a DC supply sensitivity of -4.6ps/100mV (at Vdd=1V). Switching the neighboring lane into operation degrades DJ only slightly (<1ps). No perceivable difference in either the jitter performance or the eye opening is observed for any termination voltage Vtt from 0 to 1V. This demonstrates the versatility of the SST TX in coping with different termination standards. The sub-ps RJ confirms the quality of the SST TX clocking path, and so does the DCD-cleanup performance. Figure 24.6.6 shows the measured output peak-to-peak DCD versus the incoming-clock DCD at a data rate of 5Gb/s (2.5GHz CML clock) for different termination voltages Vtt. For an input DCD of ±10%, the measured output DCD remains below 0.5%pp at all termination voltages. One SST TX lane, including its bit-pattern generator, draws a measured supply current of 57.5mA at 16Gb/s with differential 100Ω termination and a nominal Vdd of 1V. This corresponds to a power efficiency of 3.6mW/Gb/s.
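The power-efficiency figure follows directly from the measured supply current. The symbols I_dd (supply current) and R_b (bit rate) are introduced here only for this calculation. Since 1mW/(Gb/s) equals 1pJ/bit, the result can also be read as an energy of about 3.6pJ per transmitted bit:

```latex
\begin{align}
P &= I_{dd}\,V_{dd} = 57.5\,\mathrm{mA} \times 1\,\mathrm{V} = 57.5\,\mathrm{mW},\\
\frac{P}{R_b} &= \frac{57.5\,\mathrm{mW}}{16\,\mathrm{Gb/s}} \approx 3.59\,\mathrm{mW/(Gb/s)} = 3.59\,\mathrm{pJ/bit}.
\end{align}
```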
Acknowledgements:
The authors would like to thank Bhavna Agrawal, Michael Beakes, Nick Perez, Steve Walker, and Carl Wermer for analog design enablement and SOI modeling support, and the IBM foundry team for chip manufacturing.
References:
[1] H. Hatamkhani, K. J. Wong, R. Drost, et al., "A 10mW 3.6Gbps I/O Transmitter," Symp. VLSI Circuits, pp. 97-98, Jun. 2003.
[2] C. Menolfi, T. Toifl, R. Reutemann, et al., "A 25Gb/s PAM4 Transmitter in 90nm CMOS SOI," ISSCC Dig. Tech. Papers, pp. 72-73, Feb. 2005.
[3] J. Savoj, B. Razavi, "A CMOS Interface Circuit for Detection of 1.2Gb/s RZ Data," ISSCC Dig. Tech. Papers, pp. 278-279, Feb. 1999.
1-4244-0852-0/07/$25.00 ©2007 IEEE.
447DIGEST OF TECHNICAL PAPERS •
ISSCC 2007 / February 14, 2007 / 11:15 AM
Figure 24.6.1: Half-rate SST TX block diagram.
Figure 24.6.2: SST driver half-rate slice architecture.
Figure 24.6.3: (a) CML-to-CMOS converter with DCD cleanup, (b) layout of one SST TX core.
Figure 24.6.5: Measured PRBS15 data eye at 16Gb/s and at different termination voltages Vtt.
Figure 24.6.6: Measured output DCD versus input-clock DCD at 5Gb/s and different termination voltages.
Figure 24.6.4: Test-chip block diagram.
[Figure 24.6.1 content: MUX 4:2 (d0, d2 to even; d1, d3 to odd; at ck4), even and odd FIR shift registers (4 taps) with tap sign control sign[0:3], CML-to-CMOS clock generator (ck2cml in; ck2/ck2b and ck4/ck4b out), global distribution of tap 0 to tap 3 (even/odd) data to 44 differential half-rate unit SST drivers, configuration register (enable[0:28], config[0,1][0:43]), outputs out+/out-.]
[Figure 24.6.2 content: single-ended half-rate SST slice (2x per differential slice) with even (de) and odd (do) pull-up/pull-down branches switched by ck2/ck2b; even and odd 4:1 data muxes (config: tap0, tap1, tap2, tap3) followed by re-timing latches; enable: 0 = off (high impedance), 1 = on; outputs out+/out-.]
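[Figure 24.6.3 content: (a) CML stage M1/M1B (inputs ck2cml/ck2cmlb, outputs Out/Outb) with C1/C1B, followed by a CMOS chain of 4x FO2.5 stages (~43ps) driving ck2/ck2b; (b) 230µm × 56µm layout with two groups of 22 differential SST slices, CMOS clock generator, MUX 4:2 and FIR, and ESD+/ESD- protection.]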
[Figure 24.6.4 content: test chip with TX lane 0 and TX lane 1, each comprising a 4-tap FIR, a half-rate SST driver with 44 unit slices, and a pattern generator (PRBS7/15/31 or 8b programmable pattern) supplying d0:d3 at ck4 (CMOS); shared differential ck2cml clock input; outputs out_lane0/out_lane1; 3-wire serial interface (data, clk, load) and configuration register.]
[Figure 24.6.5 content: measured data eyes at Vtt=0V, 0.5V and 1.0V; jitter summary at 16Gb/s, PRBS15:]
Vtt    DJ(δ-δ) [ps]  DCD [ps]  RJ rms [ps]  TJ [ps] (BER 1e-12)
1.0V   8.35          0.250     0.176        10.79
0.5V   8.72          0.170     0.176        11.16
0V     8.53          0.120     0.177        10.99
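The jitter numbers in Fig. 24.6.5 can be cross-checked with the dual-Dirac model, TJ(BER) = DJ(δ-δ) + 2·Q^-1(BER/ρ)·RJ. Here Q^-1 is the inverse Gaussian tail function and ρ is the transition density. The Python sketch rests on two assumptions that the paper does not state. First, ρ = 0.5, a common instrument convention. Second, the DJ values are listed in the same Vtt order as the TJ/RJ rows (Vtt = 1.0V, 0.5V, 0V). Under these assumptions, the predicted TJ agrees with the reported values to within about 0.01ps.

```python
import math

# Values from Fig. 24.6.5 (16Gb/s, PRBS15). Assumption: the DJ values follow
# the same Vtt order as the TJ/RJ rows.
MEASURED = {
    # Vtt [V]: (DJ dual-Dirac [ps], RJ rms [ps], TJ at BER 1e-12 [ps])
    1.0: (8.35, 0.176, 10.79),
    0.5: (8.72, 0.176, 11.16),
    0.0: (8.53, 0.177, 10.99),
}


def gaussian_tail(x: float) -> float:
    """Q(x): probability that a standard normal variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def inverse_gaussian_tail(p: float) -> float:
    """Solve Q(x) = p for x by bisection (Q decreases monotonically)."""
    lo, hi = 0.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gaussian_tail(mid) > p:
            lo = mid  # tail probability still too large: move further out
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dual_dirac_tj(dj_ps: float, rj_ps: float, ber: float = 1e-12,
                  transition_density: float = 0.5) -> float:
    """Total jitter predicted by the dual-Dirac model at the given BER."""
    q = inverse_gaussian_tail(ber / transition_density)
    return dj_ps + 2.0 * q * rj_ps


if __name__ == "__main__":
    for vtt, (dj, rj, tj) in MEASURED.items():
        print(f"Vtt={vtt:.1f}V: predicted TJ {dual_dirac_tj(dj, rj):.2f}ps, reported {tj:.2f}ps")
```

With ρ = 0.5, the multiplier 2·Q^-1(2e-12) is about 13.9. With ρ = 1, it is about 14.07, which overestimates every reported TJ by roughly 0.03ps to 0.04ps.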
[Figure 24.6.6 content: output DCD (% pk-pk, 0 to 0.6) versus input DCD (%, -11 to +11) at 5Gb/s for Vtt = 0.0, 0.5 and 1.0V; inset: applied 2.5GHz input clock signal (400ps period) with duty cycle varied around 50% by -10%/+10% (20% p-p).]
Figure 24.6.7: Test-chip micrograph.
[Figure 24.6.7 content: 1mm × 1mm die with SST TX lane 0 and lane 1 (out+/out- each), decoupling caps, data pattern generators, 3-wire interface & config reg., and clock input ck2cml/ck2cmlb.]