Home >Documents >Serializer and Deserializer (SerDes) for High Speed...

Serializer and Deserializer (SerDes) for High Speed...

Date post:18-Mar-2018
Category:
View:225 times
Download:8 times
Share this document with a friend
Transcript:

Serializer and Deserializer (SerDes) for High Speed Serial Communications

PAGE

7

SerDes Transceivers for High-speed Serial Communications

Dianyong Chen, Shoujun Wang, and Tad Kwasniewski

1 Introduction

In order to process and redistribute digital information, data are constantly exchanged between different systems and also between different functional blocks inside a system. Serial communications and parallel communications currently and historically coexist and serve various requirements of intrasystem and intersystem data exchange. In parallel communications several symbols are sent at one time over a communications link, while in serial communications only one symbol is sent at one time. The choice of one method over another is usually a tradeoff on factors such as speed, cost of materials, power consumption, and difficulty of physical realization. In modern telecommunication systems and computer systems, paralell communications and serial communications are often used simulataneously. Therefore, it is an important task to serialize and deserialize data stream.

2 SerDes Transceiver for High-Speed Serial Communications

In principle, parallel communications are instrinsically faster than serial communications, because the speed of a parallel data link is equal to the number of symbols sent in parallel times the symbol rate of each individual path; doubling the number of symbols sent at once doubles the data rate. For this reason, parallel commuications are widely used in internal buses of integrated circuits and short distance chip to chip links. However, contradicting to superficial instincts, parallel communications are being replaced by serial communications in high-speed data links. These links include chip to chip communications on backplanes, computer networks, computer peripheral buses, long-haul communications, and etc. A conventional reason to choose serial communications instead of parallel communications is cost. In long-haul communication systems, copper cables or optical fibers are expensive; doubling the number of cables or fibers doubles the cost. In chip to chip communications, paralell data links require more pins, which increases the cost of packaging. According to [1] packaging already represents 25% of the total system cost in some electronic products. Nevertheless, the main challenges that deprecate parallel communications in these applications are clock skew [2], [3], data skew, and crosstalk. Skew is the difference in arrival time of symbols transmitted at the same time. Symbols are basically electromagnetic pulses. Because no electromagentic wave can travel faster than the free space light, the time it takes for a signal to travel from the transmitter to the receiver is determined by the length of the electrical trace or optical trace and the group velocity of the signal. Although the difference of arrival time of signals along different pathes is usually very small, it can lead to considerable phase difference in high-speed data links, since the frequency is very high. For example, 1 centimeter path difference causes 240 degree phase difference for 10 GHz clock signals traveling with a velocity that is a half of the free space light speed. Capacitive coupling, components delay, and process, voltage, temperature (PVT) variation deteriorate clock skew and data skew. Clock skew can be corrected by delay-locked-loop (DLL) composed of a variable delay line (VDL) and a control loop due to the periodical nature of the clock signal [4]. In principle, data skew can also be corrected. However, due to the large number of links and analog nature of the received signals, data skew is much more troublesome in parallel data links. As a consequence the system has to slow down to wait for the path with the largest delay. Crosstalk is the interference between adjacent data links. When data rate and the number of links increase, crosstalk also tends to increase. In addition, connectors and vias break the continuity of electromagnetic fileds, and increase the chance of crosstalk [5].

Therefore, in high-speed data links, serial communications are replacing parallel communications rapidly. High-speed serial data links include backplane links such as PCI express, computer networking such as ethernet, computer to peripheral devices such USB, multimedia interface such as HDMI, computer to storage interface such as serial ATA, serial attached SCSI, higsh speed telecommunications such as SONET and SDH.

On the other hand, internal buses of integrated circuits and short distance chip to chip data links use parallel communications to increase data transfer rate and signal processing speed. In addition, massive data are usually stored in slow devices such as RAM; they have to be accessed in parallel to achieve high-speed. A SerDes transmitter serves to transmit those parallel data to the receiver through a high-speed serial data link; the SerDes receiver receives data from the serial data link and delivers parallel data to next stage electronic circuits for further signal processing. A simplified SerDes transceiver is shown in Fig.2.1.

Clock

Data

Source

Source

Encoder

Channel

Encoder

PISO

Tx Buffer

Physical

Channel

Source

Decoder

Channel

Decoder

SIPO

Bit

Detection

Rx Buffer

Physical

Channel

Clock

Data Sink

Fig. 2.1. A block diagram of a simplified SerDes transceiver

The data source is basically a set of user information to be transmitted. It may be files, or audio video streams, etc. The data source usually has some kind of paralellism and the unit is often byte, or word, or double-word. Before transmssion, source encoding is usually performed. Tasks in the source encoding include framing, synchronization (SYNC) patterning, forward error correction (FEC) encoding and etc. These tasks sometimes involve very complicated algorithms and paralellism must be used to increase processing speed. The channel encoding is usually arranged in such a way that after channel encoding the data spectrum becomes a better fit to the physical channel and the bit detection in the receiving end becomes easier. For example, the prevailing 8B/10B channel coding achieves a maximum run length of 5 and maximum digital sum variation of 6 [6]. The output of the channel encoder is then fed to a parallel-in-serial-out (PISO) block to generate a serial symbol stream. Symbols in the stream may be binary or multilevel. The representation of symbols is sometimes called line code, or signaling, or channel pulse modulation. Typical binary line codes are non-return-to-zero (NRZ), non-return-to-zero-inverted (NRZI), Manchester code (return-to-zero), etc. Typical multilevel line codes are duo-binary (3 levels), 4 level pulse amplitude modulation (PAM-4), etc. The serial symbol stream is sent to a transmitter driver, or Tx buffer to convert to proper electrical pulses or optical pulses that can travel through the physical channel to the receiving end. The PSIO and Tx buffer are usually controlled by a symbol rate (or baud rate) clock. In very high-speed serial data links, this clock is generated by a clock multiplier unit (CMU) out of a crystal oscillator. In some applications, a Tx equalizer (pre-emphasis) is implemented before the Tx buffer to counteract some channel impairments.

There are many sources of impairments in the physical channel. Those impairments include frequency dependent attenuation, frequency dependent delay (dispersion), crosstalk, reflection, and noise. Those impairments may be nonlinear and time-variant. However, in engineering practice, a linear time-invariant mathematic model describing the transceiver is still very useful. Fig.2.2 shows a linear mathematic model of SerDes Transceivers.

Pulse Modulator

Transmission Line

S

noise

a

k

c(t)f(t)

r(t)

n(t)

r

k

Fig. 2. 2. A linear mathematical model describing the transceiver

The input data stream ak is the output of the PISO. It is a digital sequence which is discrete both in time and amplitude. When NRZ signaling is used, ak ({0,1} or ak ({-1,+1}. The sampling rate is the baud rate fbaud or clock frequency. The received signal r(t) is a continuous time analog signal.

)

(

)

(

)

(

t

n

iT

t

h

a

t

r

i

i

+

-

=

-

=

(2.1)

where h(t) is the time domain channel impulse response. Here the channel includes the pulse shaping modulator of the transmitter c(t), the physical channel response f(t), and the Tx equalizer response and Rx equalizer response if they are involved. The receiver use a clock to sample the received signal and make a decision to know what is received.

[

]

)

(

)

(

)

(

f

f

f

+

+

+

-

=

+

=

-

=

kT

n

T

i

k

h

a

kT

r

r

i

i

k

(2.2)

The input data sequence ak is usually quite random. Assume there is no additive noise n(t), the received signal rk can be rewritten as

[

]

f

f

+

-

+

+

=

-

-

=

+

=

T

i

k

h

a

h

a

r

i

k

i

k

i

k

k

)

(

)

(

1

1

(2.3)

It can be seen from equation (2.3) that the sampling phase ( is very important. If we sampled at a phase where h(( ) is zero, it would be very difficult to recover data. It is desirable to sample at where h(( ) is at its maximum. The second part of equation (2.3) depends on the sequence before and after ak. It is rather unpredictable because the input sequence is random. This part is usually called inter-symbol-interference (ISI). It has been shown that if the channel response is a Nyquist N-1 pulse, the ISI can be eliminated completely [7]. A Nyquist N-1 pulse is defined as

=

=

+

0

n

when

0,

0

n

when

,

1

)

(

f

nT

h

(2.4)

where n is an integer. However, the response of a physical channel is usually not a Nyquist N-1 pulse, especially when there is little exess bandwidth. The task of an equalizer is to shape the combinational channel response to a Nyquist N-1 pulse. Unfortunately, this is not always possible if noise is taken into account. A Nyquist N-1 pulse has a flat spectrum across the passband. The additive noise is somewhat white. A physical channel response however tends to have large attenuation at high frequency. It even has zeros at some frequencies because of reflection and resonance of parasitics. If the target channel response was a Nyquist N-1 pulse, the equalizer would have enormous gain at frequencies with large attenuation (or even infinite gain at zeros). As a consequence, the noise is boosted much more than the signal at those frequencies. This is called noise enhancement. In practice, linear least-mean-square (LMS) equalization method or nonlinear decision feedback equalization (DFE) is used.

Ironically there is no signal power at the clock frequency, although the sampling clock is so important to data recovery. Most high-speed serial data links are baseband system despite of the very high baud rate. Fiber optical systems such as SONET are indeed modulated system. However, its modulation method and demodulation method are non-coherent. It is usually treated as a baseband system. In Fig.2.2 the pulse shaping modulator is usually a hold function of one symbol period. The power spectral density (PSD) of the received signal is expressed as

2

)

(

)

(

)

(

T

j

T

j

T

j

r

e

F

e

C

e

P

w

w

w

A

=

(2.5)

where A is the PSD of the input sequence, C and F are the Fourier Transform of c(t) and f(t). Since c(t) is just a hold function of one symbol period. Its Fourier Transform is

)

(

sin

1

)

(

sin

)

(

baud

baud

T

j

f

f

c

f

fT

c

T

e

C

p

p

w

=

=

(2.6)

Equation (2.5) and (2.6) show that the received signal power is zero at integer multiples of the clock frequency. Therefore, it is not possible to recover the clock information with linear method. However, in high-speed serial data links, there is no exclusive clock signal. The clock signal has to be derived from the received data waveform.

A clock and data recovery (CDR) circuit is a nonlinear circuit that can recover the clock from the received data waveform which has zero power at the clock frequency. The tasks of a CDR are roughly the follows [8]:

Generate a clock whose frequency is the baud rate.

Adjust the phase of the clock so that it can sample the received data waveform at the time instants when the waveform has maximum signal to noise ratio (SNR).

Recover data with high accuracy in the presence of noise and ISI.

Since there is no signal power at the clock frequency and its harmonics, a CDR can lock to the clock frequency or its harmonics. It does not know which one is correct if without some preknowledge of the data sequence. One method to prevent this is standardization. For example, a single rate CDR for SONET STS-48 should lock to 2488.32 MHz. Therefore, a predefined reference clock can be used. Another method is to have preknowledge of the shortest runlength and the longest runlength of the data source. A counter is used to detect the runlength of a sequence. If the detected shortest runlength is shorter than the predefined shortest runlength, the clock is slower than the data rate; if the detected longest runlength is longer than the predefined longest runlength, the clock is faster than the data rate [9].

Implementation of SerDes transceiver in monolithic microelectronics integrated circuits has much more to consider than the issues mentioned above. In this chapter, we review the representative implementations of main building blocks of SerDes transceivers for high-speed serial data links.

3 Design challenges

It is desirable to implement SerDes transceiver in mainstream CMOS technologies because of their low cost and low power consumption. CMOS circuits are typically slower than circuits implemented in non-mainstream technologies [10]. Although the speed of CMOS circuits is constantly getting faster and the power-consumption becomes lower when scaling down, new circuit styles and architectures enabling low power and high-speed are still very critical for high-speed serial data links.

SerDes transceivers are predominantly mixed signal circuits. With the scaling down of feature size, the supply voltage and threshold voltage drop accordingly, leaving less voltage headroom for signal processing. In the meantime, substrate noise tends to increase [11]. Furthermore, most recent technologies are usually optimized for digital baseband signal processing and lack of accurate models for analog signal processing. Therefore, mixed signal processing is usually vulnerable to PVT variations. Many recent attempts in SerDes transceiver design are to replace analog blocks with digital ones to make the transceiver more robust against PVT variations and noise.

4 Circuit Styles

A high-speed SerDes transceiver is a mixed-signal system as shown in Fig.2.2. On the one hand, it takes a digital sequence ak from the host and passes a digital sequence rk to the client, performing digital signal processing when it is necessary; On the other hand, it must be able to drive the physical channel. The physical channcel distorts and adds noise to the digital signals that travel through it. The received signals become analog. Data recovery relies on a locally regenerated clock and proper sampling. In addition, a high-speed SerDes transceiver is usually a sub-system of a large system, or/and it is used for portable devices. Thus low-power consumption is very critical. Therefore, circuit style must be tailored to meet these requirements.

4.1 High-Speed and Low-Power Digital Circuits

In a high-speed SerDes transceiver, not only source encoder/decoder and channel encoder/decoder are digital circuits, but also phase detector, phase frequency detector, and frequency divider are usually digital ones. Some of these digital circuits run at baud-rate. Therefore, digital circuits must be sufficiently fast while keeping power consumption sufficiently low.

The main factor that limits the speed of a digital circuit is the capacitive load and parasitic capacitive load. The voltage change on a capacitor is be written as

C

T

I

dt

t

i

C

C

Q

V

a

T

t

t

D

=

=

D

=

D

D

+

0

0

)

(

1

(4.1)

where C is the total load capacitance, (T is the time it takes to change the voltage on the load capacitor with an amplitude of (V, and Ia is the average current to charge or discharge the load capacitor. To make a circuit fast, the time (T must be sufficiently small.

a

I

C

V

T

D

=

D

(4.2)

It is quite straightforward how to make a circuit faster: to reduce the voltage swing (V, or/and make the load capacitance C smaller, or/and increase the charging current Ia. A large logic family exploits these fundamental methods to make digital circuit faster. For example, Pseudo-nMOS and Domino-Logic exclude pMOS capacitance from the input, because pMOS input capacitance is usually 2(3 times as large as nMOS input capacitance if they provide the same current. Technology scaling down reduces capacitance as well.

4.1.1 Static Rail-to-Rail CMOS Logic

Static CMOS rail-to-rail logic is by far the most commonly used type of logic ciruit. Despite the very high speed, CMOS rail-to-rail logic is still extensively used in some high-speed SerDes transceivers. The reasons are technology scaling down, reduced power supply voltage, and simplicity, maturity, and robustness of static CMOS logic. Static CMOS logic has a pull-up network and a pull-down network. At any time except transitions, either pull-up network is turned on to pull the output to the power supply voltage or pull-down network is turn on to pull down the output to ground. Since pull-up and pull-down can not be turned on simultaneously except during transitions, in principle satic CMOS logic consumes zero static power. Therefore, static CMOS logic exhibits extremely low power consumption at low frequency applications.

The speed and power consumption of static CMOS logic in high speed applications can be roughly estimated by using two serially connected inverters as shown in Fig.4.1.

P

1

P

2

N

1

N

2

I

P1

I

N1

R

P1

R

N1

Parasitic 1

C

gp2

C

gn1

Parasitic 2

V

in

V

o1

V

o

2

VDD

VDD

Fig.4.1. Two serially connected inverters and the equivalent circuit of the first stage

Assume the initial input signal Vin is high, thus pMOS transistor P1 is switched off, nMOS transistor N1 is switched on, and the voltage Vo1 is low. Lets further assume the input signal has very sharp edges and sufficiently large driving capability, then when Vin jumps from high to low, the time it takes to turn on P1 and turn off N1 is negligible. The voltage Vo1 is pulled up to VDD through transistor P1. However, it can not change abruptly as it has to drive the gate capacitance of P2 and N2, as well as parastic capacitance from the four transistors. When Vo1 increases, VDS of transistor P1 reduces until the channel is not pinched off. Transistor , P1 falls into triode region and the charging current reduces. When Vo1 reaches VDD, the energy (W) stored in the gate capacitors and parasitic capacitors (Ctotal) is

total

a

total

C

T

I

VDD

C

W

2

2

2

2

1

2

1

D

=

=

(4.3)

Since VDD charges the capacitors through the channel of P1, some engergy is consumed by the channel resistance. When the input Vin goes from low to high, pMOS transistor P1 is switched off, and nMOS transistor N1 is switched on. Assume this process is sufficiently fast, then power supply VDD is cut off from the capacitors abruptly so that the power consumption caused by short-circuit effect is negligible, and it provides no energy during the process of discharging the capacitors. However, the stored energy W is completely consumed by the channel resistance of transistor N1 when the capacitors are discharged to the ground. The energy consumption for an input cycle is the sum of W and the energy dissipated in charging process. The average power consumption of an inverter can be estimated as

f

VDD

C

f

T

I

VDD

P

total

a

D

=

2

(4.4)

Some conclusions can be drawn on the basis of the simple analysis. Firstly, static CMOS logic has to drive the input capacitance of the pull-up network and the input capacitance of the pull-down network simultaneously. The pull-up network is composed with pMOS transistors and has larger capacitance. According to equation (4.2), this slows down the circuit because of the large capacitance; Secondly, static CMOS realizes a rail-to-rail output. According to equation (4.2), this also slows down the circuit because of the large swing. According to equation (4.4), this greatly increases the power consumption because of the large swing; Thirdly, according to equation (4.4), static CMOS logic consumes much power at high frequencies because the power consumption is proportional to switching frequency. Last but not least, static CMOS logic is sensitive to common mode noise because it is not differential. Therefore, high-speed CMOS digital design favors current mode logic (CML).

An important observation from the two serially connected inverters is that the output of the first inverter has finite slew rate. This is different from our previously assumption that the input to an inverter has very sharp edges and infinite driving capability. Therefore, the second inverter will not switch instaneously, and additional delay is added. This realistic consideration applies to all digital circuits.

4.1.2 CML Logic

CMOS CML logic is based on differential pairs which is shown in Fig.4.2. At the first glance, we find there are no pMOS transistors. Therefore, it is potentially faster than static CMOS logic. It is fully differential. Therefore, it has excellent immunity to common mode noise. When the input voltage vin is sufficiently large, one of the two branches can be switched off, while the other takes all the tail current I0. The minimum input voltage can be derived using the following equations.

(

)

2

2

,

1

2

2

,

1

2

th

gs

ox

V

V

L

W

C

I

-

=

m

(4.5)

0

2

1

I

I

I

=

+

(4.6)

2

1

gs

gs

in

V

V

V

-

=

(4.7)

Solving equation (4.5)((4.7) leads to an expression of I1 (or I2) [F.3]. The minimum voltage that can fully switch the differential pairs is given when this current is equal to I0. It is

)

/

(

2

)

min(

0

L

W

C

I

V

ox

in

m

=

(4.8)

The voltage swing is

0

0

)

(

)

0

(

RI

I

i

V

i

V

V

=

=

-

=

=

D

(4.9)

The voltage swing is the product of the load resistance and the tail current. Therefore, it is possible to reduce the voltage swing to improve the speed of the circuit. However, excessive reduction of voltage swing reduces the noise margin. In addition, it may not be able to fully switch the next stage differential pairs.

N

1

VDD

N

2

V

in

V

out

I

0

V

in

I

2

1

0.5

I

0

I

1

I

2

R

R

Fig.4.2. Differential pairs in CMOS CML Logic

Similar to static CMOS logic, the speed and power consumption of CMOS CML logic can be estimated by using two serially connected inverters. We still assume the input to the first inverter has very sharp edges and sufficient driving capability, then to the first order approximation, the output of the first inverter is essentially a step response of charging or discharging a capacitor with a current source of finite internal impedance. The change of the output voltage in a branch is written as

)

1

(

)

(

0

2

,

1

RC

t

o

e

RI

t

V

-

-

=

D

(4.10)

where R is the load resistance and C is the load capacitance of the first inverter. Fast switching only relies on small RC. However, to maintain the voltage swing, the tail current I0 has to increase. The speed of differential pairs can be enhanced by using inductive peaking technique. Physically this is quite straightforward. When all tail current is switched to one arm, the additional current is provided by the load resistor and the load capacitor. Since the voltage on a capacitor can not change abruptly, at the very beginning the additional current is almost sololy provided by the load capacitor and the output voltage changes quickly. However, when the output voltage drops, the current provided by the load resistor increases and the current provided by the load capacitor decreases. The slew rate of the output voltage drops. Inductive peaking technique connects an inductor and the load resitor in cascade. The nature of an inductor is that it always tries to hold back the change of current. Therefore, the current provided by the load resistor decreases and the current provided by the load capacitor increase, which helps to quickly drain off the charges stored on the load capacitor, leading to fast change of the output voltage.

The power consumption of a CMOS CML inverter to the first order can be estimated in a quite simple way.

0

I

VDD

P

=

(4.11)

Obviously CML logic consumes static power. However, to the first order estimation, the power consumption is independent on frequency. Therefore, CML is suitable for high frequency applications in terms of speed and power consumption.

4.2 Driving Circuits and Impedance Matching

Generally a high-speed SerDes transceiver must drive a high-speed channel that is usually much longer than the representative wavelength of the signal. If the impedance of the driver does not match the characteristic impedance of the channel, the driver is unable to provide maximum power to the channel because of reflection in the transmitter side; If the characteristic impedance of the channel does not match the impedance of the terminal, the channel is unable to deliver maximum power to the terminal because of reflection in the receiver side. If there is mismatch in both sides, then some energy reflected from the terminal will experience another reflection in the transmitter side. It tooks some time ((T) for this energy to complete this round-trip and suffer some loss. When it comes back, it addes to the signal that is sent at (T later. Therefore, impedance matching is important to high-speed SerDes transceivers. However, this increases power consumption because practical channels have low impedance. Industrial efforts on these topics lead to many standards. Low voltage differential signaling (LVDS) and CML are the two most popular standards.

V

+

I

0

V

-

V

-

V

+

VDD

VDD

VDD

I

b

Receiver

Fig.4.3. A diagram of representative LVDS signaling

The main advantage offered by LVDS is its low voltage swing of 250400 mV, which allows for high-speed interface operation at a very low level of power consumption. In addition, true differential signalling increases the interfaces tolerance to ground mismatch between transmitter and receiver. It also improves signal EMI immunity and compliance [E.1]. Fig.4.3 shows a representative LVDS configuration. In the transmitter side, the driver is configured in a push-pull topology. Matched impedance is added in front of the receiver buffer. In high-speed SerDes transceivers, the signals traveling through the channels are broadband signals. It becomes harder and harder to achieve full band impedance matching when parasitics are considered. Therefore, impedance matching in only one side can not guarantee small reflection in very high speed. For this reason, impedance matching is desirable in both the transmitter side and the receiver side, and the topology of LVDS signaling becomes very similar to CML signaling.

A respresentative configuration of CML signaling is shown in Fig.4.4. Impedance matching is added to both the transmitter and the receiver. Since CML consumes static power, it is quite popular to switch off the driver if not in use.

VDD

V

in

I

0

R

R

VDD

I

1

R

L

R

L

VDD

R

R

Fig.4.4 A block diagram of CML signaling

If channel impairments are negligible and clock and data can be recovered without receiver equalization, high-speed SerDes receiver buffers can be constructed using nonlinear amplifiers such as limiting amplifiers and sense-amplifiers. If receiver equalization is needed, then linear amplifiers are wanted.

5 PISO and SIPO

As discussed in section 2, in the transmitter side, user data are in parellel, a PISO block is needed to convert these parallel data into serial ones to make it possible to transmit them via a high-speed channel. In the receiving end, a SIPO block is needed to reduce the bit rate for the back-end circuit to perform further signal processing. It is straightforward that SIPO has a tree type structure and PISO has a reversed tree structure.

Control Logic

Control Logic

Control Logic

(a) One Stage

(b) Heterogeneous

(c) Binary Tree

Fig.5.1 Typical PISO and SIPO topologies

Typical topologies of PISO and SIPO are shown in Fig.5.1. The one stage structure is slow due to large parasitic capacitance at the converging node. The heterogeous structure is faster, the binary-tree topology is the fastest. For this reason, 2:1 multiplier (MUX) and 1:2 demultipliers (DEMUX) are important elements for high-speed SerDes transceivers.

In high-speed SerDes transceivers, there are usually tens of input ports of PISO and tens of output ports of SIPO. Therefore, in a binary-tree topology, some stages require very high-speed circuits; while some stages do not require very high-speed circuits. Therefore, for those stages requiring very high-speed, CML logic circuits are used; while static CMOS logic circuits can used for those stages not requiring very high-speed.

Shift Register is very useful for PISO and SIPO with a large ratio. In PISO parallel data are loaded to shift registers when a selecting signal is enabled. The parallel data are shifted out in every baud clock cycle when the selecting signal is disabled. In SIPO, serial incoming data are sampled and shifted in every baud rate clock. A lower frequency clock is used to sample the output of each register. Therefore, the outputs of all registers are synchronized.

In order to make PISO and SIPO work, high-frequency clock (baud-rate clock) and low-frequency clock (clock for parallel data) are needed. Therefore, a frequency divider and a clock multiplier are needed.

6 Clock Multiplier Unit

In a high-speed SerDes transceiver, high-speed clock is very important. Usually in the transmitter side, each symbol is generated under the control of a baud-rate clock; in the receiver side, a clock whose frequency is the baud-rate is needed to sample the received data at where SNR is the maximum. Real implementations may vary in some aspects. For example, the clock frequency can be lower than the baud-rate if a multi-phase clock is used. Even though, a very high frequency clocks is still needed in a multi-gigahertz transceiver. It is quite often that the transmitter and the receiver share a high-speed clock. Phase of this clock should be adjusted in the receiver side, because the delay of the channel is usually a prior unknown and timing jitter and noise are added to the received data through the channel, making the sampling phase very critical. The quality of the high-speed clock greatly influences the transceiver performance. Therefore, it should be clean and accurate. A free-running microelectronic integrated oscillator can not meet the requirements. Therefore, the high-speed clock is synthesized from an accurate low-frequency oscillator such as a quartz oscillator whose frequency accuracy is within e.g. (20 ppm. Usually this frequency synthesizer is not required to generate a clock of any frequency within a frequency band. Instead a clock of a fixed frequency or a clock that can be programmed to a few discrete frequencies is wanted. It is called clock multiplier unit (CMU) in a high-speed SerDes transceiver. A CMU can be a common integer-N frequency synthesizer. A dominant and mature technique that is used in the design of a CMU is PLL.

6.1 Basic PLL-Based CMU Structure and Performance

A representative structure of a PLL-based CMU is shown in Fig.6.1 (a). It is composed of a phase-detector, a charge-pump, a loop filter, a voltage controlled oscillator (VCO), and a divider. The charge-pump provides the necessary loop filter action. In classic PLL, the combination of the charge-pump and the RC network is usually replaced with an operational amplifier (op-amp). Although it is highly non-linear in practice, it is customary to assume linearity when analyzing loops that have achieved lock [H.14]. A Linearized model is shown in Fig.6.1. In the model all variables are phases rather than the actual inputs and outputs.

Phase

Detector

Divider

Clock

Out

N

Reference

Charge Pump and Loop Filter

VCO

N

S

f

in

f

e

f

out

(a)(b)

K

d1

-

K

D

K

O

/s

K

d2

t

z

+1

s

s

Fig.6.1 (a) a block diagram of a PLL-based CMU (b) a linear model

6.1.1 Second-Order PLL Dynamics

A PLL circuit is a feedback system that is designed to bring the phase error signal (e to zero. For several reasons, a second-order PLL is a good choice. The first reason is that theoretically a second-order PLL is unconditionally stable [F.1]. Higher order PLL however may lead to instability. In practice, a second-order PLL is not absolutely stable, because a practical phase-detector and divider are sampled systems [H.14]. In addition, there are many parasitic poles. The second reason to choose a second-order PLL is that a first-order PLL can not reduce the phase error (e to zero unless the forward loop gain is infinitely large. The closed-loop transfer function of a second-order PLL-based CMU is written as:

[

]

1

)

/

/(

1

1

)

(

2

+

+

+

=

=

s

N

K

K

s

s

N

s

F

z

O

D

z

in

out

t

t

f

f

(6.1)

where 1/(z is the frequency of the loop zero. For the convenience of analysis, we can omit N so that equation (6.1) is the close-loop transfer function for a classic second-order PLL. The phase error can be written as:

in

z

n

n

e

s

s

s

f

w

w

w

f

+

+

=

1

/

)

/

(

)

/

(

2

2

(6.2)

where the natural frequency (n is defined as:

N

K

K

O

D

n

/

=

w

(6.3)

We can look into the properties of a second-order PLL by investigating the frequency-domain response and time-domain response.

Response to Input Variations and Input Noise

Equation (6.2) has two poles and one 2-fold zero, which represents a high-pass filter. The phase error (e will be sufficiently small if the frequency of the input variations is significantly smaller than the natural frequency (n. In another word, the difference between the input phase signal (in and the output phase signal (out is small. Therefore, the PLL loop tracks the input variations. At frequency much higher than the natural frequency, the phase error (e will be the input phase signal (in, which means the PLL loop does not response to the input; it almost rejects the input variations completely. Therefore, a fundamental property of a second-order PLL is that it tracks the change of input at low frequencies but rejects the change at high frequencies.

Jitter Peaking

Phase noise or timing jitter is an unwanted input variation. We generally want it is attenuated by the loop. However, this idea contradicts our original purpose to force the VCO to track the input (reference) at low frequencies. Therefore, there is a tradeoff between them. In fact, the design of a PLL is a tradeoff among many contradicting requirements. A desirable feature is that a PLL genuinely copies the input to the output at low frequency but rejects the input at high frequency. However, jitter peaking is an unwanted feature of a second-order PLL. The close loop transfer function or equation (6.1) has two poles and one zero. In a Bode-Plot, the transfer function magnitude goes up with a slope of 20 decibels per decade after the zero location. It exceeds unity if both poles are located at higher frequencies than the zero. A flat area with a magnitude above 0 dB appears after the first pole as the first pole contributes -20 dB/dec. The flat area ends and goes down to below 0 dB with a slope of -20 dB/dec after the second pole. In the frequency range where the magnitude exceeds one, jitter peaking appears. If the input jitter frequency is within this range, this jitter will be magnified in the output.

Rejection of Noise

Input jitter (reference noise) is only one noise source of a second-order PLL. Noise is generated in every component in a real circuit. In many linear models, noise is treated as additive. It is useful to look into the transfer function of those noise sources to know if the PLL attenuates or amplifies them. Seen from equation (6.1), the transfer function of the reference noise is of a lowpass nature but may have jitter peaking in a specific frequency range. The transfer function of phase-detector noise and charge-pump noise are similar to equation (6.1). The transfer function of the loop filter noise is

[

]

1

)

/(

/

2

+

+

=

s

K

K

s

K

s

z

O

D

D

in

out

t

f

f

(6.4)

It exhibits a band pass nature. A time-domain step response can reveal much information. Assume this noise causes a frequency error of ((i, the time-domain phase-error can be expressed as [H.14]

)

1

sinh(

)

exp(

1

)

(

2

2

t

t

t

n

n

n

i

e

-

-

-

D

=

-

x

w

xw

x

w

w

f

(6.5)

where ( is the damping ratio defined as

2

z

n

t

w

x

=

(6.6)

The maximum phase-error (max appears at tmax.

)

1

(

sinh(tanh

)

1

(

tanh

1

exp

1

2

1

2

1

2

2

max

x

x

x

x

x

x

x

w

w

f

-

-

-

-

-

D

=

-

-

-

n

i

(6.7)

)

1

(

tanh

1

1

2

1

2

max

x

x

x

w

-

-

=

-

n

t

(6.8)

In the case of high dampening, they can be simplified.

c

i

w

w

f

D

-

max

(6.9)

c

t

w

x

2

ln

2

max

(6.10)

where (c is the crossover frequency (frequency where the open loop transfer function is 0 dB). It is intuitively obvious that the VCO disturbance (jitter) gets integrated over a period of time before the loop takes action to correct it. Jitter integration is inversely proportional to the loop bandwidth (roughly the crossover frequency). Therefore, rejection of loop filter noise requires high loop bandwidth. The transfer function for VCO noise is of a high-pass nature and is analytically expressed as

1

)

/

(

)

/

(

2

2

+

+

=

s

s

s

z

n

in

out

t

w

w

f

f

(6.11)

The noise of a VCO is not white. In a LC resonator, the broadband thermal noise of passive components is shaped by the resonator Q and the normalized phase noise is inversely proportional to the square of frequency offset (((). When the frequency offset is sufficiently large, the noise spectra flatten due to active elements (such as buffers). At sufficiently small frequency offset, the phase noise spectra possess a 1/((()3 region. Leeson has proposed an empirical equation for VCO phase noise [F.2]

D

D

+

D

+

=

D

)

1

(

)

2

(

1

2

10

)

(

3

/

1

2

0

w

w

w

w

w

f

sig

Q

P

FkT

L

(6.12)

where L is the normalized phase noise, F is an empirical factor, ((1/f3 is a fitting parameter, Q is the quality factor of the resonator, and Psig is the signal power.

Charge sharing, current mismatch, and reference feed-through of charge-pump can cause spurs in the PLL output [F.3]. The spurs directly go into the jitter budget of VCO. Therefore, they need careful design.

Acquisition Time

A rough definition of acquisition time is the time it takes for a free-running VCO to lock to the input. In some literatures, acquisition time is divided into frequency acquisition time and phase acquisition time. Assume a free running VCO is running at angular frequency ( and its phase is zero at the time instant t=0. The input signal has a frequency of (+(( and its phase is (0 at t=0. According to the linearized second-order PLL model, the time-domain expression of the phase-error is

-

+

-

-

D

+

-

=

)

1

cosh(

1

)

1

sinh(

)

(

)

exp(

)

(

2

0

2

2

0

t

t

t

t

n

n

n

i

n

n

e

x

w

f

x

w

x

w

w

x

w

f

xw

f

(6.13)

Equation (6.13) is similar to equation (6.5). Therefore, a similar conclusion can be drawn, which is the acquisition time is inversely proportional to the loop bandwidth. Therefore, it generally requires a high bandwidth for fast acquisition.

Control Line Ripple and Higher-Order Poles

A real PLL circuit is not linear at all. Therefore, a linearized second-order PLL model fails to represent a real circuit in some important aspects. One issue needs to be addressed is the ripple on the VCO control line. In the linearized model, it is assumed the phase-detector is a linear subtractor. Under lock condition, the phase-error is zero. Therefore, VCO remains undisturbed when the loop is locked. However, in a real circuit phase-detector or the combination of charge-pump and phase-detector may be highly nonlinear. For example, if the phase-detector is a multiplier-type, there will be higher-order mixing products; if the phase-detector is a bang-bang type, there are always pulses in the phase-detector output. Therefore, in some PLL circuits, there are ripple components on the VCO control line even under lock condition. In general, the ripple components are at high frequencies than the reference. Reducing bandwidth leads to higher attenuation at those high frequencies thus can be helpful. However, rejection of VCO noise and fast acquisition require a high bandwidth. A better solution is to introduce additional poles. Determining how many poles should be added and where those poles should be placed needs careful design. Too many poles can easily degrade the phase margin. It has been shown that one or two higher-order poles can attenuate the ripple by about an order of magnitude or more if they are placed around a factor of 4-7 above crossover [F.4]. Fig.6.2 (a) shows a charge-pump loop filter with one extra pole. Fig.6.2 (b) shows a classic loop filter with an extra pole.

Clock

Out

Phase

Detector

Divider

Clock

Out

N

Reference

Charge Pump and Loop Filter

VCO

(b)

Phase

Detector

Divider

N

Reference

Charge Pump and Loop Filter

VCO

(a)

Up

Down

Fig.6.2 loop filter with one extra pole (a) charge-pump (b) active filter

6.1.2 Divider Delay and Phase-Detector Delay

CMU based on integer-N frequency synthesizer needs a frequency divider in order to synthesize high frequency clock out of a relatively low frequency reference. The divider implies the phase detector is digital in nature. In fact phase detectors and phase frequency detectors for high-speed SerDes transceivers are almost digital ones. Although the loop filter and the VCO may be analog, continuous-time circuits, knowledge about phase error is available to the loop only at discrete instants. It usually involves a sample-and-hold (S&H) operation to convert a continuous-time signal into a discrete-time signal. A zero-order hold (ZOH) function has a transfer function given by

[

]

s

e

dt

e

T

t

u

t

u

s

H

sT

st

-

-

-

=

-

-

=

1

)

(

)

(

)

(

0

(6.14)

The phase of this transfer function is

2

)

2

(

sin

)

(

2

/

T

T

c

T

e

j

H

T

j

w

w

w

w

-

=

-

=

-

(6.15)

The ZOH function adds additional phases (delay) to the loop transfer function. The period of a divider output is usually much larger than the period of the VCO clock. For this reason, its delay is more critical. Divider delay and phase-detector delay erode the phase margin of a PLL loop. As a consequence, the loop bandwidth is forced to decrease in order to avoid these effects. However, a reduced bandwidth may negatively influence settling time and noise performance. In practical implementation, phase comparison rate is set to about 10 times of the crossover frequency [H.14].

6.1.3 Granularity Problems

Since the PLL loop operates on a sampled basis and not as a straightforward continuous-time circuit, it has more stability problems than arise in continuous-time systems. In particular, an analog, second-order PLL is unconditionally stable for any value of loop gain, but the sampled equivalent will go unstable if the gain is made too large. Even a first-order digital PLL can be unstable [7]. It has been shown in reference [F.1] that a second-order PLL which is based on a classic tristate phase-detector and a charge-pump has a stability limit as

+

=

)

1

(

/

1

'

t

w

p

t

w

p

i

i

K

(6.16)

where (i is the angular frequency of the reference, (=RC is the loop filter time constant, and K' is

p

2

2

'

C

R

I

K

K

p

O

=

(6.17)

where Ip is the charge-pump current, KO is the VCO gain, R is a resistor in the loop filter, and C is a capacitor in the loop filter. For a first-order digital PLL, the loop gain should be smaller than 2. However, if a loop delay of M symbol intervals is introduced, the stability range is reduced to [7]

+

)

1

2

(

2

sin

2

0

M

K

p

(6.18)

6.1.4 Digital PLL

A critical problem of a conventional PLL-based CMU is its sensitivity to process variations, noise from power and substrate. Another problem is the limited voltage headroom associated with low-power, deep sub-micrometer CMOS process. In addition, the loop capacitor consumes chip area. Digital PLL is a solution to those problems. In a digital PLL, digital accumulator replaces the loop capacitor (integrator), and a DCO (digitally-controlled-oscillator) replaces the VCO. In general, LC oscillators have superior phase noise performance to ring-oscillators. However, it is difficult to make LC oscillators digital. A solution has been proposed in [F.5]. The DCO is a typical differential negative-resistance LC oscillator. Instead of one varactor, many varactors are used. The varactors are arranged in serval banks and are connected in parallel. CMOS varactors made with low-voltage deep sub-micrometer technologies exhibit very narrow linear tuning range. Interestingly the capacitance-tuning voltage curve looks like the input-output curve of a CMOS inverter. Therefore, the varactors can be made digital by setting the tuning voltage to two proper values. The differential varactor can be as small as a few attofarads (aF) [F.6]. Good frequency resolution can be achieved by switching a unit varactor on or off. Finer frequency resolution can be achieved by applying sigma-delta modulation to the unit varactor. The DCO enables a true phase-domain signal processing. Therefore, spurs due to nonlinearity is greatly suppressed [F.7]. In addition, the whole digital PLL can be retimed to the VCO clock. For this reason, the digital switching noise is mixed to become DC offset. The asynchronym between the VCO oscillation and system reference clock is compensated by using a time-to-digital converter (TDC).

Tradeoffs in PLL-Based CMU Design

A PLL-based CMU faces many conflicting requirements. The following table shows some tradeoffs of a PLL-based CMU design.

High Bandwidth

Low Bandwidth

Rejection of input noise, PD noise, and charge pump noise

(

Rejection of loop filter noise and VCO noise

(

Fast acquisition

(

Reduction of jitter integration

(

Rejection ripple on VCO control line

(

Improve loop stability against parasitic poles

(

6.2 Basic DLL-Based CMU Structure and Performance

In PLL-based CMUs, the output clock is directly derived from the VCO oscillation, and the loop filter has lowpass filtering effect for the input. If the reference is noisier than VCO oscillation, there is an obvious advantage. However, in practice, the reference is much cleaner than the VCO oscillation in high-speed SerDes transceivers. Therefore, it is desirable to directly derive the output clock from the reference. This idea is a fundamental concept behind DLL-based CMU design. Basic DLL-based CMU structures are edge-combiners and cyclic reference injection ring oscillators.

6.2.1 Tunable Delay Cells

Intrinsic delay of logic gates can be used in DLL. If N identical gates are serially connected, the total delay is N times the delay of a unit cell. Adding or removing one gate results in a change of the total delay. In high-speed SerDes transceivers, the delay of such a unit cell is not trivial compared with a symbol period. In addition, the delay is process dependent Therefore, this method can not provide very fine phase resolution and can only be used for coarse tuning [F.8], [F.9]. The beauty of this method is that it provides a kind of digital delay. If the unit delay provided sufficiently fine phase resolution, the problem of unit cell mismatch and process dependence would be solved by advanced digital signal processing algorithms such as calibration. Fig.6.3 shows a CMU based on digitally tuned delay cells [H.31].

Phase

Detector

Digital

Integrator

Time to Digital

Converter

DividerDCO

Reference

C[0]C[1]C[N]

C[m]C[m+1]C[m+N]

Fig.6.3. A CMU based on digitally tuned delay cells

Delay is a physical process. It can be roughly classified into two catalogues. The first is caused by charging or discharging a capacitive load. The second is caused by finite propagation speed. A representative of the second type may be a piece of transmission line. However, it is hard to tune its delay. The first type provides some freedom to tune the delay, since the time it takes to charge or discharge a capacitive load is determined by the current, the load capacitance and the voltage swing, we can tune the delay by adjusting the current or/and the capacitance if the voltage swing is fixed.

V

ctr

V

out

I

(a)

VDD

V

in

I

V

in

V

out

VDD

V

ctr

(b)

V

out

+

(c)

VDD

V

out

-

V

in

+

V

out

-

V

ctr

V

ctr

Fig.6.4 some schemes to tune unit delay cells

Fig.6.4 shows some schemes to tune the delay of a unit cell [F.10], [F.11]. Fig.6.4.(a) is a current-starved-inverter. The control voltage Vctr can control the current that flows through the inverter to the load capacitor to tune the delay. Fig.6.4.(b) is an RC delay cell. Vctr can control the on-state resistance of the nMOS transistor that connects the capacitor. Fig.6.4.(c) is a differential delay cell. The control voltage can control the current that flow through the positive feedback pMOS transistors. The idea to control delay through current can also be found wide use in phase interpolators and phase mixers [F.12], [F.13].

6.2.2 Edge Combining DLL-Based CMU

A DLL can only generate one delayed version or some delayed versions of the reference. A DLL-based CMU must include additional circuits to generate an output clock whose frequency is an integer multiple of the reference frequency. Edge combiner serves this purpose.

Fixed-Ratio Edge Combining CMU

In a fixed-ratio edge combing CMU, the reference is delayed by an N stage voltage-controlled-delay-line (VCDL). A phase detector detects the phase difference between the reference and the output of the last unit-delay-cell. When the VCDL has a total delay of 2(, the phase error is zero and the DLL loop is locked. Assume the unit-delay-cells are identical, each stage will have a delay of 2(/N. The edge combining logic circuits combines edges of each stage. The highest output clock will have a frequency that is N/2 times of the reference [F.14]. A fixed-ratio edge combing CMU is shown in Fig.6.4.

Phase

Detector

Reference

Up

Down

Edge Combiner

Clock Out

f

0

f

1

f

2

f

3

f

0

f

1

f

2

f

3

f

4

f

5

Clock Out

Fig.6.4. Fixed-Ratio edge combining CMU

Despite the simplicity of the fixed-ratio edge combining CMU, it has a few problems. The first problem is its susceptibility to false lock to harmonics, because the reference is a periodic signal, a delay of 2k( is equal to a delay of zero; furthermore, the unit-delay-cell usually has large tuning range. The second problem is its fixed ratio between input and output. The third problem is that it is sensitive to mismatch. In practice it is impossible to make the unit delay cells identical. As a consequence, the output clock will have strong pattern dependent jitter and duty cycle mismatch.

A number of solutions can be used to prevent false lock to harmonics. One method is to use lock detector [F.15]. Information of lock to harmonics can be detected if phase information of each unit delay cell is used.

Programmable Edge Combing CMU

A programmable edge combining CMU extends the applications of edge combing CMU. A straightforward method is to build some logic circuits to selectively feed the delayed clocks ((0, (1, ... , (N) to the edge combiner [F.16], [F.17]. In [F.17], a total number of N identical delay cells are connected in cascade to form a VCDL. A multiplier factor controller can select M delay cells out of them and mask the rest. The output of the last delay cell is feed to the phase detector. Therefore, under lock condition, the delay of each cell is 2(/M. At the rising edge of the output of any of the M delay cells, the edge combiner toggles its output. Therefore, the frequency of the output of the edge combiner is M/2 times the frequency of the reference. Since the number M can be programmed, the edge combining CMU is programmable.

There are many challenges in designing of a programmable edge combing CMU. Firstly, it usually incurs complicated logic circuits that are slow and power consuming. Secondly, the problems of an ordinary edge combing CMU remain. Those problems include harmonic locking, and susceptibility to mismatch.

6.2.3 Cyclic Reference Injection DLL-based CMU

It is not an easy work to solve mismatch in high-speed and low power deep submicrometer CMOS processes. Therefore, edge combining CMUs tend to have high spur in the output. Conventional ring oscillators are not susceptible to mismatch, because the oscillation circulates all delay cells. Conventional ring oscillators can be made programmable by using a programmable divider too. However, conventional ring-oscillator-based CMUs suffer from jitter integration. Cyclic reference injection can solve the problem [H.1]. Fig.6.6 shows a block diagram of a cyclic reference injection DLL-based CMU. The circuit can work in a ring oscillator mode or a direct delay line mode, depending on the switch (MUX). When the delay cells are connected in the form of a ring oscillator, the CMU becomes a conventional PLL. It suffers from jitter accumulation, but not from mismatch. However, the jitter accumulation can only persist for several periods and will be eliminated periodically by an injected clean reference. However, it is very challenging to align the edge of the reference and an edge of the oscillation of the ring oscillator. This usually leads to high spur in the output.

PD

Reference

Up

Down

Clock Out

MSel

Fig.6.6 A block diagram of cyclic reference injection CMU

6.2.4 Reduction of Phase Noise

In a DLL-based CMU, there are many phase noise and timing jitter sources. The main sources include the in-lock error due to the mismatch in the charging and discharging current sources in the charge pump, the mismatch of the phase detector outputs, and the phase noise due to the mismatch among the delay stages, edge combining cells in the edge combiner based ones, or the re-alignment error caused by the reference injection in the cyclic reference injection multipliers. All of those errors can be considered as the systematic in-lock error.

A number of techniques can be applied to mitigate some of the problems. In [F.18], the static phase error due to the imbalance of the mismatches in PFD/CP is compensated by adding a second low-bandwidth loop. The compensation loop is digital, and it comprises of a bang-bang PD and an accumulator to implement an integrator with infinite DC gain, and the output of the integrator controls a current digital-to- analog-converter (DAC) that leaks current from either side of the charge pump. The harmonic-locking problem of a cyclic reference injection CMU is solved in [F.19] by adding a logic circuit to dynamically control the switch and the divider. In [F.20], chopping, auto-zeroing and various other circuit techniques are employed to reduce static phase offset and crosstalk between the reference and the output clock.

6.3 Comparison between PLL-Based CMUs and DLL-Based CMUs

According to Mark A. Horowitz, in CMU design one can mess up DLL and PLL, because either has its own strength and weakness. If designed correctly, either works well, and jitter will be dominated by other sources [F.21]. Simple comparisons are made in the following paragraph in terms of implementation easiness, jitter accumulation, jitter transfer, stability, and acquisition time.

If the reference is much cleaner than the VCO, DLL-based CMUs are more advantageous than PLL-based CMUs in terms of jitter performance. Conversely, if the reference is nosier than the VCO, then PLL-based CMUs may be better than DLL-based CMUs. Both PLL and DLL have filtering effect and jitter-peaking effect to input jitter. PLL always exhibits a lowpass filtering effect to input jitter; DLL exhibits an allpass filtering effect to input jitter, but the transfer function can be changed to a lowpass one if some techniques such as loop filtering and phase filtering are involved [F.22]. Jitter accumulation is generally a more serious problem to PLL than DLL; while spurs are commonly a more severe problem to DLL-based CMUs than PLL-based integer-N CMUs. PLL and DLL are usually modeled as second-order linear feedback system and first-order linear feedback system, respectively. However, both of them can be unstable, since they are sampled systems and parasitic poles exist.

7 Equalization

In the past years, transfer rate of high-speed serial data links has been ever increasing. Meanwhile cheap and low quality transmission lines are still extensively used in many applications to save cost. With the increased data rate, various impairments become more and more severe. For example, Fig.7.1 shows a measured backplane transfer function [G.1]. The physical channel exhibits considerable phase distortion and amplitude attenuation at frequencies above 4 GHz. The time domain impulse response of this channel

Fig.7.1. Measured performance of a Tyco backplane

The physical channel exhibits considerable phase distortion and amplitude attenuation at frequencies above 4 GHz. The time domain impulse response of this channel dampens considerably after one symbol period from the time instant of the amplitude peak if baud rate is low, for example below 1 Gbps. In this case, ISI is not a serve problem if the maximum runlength is constrained. However, the time domain impulse response of this channel does not dampen sufficiently even after several symbol periods after the time instant of the amplitude peak, if the baud rate is high, for example 10 Gbps. In this case, some zero crossing points may be missing and some sampling points at the data center change their polarity. The eye is completely closed and clock and data recovery is impossible without proper equalization.

In principle, the impairments can be reduced considerably by replacing the low quality transmission lines with high quality ones or by equalization [G.2]. High quality transmission lines tend to vastly increase the cost; while sub-micrometer CMOS technologies and equalization usually provide excellent cheap solutions. The impairments of physical channel are also strongly dependent on the length of the transmission lines. A short transmission line may not need any equalization. In some applications, the length of the transmission lines may vary considerably and their prosperities is not time invariant. Therefore, considerable efforts in the design of high-speed serial data links are paid to adaptivity.

7.1 A catalogue of equalization schemes

Equalization methods can be linear or non-linear. Equalizer can be implemented in the transmitter side or receiver side or both sides. In microelectronic circuit implementation, it can be either continuous time (un-sampled) or discrete time (sampled). The signal amplitude can be discrete (digital) or continuous (analog). The equalizer can also be either adaptive or fixed, and the adaptive algorithm can be zero forcing (ZF) or LMS (or minimum-mean-square-error: MMSE) or some nonlinear approaches. In addition, the equalizers target response can be either full response or partial response. In the case of sampling equalizer, it can be baud rate sampled or over-sampled. Filter design can be either FIR or IIR. Therefore, there is quite a big set of combination of the above equalization schemes. Each scheme has its pros and cons regarding to a specific application or a specific history. The equalization schemes profoundly interact with CDR structure.

7.2 Tx equalization

Tx equalization is usually called pre-emphasis as it always tries to emphasize the frequencies where high attenuation is located or de-emphasize the frequencies where low attenuation is located to make a flat frequency domain response across the passband. Since the clock is readily available in the transmitter side; the inputs to the equalizer are usually binary data; and the noise from the channel does not play an important role. Tx equalizers are usually simpler than Rx equalizers. However, Tx equalizers do not have the ability of adaptivity unless a backchannel is added. In addition, Tx equalizer is constrained by the peak power of the transmitter as the power is truly consumed by the load resistors [G.3.].

Tx equalizer can be implemented in continuous time analog circuits [G.4.]. This kind of equalizer is simply a continuous time analog high-pass filter. It has limited tuning range and a constant group delay is difficult. Tx equalizers are almost exclusively of a discrete time finite impulse response (FIR) feature [G.5.], since the clock is available and the input to equalizer is digital. Usually FIR Tx equalizers are implemented in symbol spaced current domain [G.6], [G.7], [G.8], [G.9]. It does not help the loss at Nyquist frequency. This scenario is shown in Fig.7.2. The tap delay is achieved by D type filp flop (DFF), and tap coefficents are controled by bias current. The bias current is set by a digital to analog converter (DAC). The control of the bias current usually involves linear transconductors and current mirrors.

0

)

0

(

)

(

I

I

C

i

C

i

=

(7.1)

=

-

=

N

i

in

out

i

k

D

i

C

k

D

0

)

(

)

(

)

(

(7.2)

DFFDFF

D

in

P

D

inN

Clock

I

o

I

1

I

N

Q

P

Q

N

Q

P

Q

N

D

out

P

D

out

N

Fig. 7.2. A current domain Tx equalizer

A Tx equalizer can also be implemented in the time domain by utilizing pulse width modulation [G.10]. In this scheme, the duty cycle of baseband shaping pulse c(t) shown in Fig.2.2 is not a hold function for one symbol period, but a biphase or Machester code pulse whose duty cycle is manipulated to shape the combinational channel impulse response. This may be beneficial in deep sub-micrometer CMOS circuits, because the time resolution is better than the voltage resolution in deep sub-micrometer CMOS circuits. Furthermore, this solution is less constrained by the voltage headroom. However, this solution may not work well when the transfer function of the physical channel is complicated due to reflection and parasitic resonance. Duty cycle manipulation has insufficient variations to match channels with complicated transfer function.

7.3 Rx continuous time analog linear equalizer

Rx equalizer has much more varieties of implementations than Tx equalizer, and adaptivity can be realized in Rx equalizer. Physical channel of high-speed serial data links is usually of a low-pass nature. A passive high-pass filter (HPF) or active HPF is able to flatten the joint frequency domain channel response and reduce ISI. The HPF can be continuous time analog filter composed of passive RLC network [G.11]. It can also be active filters based on operational power amplifier (Opamp) [G.12] or transconductor-capacitor (Gm-C) [G.13.].

Fig. 7.3. A high HPF cell and its transfer function

CML buffer is a kind of natural equalizer. This has been exploited in [G.6] and [G.14]. As shown in Fig.7.3, the low frequency gain can be tuned by a MOS resistor M1 and the high frequency gain can be tuned by varactors Cd1 and Cd2. The high frequency gain can be further boosted by an on-chip inductor [G.15.]. In addition, the HPF cells are usually connected in cascade to give more gain at high frequencies.

Rx continuous time analog linear equalizers are not sampled. Therefore, clock jitter does not affect their performance. However, there are many challenges for this kind of equalizer. Some challenges are listed as follows

It has limited tuning range and rarely matches channel, especially when there are both frequency dependent attenuation and frequency dependent delay.

Linearity is a challenge, especially when input swings vary greatly in amplitude.

Limited by gain bandwidth of each stage of differential-pair.

It is sensitive to PVT variations.

It is sensitive to device mismatch and non-linearity.

Offset cancellation and calibration are difficult.

Multi-stage can achieve high gain, but it can also lead to clipping.

Continuous time analog Rx linear equalizers are sometimes used as pre-equalizer for decision feedback equalizer. The task is to make the impulse response causal, with most of its energy concentrated in the time origin (with some fixed delay). It is also desirable to have a noise whitening filter functionality so that the DFE works best [7].

7.4 Rx FIR equalizer

The basic structure of an Rx FIR equalizer is shown in Fig.7.4. The main building blocks are delay cell, multiplier and adder. There are many variations in practical implementations, since each block can be either continuous time or discrete time, either analog or digital.

The delay cell can be implemented in continuous time analog circuits. An ideal delay cell is an all-pass filter whose group delay is constant but tunable. An ideal all-pass filter is not realizable. In practice, low pass filter is used to approximate it. Active Opamp-MOSFET-C filter using Bessel type polynomials [G.16], [G.17], [G.18] or Gm-C filter using Butterworth polynomials work very well up to gigahertz [G.19]. Inverter-based delay units with active inductor load (INV-AIL) are reported in fractionally spaced equalizer up to 2.5 Gbps [G.20]. In very high baud rate such as 40 Gbps, passive LC network or transmission line is usually used as delay cell [G.21]. The continuous time analog delay cells are not affected by the clock jitter. They enable Rx continuous time analog FIR equalizer. However, they also suffer from the challenges for high-speed analog CMOS design.

Delay cells can also be implemented in discrete time manner. The tapped delay can not be a simple DFF, which is very different from Tx FIR equalizer, because the received signals are basically analog. The tapped delay usually incurs sample and hold.

DelayDelayDelayDelay

Data In

C

N

C

1

C

0

C

N-1

Data Out

Fig. 7.4. Schematic diagram of Rx FIR equalizer

The multiplier is usually implemented in the current domain. A linear voltage to current (V-I) converter (transconductor) is needed to convert the voltage of each tap to current [G.19], [G.22]. The coefficients of the equalizer are realized by weighting the current. The coefficients are first normalized so that they do not exceed one. The current is led to a network of differential CMOS pairs. In any time one transistor of the differential pair is in the triode region while the other is off. The drain of all differential pairs are connected to the output of the V-I converter. In each pair, the source of one transistor is connected to the ground, while the source of the other transistor is connected to a shared output resistor. The differential pairs are controlled by weight setting logic circuits. When all transistors connected to the ground are on, there is no current passing through the shared resistor and the weight is zero. If they are all off, the weight is one. The weighted current is mirrored. The current adder can be simply realized by connecting mirrored current together.

Rx FIR equalizer is difficult to implement. If it is continuous time, the delay cells are difficult, with little flexibility and limited tuning range. If it is discrete time, it is susceptible to clock jitter. If it is digital, it is very challenging for high-speed. In addition, if the discrete time equalizer is symbol space sampled, the output only contains samples at data center. Additional efforts are needed to find the zero-crossing points if threshold type CDR is used. Furthermore, Rx FIR equalizers are linear equalizers which tend to amplify noise and crosstalk. Therefore, in practice, Tx FIR equalizer and Rx decision feedback equalizer are more commonly used in SerDes transceivers.

7.5 Decision feedback equalizer

Decision feedback equalizer has many advantages over linear equalizers in signal processing and microelectronic circuit implementations. A simple DFE diagram is shown in Fig.7.5. The feedback equalizer has the same structure as a Tx FIR equalizer, thus the circuits and techniques for Tx FIR equalizer directly apply to the feedback filter. From the systems point of view, DFE does not need to have a flat spectrum across the passband, thus does not enhance noise and corsstalk. In addition, although noise still injects into the feedback filter through each tap, the noise level is reduced by the nonlinear decision circuit.

The impulse response of the physical channel is rarely minimum phase or with most of its energy concentrated near the time origin. Suppose the maximum amplitude appears at n2T+( ((=n2-1, the precursor ISI is effectively zero. The residual ISI is zero if the channel impulse response only lasts for a period shorter than the taps of the feedback filter. As a physical channel usually does not satify these requirements, a pre-filter is needed to reshape the channel response to effectively reduce precusor ISI and reduce the taps of the feedback equalizer. This can be done with a Tx equalizer or/and a prefilter (feed forward equalizer FFE) in the receiving end [G.23], [G.24], [G.25], [G.26]. The FFE can be implemented in either continuous time [G.27] or discrete time.

DFE can be implemented in either continuous time [G.28] or in discrete time. When it is implemented in continuous time, it is not affected by clock jitter. From equation (7.3) we can see that DFE is much less sensitive to clock jitter than linear FIR equalizer. Nevertheless, a direct implementation of DFE can consume significant power, area, and complexity since it involves resolving the previous data and using them to add an analog value to the input within the next symbol period [G.24]. Loop unrolling can relax the requirements. Fig.7.6 shows a one-tap unrolled DFE.

Mux

-a

+a

x(n)

QD

Fig. 7.6. schematic diagram of an unrolled DFE

The received signal x(t) is sampled at time nT+( and the sampled signal is denoted as x(n). If the first two taps of the impulse response of the channel is 1 and (. The signal to the decision circuit is

-

=

-

+

+

=

-

-

=

-

-

=

-

=

1

)

1

(

when

,

)

(

1

)

1

(

when

,

)

(

)

1

(

)

(

)

(

)

(

)

(

k

a

k

x

k

a

k

x

k

a

k

x

k

f

k

x

k

y

a

a

a

(7.4)

The decision circuit will make a decision based on y(k). As we see from equation (7.4), the decision can be made according to the comparison between x(k) and +( or x(k) and -(. The previous symbol a(k-1) is used to choose +( or -(. This unrolls the DFE into the structure shown in Fig.7.6, where no multiplier and summation are needed. The unrolling method can be extended to partial response DFE and PAM-4 DFE. Unfortunately the complexity of DFE goes up by 2N (N is the number of taps) and the calibration of offset levels are needed. In addition, the probability of error propagation goes up when number of taps increases. In practice, Tx FIR pre-emphasis and FFE are used together with DFE. Those linear equalizers cancel precursor ISI and DFE cancel postcursor ISI.

7.6 Digital equalizer

Digital equalizers have many advantages over analog equalizers. It does not inject noise from each tap node, which is a problem for FIR based analog equalizers, and it is not sensitive to PVT variations and power and substrate noise. In addition, it is possible to implement much more sophisticated equalization method in digital domain. Nevertheless, digital equalizers require very ADC, which is very challenging in high-speed serial data links. However, with the scaling down of CMOS processing, a baud rate digital equalizer has been reported using 65 nm CMOS technologies [G.29]. According to Nyquist sampling theorem, a sampling frequency higher than baud rate is needed if there is frequency offset or phase drifting in the incoming data. A sample rate converter is wanted when the sampling clocks frequency is not the baud rate [G.30].

7.7 Bandwidth compression

Full response is aimed to eliminate all ISI at data centers, which requires a flat (folded) spectrum across the passband. However, a physical channel always exhibits a strong lowpass nature. In another word, in many applications, full response equalization does not match the channel response well. Partial response [G.31] is not aimed to eliminate all ISI, but rather to utilize the controlled ISI or constructive ISI. Therefore, it requires fewer boosts for high frequencies, has less noise enhancement and matches the channel better. Duobinary is a widely used partial response method [G.32]. Duobinary has the advantage to eliminate ISI at data transitions. For this reason, an equalizer aiming at a target response as duobinary is also called edge equalizer [G.33]. This is beneficial for clock recovery. However, the received signal has 3 levels because of ISI. Recovery of one bit datum becomes dependent on the current received symbol and its previous symbol. This leads to possible error propagation. Error propagation can be circumvented by a precoding technique, such as Tomlison-Harashima precoding [G.34], [G.35]. The precoding technique also applies to DFE. Two-level signaling is also possible if the shortest transitions are removed [G.33].

Strictly speaking, duobinary is not a bandwidth compression method. Duobinary signaling has the same symbol rate and bit rate rate as NRZ. If there is a hard cutoff frequency, the maximum bit rate that duobinary achieves is the same as NRZ. In reality, duobinary is usually regarded as a bandwidth compression methond because it does not require a flat spectrum as target response. True bandwidth compression method is based on multi-level such as PAM-4. Equalization methods for multi-level transceiver are basically the same as binary system. However, the microelectronic implementation is more complicated because of the additional levels [G.26]. Multi-level transceivers are usefull for highly loss channels.

7.8 Adaptation

In high-speed serial data links, it is desirable for equalizers to have adaptivity. Firstly, there is an ensemble of fixed channels, but which one is available is a priori unknown. For example, the length of a high-speed Ethernet cable may vary; the properties of backplane channels are different due to fabrication variations. Second, a physical channel may not be time invariant. Adaptive equalizer is usually implemented in the receiving end, because the channel properties can be detected from the receiving signals. Adaptation is not possible in the transmission side unless a back channel is specified.

S

noise

a

k

h(t)

r(t)

n(t)

r(k)

c(k)

y(k)

S

a

k

e(k)

Minize [e(k)]

2

delay

Fig.7.7. A block diagram of an adaptive linear equalizer

Fig.7.7. shows a block diagram of an adaptive linear equalizer. The sampled received signal r(k) contains various impairments such as ISI and noise. The linear filter c(k) is able to eliminate much ISI if the coefficients are correctly set to match the channel. The difference between the output of the equalizer and the original data a(k) is

)

(

~

)

(

)

(

)

(

~

)

(

)

(

0

k

a

n

c

n

k

r

k

a

k

y

k

e

N

n

-

-

=

-

=

=

(7.5)

where (k) is a delayed version of the original data sequence a(k) to match the channel delay. In practice we do not know the original data sequence. Otherwise, we do not need any equalization or adaptation, since the original data sequence is already known. However, (k) can be an exact replica of (k) if correct decisions have been made.

If the coefficients have been correctly set, y(k) should be equal to of (k), and e(k) should be always zero. In practice, we inevitably have noise and residual ISI. Therefore, our cost function (error) is defined as the square of the e(k). To optimize the system, we need to bring the error to its minimum. Conventionally gradient descent method is very useful to find the minimum. It is done by iteration.

[

]

x

-

=

+

)

(

2

1

k

e

c

c

n

n

(7.6)

where is the coefficient vector, and ( is the step. Since we have assumed a linear filter, the gradient can be derived as

[

]

)

(

)

(

2

)

(

2

k

r

k

e

k

e

(

=

(7.7)

where (k) is a vector defined as

[

]

)

(

...

)

1

(

)

(

)

(

N

k

r

k

r

k

r

k

r

-

-

=

(

(7.8)

Equation (7.7) shows that the gradient is reduced to the product of e(k) and (k). Even though, in practice it is still very challenging to design analog multiplier to meet the very high speed of SerDes transceivers. To simplify the iteration method, the step ( is made very small and the gradient is replaced with the sign of e(k) and the sign of (k), which is

[

]

[

]

)

(

)

(

1

k

r

sign

k

e

sign

c

c

n

n

(

-

=

+

x

(7.9)

This approach will be very sensitive to data patterns and noise distribution if the coefficients are updated with baud rate. Therefore, instead of minimizing the instantaneous error, we minimize the expectation of the error. In practical SerDes receivers, the effectiveness of equalization is dependent on the sampling phase of the clock, and the recovered clock depends on equalization too. Therefore equalizer adaptation loop interacts with the CDR loop. To avoid the two loops fight each other, the bandwidth of the two loops must be much different. In practical design, the CDR loop has a much wider bandwidth than the equalizer adaptation loop. We can write the cost function as

min

)

(

)

(

)

(

~

2

0

-

-

=

=

N

n

n

c

n

k

r

k

a

E

error

(7.10)

For this reason, the adaptation algorithm is called least mean square (LMS). The algorithm according to equation (7.9) is called sign-sign LMS. LMS may be the most widely used algorithm in equalizer adaptation.

S

noise

a

k

h(t)

r(t)

n(t)

Minize [e(k)]

2

delay

S

S

y(t)

z(t)

c(k)

b(k)

(k)

y(k)

(k)

e(k)

+

+

+

-

+

-

-

Fig.7.8. A block diagram of adaptive decision feedback equalizer

A block diagram of adaptive DFE is shown in fig.7.8. The cost function is

min

)

(

)

(

~

~

)

(

2

1

-

-

-

=

=

N

n

n

c

n

k

a

a

k

z

E

error

(7.11)

Equation (7.11) shows that the DFE is mimicking the channel, since [1, c(1), c(2), ] is actually the normalized channel impulse response if the error is zero. However, in practice a normalization factor is priori unknown, or z(k) is not normalized. For this reason, practical design may add one more adaptive parameter. It may be a adaptive reference level or a variable gain.

8 Data and Clock Recovery

In high-speed SerDes transceivers where there are strong impairments, ISI is severe and eye is usually closed. After equalization, the impairments are compensated to a certain degree. ISI is reduced to an accept level and eye is open. Since most equalization schemes are aimed to eliminate the ISI at data centers and the system impulse response is at its maximum at those positions, it is desirable to sample the incoming waveform at the data centers because they have the maximum SNR. An abstract eye diagram is shown in Fig.8.1 [H.1]. A practical transition probability density curve may look quite different from the one shown in Fig.8.1.

t

u

t

r

t

h

t

m

T

2

Clock

Resample

Clock

Recovery

Data In

Data Out

Transition Probability

Density

Fig.8.1 An abstract eye diagram

As mentioned in section 2, there is no exclusively allocated clock signal in high-speed SerDes transceivers. However, a clock signal whose frequency is the baud rate and whose phase is aligned to the data centers is indispensable for correct data recovery. As shown in Fig.8.1, a nonlinear clock recovery circuit exacts clock information from the incoming data waveform, and uses this regenerated clock to resample the data waveform to recover the data.

8.1 Design Considerations for CDR

Since the signal power at clock frequency or the baud rate is essentially zero in the incoming data waveform, clock signal has to be generated locally in the receiving end through nonlinear methods. A microwave filter type clock recovery scheme or more commonly called nonlinear spectral line method usually incurs a nonlinear device such as a rectifier, and a bandpass filter [H.2]. The nonlinear device generates signal power at the clock frequency and the filter eliminates spectra at other frequencies. Nevertheless, it is hard to achieve a high-Q bandpass filter with wide tuning range in microelectronic circuits. A more commonly used method is to have a local oscillator to generate a clock signal. The phase error and frequency error between the local oscillation and the incoming data waveform can be detected. The detected errors are feed to control loops that are able to minimize those errors. This type of clock recovery scheme usually involves a phase locked loop. Paractical microelectronic imple

of 50/50
SerDes Transceivers for High-speed Serial Communications Dianyong Chen, Shoujun Wang, and Tad Kwasniewski 1 Introduction In order to process and redistribute digital information, data are constantly exchanged between different systems and also between different functional blocks inside a system. Serial communications and parallel communications currently and historically coexist and serve various requirements of intrasystem and intersystem data exchange. In parallel communications several symbols are sent at one time over a communications link, while in serial communications only one symbol is sent at one time. The choice of one method over another is usually a tradeoff on factors such as speed, cost of materials, power consumption, and difficulty of physical realization. In modern telecommunication systems and computer systems, paralell communications and serial communications are often used simulataneously. Therefore, it is an important task to serialize and deserialize data stream. 2 SerDes Transceiver for High-Speed Serial Communications In principle, parallel communications are instrinsically faster than serial communications, because the speed of a parallel data link is equal to the number of symbols sent in parallel times the symbol rate of each individual path; doubling the number of symbols sent at once doubles the data rate. For this reason, parallel commuications are widely used in internal buses of integrated circuits and short distance chip to chip links. However, contradicting to superficial instincts, parallel communications are being replaced by serial communications in high-speed data links. These links include chip to chip communications on backplanes, computer networks, computer peripheral buses, long- haul communications, and etc. A conventional reason to choose serial communications instead of parallel communications is cost. In long-haul communication systems, copper cables or optical fibers are expensive; doubling the number of cables or fibers doubles the cost. In chip to chip communications, paralell data links require more pins, which increases the cost of packaging. According to [1] packaging already represents 25% of the total system cost in some electronic products. Nevertheless, the main challenges that deprecate parallel communications in these applications are clock skew [2], [3], data skew, and crosstalk. Skew is the difference in arrival time of symbols transmitted at the same time. Symbols are basically electromagnetic pulses. Because no electromagentic wave can travel faster than the free space light, the time it takes for a signal to travel from the transmitter to the receiver is determined by the length of the electrical trace or optical trace and the group velocity of the signal. Although the difference of arrival time of signals along different pathes is usually very small, it can lead to considerable phase difference in high- speed data links, since the frequency is very high. For example, 1 centimeter path difference causes 240 degree phase difference for 10 GHz clock signals traveling with a velocity that is a half 1
Embed Size (px)
Recommended