294 • 2005 IEEE International Solid-State Circuits Conference 0-7803-8904-2/05/$20.00 ©2005 IEEE.

ISSCC 2005 / SESSION 16 / CLOCK DISTRIBUTION AND POWER MANAGEMENT / 16.2

16.2 A 90nm Variable-Frequency Clock System for a Power-Managed Itanium®-Family Processor

Tim Fischer, Ferd Anderson, Ben Patella, Sam Naffziger

Intel, Fort Collins, CO

The Montecito [1] processor contains two Itanium-family cores with Foxton [2] technology on a 1.7B-transistor die. The variable-frequency clock system (Fig. 16.2.1) consists of a single PLL [3] that generates a multiple M (6 ≤ M ≤ 20) of the system clock frequency, which is distributed to 14 digital frequency dividers (DFDs) for division to the proper zone frequency. Each DFD (Fig. 16.2.2) consists of a DLL and a state machine that dynamically selects among 64 DLL phases generated from the PLL clock. This allows the DFD output frequency to vary according to F_DFD = F_PLL / (1 + D/64), where 0 ≤ D ≤ 63, yielding a range of 1.0 to 0.504 F_PLL in period increments of 1/64th of the PLL cycle.
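The divider relation above can be checked numerically. The following is a minimal sketch, not the authors' implementation; the function name `dfd_frequency` and the constant `PHASES` are illustrative names introduced here, and only the formula and the range of D come from the text.

```python
PHASES = 64  # DLL phases per PLL cycle, as stated in the text


def dfd_frequency(f_pll_hz: float, d: int) -> float:
    """Return F_DFD = F_PLL / (1 + D/64) for a divisor setting D in 0..63."""
    if not 0 <= d <= PHASES - 1:
        raise ValueError("D must satisfy 0 <= D <= 63")
    return f_pll_hz / (1 + d / PHASES)


# Ratio F_DFD / F_PLL for every legal divisor setting.
ratios = [dfd_frequency(1.0, d) for d in range(PHASES)]

# The fastest setting (D = 0) passes the PLL frequency through unchanged;
# the slowest (D = 63) gives 64/127 of it, matching the 0.504 quoted above.
fastest, slowest = ratios[0], ratios[-1]
```

Because D enters through the period (T_DFD = T_PLL × (1 + D/64)), the period changes in uniform steps while the frequency steps shrink slightly as D grows.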

The clock zones comprise 2 cores, each with 3 DFDs; one 1GHz DFD for Foxton technology control; one DFD for each of 6 front-side bus (FSB) stripes; and one DFD for bus logic. Each DFD output clock is distributed to second-level clock buffers (SLCBs) for delay tuning to 1ps resolution via active deskew. In each core, 35 regional active deskew (RAD) phase comparators actively deskew neighboring SLCBs, yielding <10ps of skew across the 21.5mm by 27.7mm die [4]. SLCB clocks are distributed to 7536 clock vernier devices (CVDs) per core for local delay fine-tuning via scan. Gaters provide a final gain stage, power-saving enables, and pulse shaping for low latching overhead and skew compensation through transparency [5].

To configure the clock-system components, a “translation table” determines PLL, DFD, and aligner divisors from pin-selected system clock frequencies (200, 266, 333, 400MHz) and fuse-selected bus-logic and core clock frequencies. Fuses determine the core startup (“safe”) and limit frequencies. The clock system has two frequency modes: fixed and variable (FFM and VFM, respectively). The clock system starts in FFM and is then placed into VFM by firmware.

In FFM, 13 of 14 DFDs are frequency and phase aligned; the 14th is always a fixed 1GHz for Foxton-technology power-management algorithms. The 13 aligned DFDs have identical fixed divisors: 0 for maximum FFM frequency, and > 0 to achieve a “safe” startup frequency before entering VFM. On power-up or reset, DFD DLLs start and lock on the PLL clock autonomously. Once all 13 DLLs lock, DFD dividers start synchronously and remain phase/frequency locked to the PLL clock.

In FFM, the core, FSB, and bus logic clocks are aligned to the external system clock by a phase aligner system. This aligner adjusts DFD phase selection using up/down controls, sliding the phase around without changing frequency. At startup the aligner eliminates built-in core/FSB route mismatch [4] and aligns both to the system clock to within 20ps across PVT. DFD clock synthesis allows phase adjustment in uniform 1/64-cycle steps with virtually infinite range. In fact, an inversion error on first silicon in the bus logic clock tree (due to a logic-equivalence escape) is transparently corrected by the aligner at startup, with no added skew or functional impact, thanks to the adaptive design.
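As a worked example of the aligner's step size and wrap-around range, the sketch below computes the size of one 1/64-cycle step and shows why the range is effectively unbounded: the phase selection simply wraps modulo 64. The helper names are illustrative, and the sketch assumes the step is 1/64 of the period of the clock being adjusted (for example the 1.6GHz bus logic clock reported later in the text).

```python
PHASES = 64  # phase steps per cycle, from the text


def phase_step_ps(clock_hz: float) -> float:
    """Size of one 1/64-cycle phase step in picoseconds."""
    period_ps = 1e12 / clock_hz
    return period_ps / PHASES


def slide_phase(phase: int, steps: int) -> int:
    """Apply signed up/down steps; the phase selection wraps modulo 64."""
    return (phase + steps) % PHASES


# An inverted clock is half a cycle off, i.e. 32 of the 64 phase steps.
INVERSION_STEPS = PHASES // 2

step_at_1p6ghz = phase_step_ps(1.6e9)           # 625 ps period / 64
corrected = slide_phase(5, INVERSION_STEPS)     # slide by half a cycle
```

Under these assumptions one step at 1.6GHz is just under 10ps, and correcting an inversion is no different from any other alignment: the aligner keeps stepping until the phase comparator is satisfied.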

In VFM, core DFD frequency (F_CORE) dynamically tracks core voltage (V_CORE) via a programmed regional voltage detector (RVD) voltage-frequency (V-F) response in the voltage-to-frequency converter (VFC) loop. The RVD consists of a one-cycle delay-line with a programmable mix of RC and FET delay. This delay-line output is applied to a phase comparator to produce T_CORE adjust signals UP, DN (down) and DZ (deadzone). The DZ capability controls VFC loop stability. The RVD delay, its RC composition and the DZ width are all scan programmed at startup by hardware and system software.
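The RVD decision described above can be sketched behaviorally. This is a simplified model under stated assumptions, not the circuit: it assumes the comparator compares the delay-line delay against one clock period, that UP means "lengthen T_CORE" and DN means "shorten T_CORE", and that the dead zone is centered on zero error. The names `Request` and `rvd_request` are hypothetical.

```python
from enum import Enum


class Request(Enum):
    UP = "UP"  # assumed meaning: lengthen the clock period T_CORE
    DN = "DN"  # assumed meaning: shorten the clock period T_CORE
    DZ = "DZ"  # inside the dead zone: leave the period unchanged


def rvd_request(delay_ps: float, period_ps: float, deadzone_ps: float) -> Request:
    """Compare a one-cycle delay line against the current clock period.

    If the voltage-dependent delay exceeds the period by more than half the
    dead-zone width, the logic is running slower than the clock allows, so
    the model asks for a longer period; the opposite case asks for a shorter
    one. Small errors inside the dead zone produce no adjustment.
    """
    error_ps = delay_ps - period_ps
    if abs(error_ps) <= deadzone_ps / 2:
        return Request.DZ
    return Request.UP if error_ps > 0 else Request.DN


# Example: a 500 ps period with a 10 ps wide dead zone.
slow = rvd_request(delay_ps=520.0, period_ps=500.0, deadzone_ps=10.0)
fast = rvd_request(delay_ps=480.0, period_ps=500.0, deadzone_ps=10.0)
hold = rvd_request(delay_ps=503.0, period_ps=500.0, deadzone_ps=10.0)
```

In this model a wider dead zone suppresses adjustments for small errors, which is one way to read the statement that the DZ capability controls loop stability.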

High VFC bandwidth can track power-managed V_CORE modulation as well as high-frequency switching transients. A new frequency is selected with 1.5-cycle average response to a local voltage change event. This frequency change is distributed to latches in ~700ps [4] (Fig. 16.2.3). In each VFC cycle: 1) a DFD utility clock edge propagates 2400µm to an RVD; 2) a comparator produces an UP, DN, or DZ request; 3) the request routes 2400µm back to the DFD; 4) the PCSM arbitrates and resolves comparator metastability; and 5) the PCSM produces a divisor adjust that is set up to the next clock edge at the DFD. The high bandwidth greatly reduces CPU exposure to voltage-transient-induced timing issues, enabling F_CORE to track a voltage transient of up to 30mV/ns with 700ps of lead time on average.

In VFM, DFDs synthesize an F_CORE range of F_PLL to F_PLL/2 in 1.6% steps over a V_CORE range of 0.8V to 1.2V. DFDs receive inputs from 4 local RVDs and from other same-core DFDs. The DFD phase compensator state machine (PCSM) arbitrates RVD requests and same-core DFD inputs to derive local DFD divisor adjusts that (a) preserve intra-core DFD phase lock and (b) track the programmed V-F response. All DFDs start synchronously in safe mode using phase 0, and same-core DFD phase lock is maintained in VFM by PCSM arbitration.
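The quoted 1.6% step appears consistent with one phase step being 1/64 of the PLL period. The short calculation below, a sketch using illustrative names and the relation F_DFD = F_PLL / (1 + D/64) from earlier in the paper, shows the uniform period step and the resulting per-step frequency change at both ends of the range.

```python
PHASES = 64  # DLL phases per PLL cycle


def period_ratio(d: int) -> float:
    """T_DFD / T_PLL for divisor setting D."""
    return 1 + d / PHASES


def freq_step_percent(d: int) -> float:
    """Percentage drop in frequency when the divisor goes from D to D + 1."""
    f_now = 1 / period_ratio(d)
    f_next = 1 / period_ratio(d + 1)
    return 100 * (f_now - f_next) / f_now


# Every step adds the same amount to the period: 1/64 of the PLL period.
period_step_percent = 100 / PHASES  # about 1.56% of T_PLL

# The frequency change per step is largest near F_PLL and smallest near F_PLL/2.
first_step = freq_step_percent(0)
last_step = freq_step_percent(62)
```

So the period always moves in equal increments, while the relative frequency change per step falls from roughly 1.5% at D = 0 to under 1% near D = 63.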

Test features include: on-die clock shrink (ODCS) [6], clock-edge manipulation in the DFD; 4 self-calibrated salmon ladders for deterministic test trigger transport between clock domains; a 2-pin clock-observation test port.

Figure 16.2.4 shows the simulated VFC response to a V_CORE transient. The fast VFC response allows reduction of the frequency guard-banding normally used to ensure critical-path timing during supply transients. This increases VFM performance over FFM as a function of on-die supply noise (Fig. 16.2.5), which has been observed using an on-die power-measurement circuit [9] to be about 70mVpp. Figure 16.2.6 shows silicon waveforms of core and bus-logic clocks at startup in FFM and later in VFM: the bus logic voltage and frequency remain fixed at 1.2V/1.6GHz, while in this case the cores are running at 1.2V/2.14GHz.

Clock-system results on first silicon included full functionality of FFM, VFM, and all units described above. The clock system has been shown to operate at up to 2.5GHz at 1.2V, and enabled first silicon boot of Linux, HPUX, and Windows on multiple platforms with Foxton technology.

Acknowledgements:
The authors recognize the dedicated efforts of a talented and many-skilled team in designing, verifying, and debugging the Montecito clock system.

References:
[1] S. Naffziger et al., “The Implementation of a 2-core, Multi-Threaded 64b Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 10.1, pp. 182-183, Feb., 2005.
[2] C. Poirier et al., “Power and Temperature Control on a 90nm Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.7, pp. 304-305, Feb., 2005.
[3] K. Wong et al., “Cascaded PLL Design for a 90nm CMOS High-Performance Microprocessor,” ISSCC Dig. Tech. Papers, pp. 422-424, Feb., 2003.
[4] E. Fetzer et al., “Clock Distribution on a Dual-Core Multi-Threaded Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.1, pp. 292-293, Feb., 2005.
[5] S. Naffziger et al., “The Implementation of the Itanium2 Microprocessor,” IEEE J. Solid-State Circuits, pp. 1448-1459, Nov., 2002.
[6] S. Tam et al., “Clock Generation and Distribution for the First IA64 Processor,” IEEE J. Solid-State Circuits, Nov., 2000.
[7] F. Anderson et al., “The Core Clock System for the Next Generation Itanium Processor,” ISSCC Dig. Tech. Papers, pp. 146-148, Feb., 2002.
[8] S. Tam et al., “Clock Generation and Distribution for the Madison Processor,” DTTC2002.
[9] E. Alon et al., “Circuits and Techniques for High-Resolution Measurement of On-Chip Power Supply Noise,” Symp. VLSI Circuits, pp. 102-105, June, 2004.


ISSCC 2005 / February 8, 2005 / Salon 8 / 2:00 PM

Figure 16.2.1: Montecito clock system topology. [Block diagram: bus clock and I/Os, PLL with 1/M and 1/N dividers, frequency translation table (pins and fuses in; M, N and DFD divisors out), DFDs for Foxton, bus logic, Core0 and Core1, matched input routes, SLCBs, RADs, CVDs, gaters, RVDs, balanced binary-tree core clock distribution, fixed-supply and variable-supply domains.]

Figure 16.2.2: DFD / PCSM block diagram. [Blocks: PLL clock differential input; 16-phase DLL and interpolation producing 64 phases; state machine; PCSM with period adjust +2 to -1; startup control; RVD up/down requests; to/from same-core PCSMs; ODCS control; scan and triggers; divide-by-2 stages; full-frequency differential “utility” clock routes to clock system; ½-frequency quadrature differential clock routes to SLCBs.]

Figure 16.2.3: Voltage-to-frequency converter loop (VFC). [Blocks: CLKPLL, digital frequency divider (DFD), phase compensator state machine (PCSM), regional voltage detectors RVD[3:0] with VDD input and delay, DFDCLKOUT, clock distribution to latches, frequency adjust info to/from other DFD zones. VFC response timing: compare delayed clock to next cycle's DFD clock; RVD UP/DN request to PCSM; local DFD zone selects new frequency; other DFD zones select new frequency; VDD-to-clock adjust delay < 2 cycles.]

Figure 16.2.4: VFM supply transient tracking response (simulated). [Axes: supply voltage (mV) and frequency (MHz) versus simulation time (ns). Traces: Vcore, RVD0 (in DFD zone 0), RVD8 (in DFD zone 2), DFD zone 0 response, DFD zone 2 response.]

Figure 16.2.5: VFM performance versus supply noise. [a) Simulated VFM performance increase (%) versus supply noise (mV), average (50 MHz noise) and peak. b) Time-domain waveform measurements of on-die Vcore supply noise at 1.4GHz, 1.2V, high CPU activity factor; Vcore versus time (ps). c) Power spectral density of the same; PSD (dBV) versus frequency (GHz), for 50% activity and power virus.]

Figure 16.2.6: FFM/VFM core/bus clock oscilloscope traces. [6a. FFM, 1.2V: core clock 1.6GHz, bus logic clock 1.6GHz. 6b. VFM, 1.2V: core clock 2.14GHz, bus logic clock 1.6GHz.]


Figure 16.2.7: Die micrograph.

[Die dimensions: 21.5 mm by 27.7 mm. Labeled regions: PLL / translation table / clock control; FSB DFDs; Foxton controller DFD; core DFDs; RVDs; Core 0; Core 1; bus logic DFD.]

