
68 • 2012 IEEE International Solid-State Circuits Conference

ISSCC 2012 / SESSION 3 / PROCESSORS / 3.7

3.7 Resonant Clock Design for a Power-Efficient High-Volume x86-64 Microprocessor

Visvesh Sathe¹, Srikanth Arekapudi², Charles Ouyang², Marios Papaefthymiou³,⁴, Alexander Ishii³, Samuel Naffziger¹

¹AMD, Fort Collins, CO; ²AMD, Sunnyvale, CA; ³Cyclos Semiconductor, Berkeley, CA; ⁴University of Michigan, Ann Arbor, MI

AMD’s 4+ GHz x86-64 core codenamed “Piledriver” employs resonant clocking [1-4] to reduce clock distribution power by up to 24% while maintaining a low clock-skew target. To support testability and robust operation across the wide range of operating frequencies required of a commercial processor, the clock system operates in two modes: direct-drive (cclk) and resonant (rclk). Leveraging favorable factors such as the availability of two thick top-level metals, high operating frequency, clock-load density, and the existing clock-design methodology [5], the rclk mode was designed to enable both reduced average power dissipation and improved peak-power-constrained performance, with minimal area impact. This work represents a volume production-enabled implementation of resonant clock technology, and is the plan of record for mid-2012 product offerings.

Rclk reduces power by recycling charge through LC resonance, which in turn enables further power reduction by allowing a weaker clock driver. Figure 3.7.1 shows a simplified schematic of the dual-mode clock system. The mode switch MSw is closed in rclk mode and open in cclk mode. The clock driver features a pulse-drive mode for additional efficiency improvement through duty cycle control of the pull-up and pull-down switches. TSw is a throttle switch employed to reduce voltage overshoot when MSw is turned off during frequency changes.

To operate in both modes, the clock driver needs to support frequency-dependent drive-strength and pulse modulation, both of which are efficiently implemented using a split-buffer topology. In rclk mode, drive strength is modulated through drvEn settings during P-state transitions. Pulse drive is used to enable a finer trade-off between conduction and switching losses in the driver. A local delay line delays only the asserting edges of the pull-up/down stage during pulse drive (plsEn = 1), whereas the respective de-asserting edges are triggered by the non-delayed clock. Thus, the driver output duty cycle is obtained by programming the local delay to modulate the input duty cycle. This pulse-shaping method has three advantages: 1) enabling PLL duty cycle control of the clock to tune performance; 2) guaranteeing robust clock slew and amplitude when operating off the V-f curve; and 3) reducing the susceptibility of rclk skew to process variation in the low-delay local delay chains.
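The pulse-shaping rule described above (delay only the asserting edge, leave the de-asserting edge on the non-delayed clock) can be illustrated with a small timing sketch. This is an idealized model under stated assumptions, not the circuit's actual behavior: the function name `pulse_drive_window`, the picosecond units, and the delay values are all hypothetical, and edge slopes and gate delays are ignored.

```python
def pulse_drive_window(period_ps, input_duty, local_delay_ps):
    """Return (on_time_ps, output_duty) for one pull-up or pull-down stage.

    Idealized model (an assumption): the asserting edge is delayed by
    local_delay_ps while the de-asserting edge follows the non-delayed
    clock, so the on-time shrinks by exactly the programmed delay.
    """
    if not 0.0 < input_duty < 1.0:
        raise ValueError("input_duty must be strictly between 0 and 1")
    if local_delay_ps < 0.0:
        raise ValueError("local_delay_ps must be non-negative")
    input_on_time_ps = input_duty * period_ps
    # A delay longer than the input pulse suppresses the pulse entirely.
    on_time_ps = max(0.0, input_on_time_ps - local_delay_ps)
    return on_time_ps, on_time_ps / period_ps


if __name__ == "__main__":
    period_ps = 1e12 / 4.0e9  # 250 ps clock period at 4 GHz
    for delay_ps in (0.0, 25.0, 50.0):  # hypothetical delay settings
        on_time, duty = pulse_drive_window(period_ps, 0.5, delay_ps)
        print(f"delay={delay_ps:5.1f} ps  on-time={on_time:6.1f} ps  duty={duty:.2f}")
```

In this model the output duty cycle depends on both the input duty cycle and the programmed delay, which is why adjusting the PLL duty cycle still changes the driver output.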

Figure 3.7.2 shows the Piledriver global clock construction, in which a set of five horizontal folded clock trees (HCK trees) drive a global clock grid [5]. Each HCK tree has up to 25 inductors interleaved with clock drivers. The clock mode and frequency-dependent clock parameter settings (inductor connection, drive strength, pulse width) are adjusted during power-up and at each P-state transition, during which time the clock mode parameters are initialized through a P-state-indexed fuse table. The power reduction achieved from rclk in each P-state is accounted for by the power management unit. The clock mode parameters are loaded by a sequencer in the transmit block, which distributes them to the HCK trees through a source-synchronous bus inside the vertical clock tree module. Once received by the HCK trees, these parameters are broadcast to all clock drivers within each HCK tree. To avoid a circular dependence between the global clock and the logic used to program the clock, all programming logic in the HCK trees is clocked by a broadly distributed intermediate stage of the clock tree. Existing clock gating mechanisms are leveraged to prevent the exposure of timing elements in the CPU to transitional clocks.
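The parameter flow just described (a P-state-indexed table whose entry is broadcast to every driver in every HCK tree) might be modeled as below. This is only a sketch: the field names, the table contents, the drive-strength encoding in eighths, and the driver count per tree are hypothetical and are not taken from the actual fuse layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockModeParams:
    resonant: bool          # True: MSw closed (rclk); False: MSw open (cclk)
    drive_eighths: int      # hypothetical drvEn encoding, in eighths of full strength
    pulse_enable: bool      # corresponds to plsEn
    pulse_delay_code: int   # hypothetical local delay-line setting


# Hypothetical fuse table; real entries would come from characterization.
FUSE_TABLE = {
    0: ClockModeParams(resonant=True, drive_eighths=3, pulse_enable=True, pulse_delay_code=2),
    1: ClockModeParams(resonant=True, drive_eighths=4, pulse_enable=True, pulse_delay_code=1),
    2: ClockModeParams(resonant=False, drive_eighths=8, pulse_enable=False, pulse_delay_code=0),
}


def params_for_pstate(pstate):
    """Look up the clock-mode parameters for a P-state index."""
    try:
        return FUSE_TABLE[pstate]
    except KeyError:
        raise ValueError(f"no fuse entry for P-state {pstate}") from None


def broadcast(params, hck_trees=5, drivers_per_tree=4):
    """Give every driver in every HCK tree the same parameter set."""
    return [[params for _ in range(drivers_per_tree)] for _ in range(hck_trees)]


if __name__ == "__main__":
    settings = broadcast(params_for_pstate(0))
    print(len(settings), "trees,", len(settings[0]), "drivers each:", settings[0][0])
```

Keeping the entries immutable mirrors the idea that the values are fixed per P-state and only selected, never computed, during a transition.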

Building inductors with a good quality factor Q is critical to rclk efficiency, and is constrained by several factors. The inductor windings have to be designed to share metal resources on the top two metal layers (M10 and M11) with dense power distribution. Moreover, they must accommodate a substantial number of pre-clock distribution nets and global nets that are routed through, as well as under, the inductor. Figure 3.7.3 illustrates inductor design under these constraints. At the frequencies of interest, Q is dominated by winding resistance. The inductor was therefore designed using M10 and M11, with cut-aways to allow maximal use of the metal layers in the presence of routes and power-supply trunks. Inductor placement was directed so that power-supply trunks pass through the middle of the inductor, minimizing the impact of inductive coupling. Effectively utilizing hitherto unused top-level metal resources in inductor design helped avoid adverse IR impact. The power grid under the inductor was designed to be “loop-less” to mitigate Q degradation resulting from eddy losses, while maintaining a robust grid. Five different inductors were built in the 0.6-to-1.3nH range, for selection based on local clock loading. At 4GHz, the inductor Q factors achieved were in the range of 3.5-3.8.
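To give a feel for the numbers above, the following sketch applies the standard lumped relations f0 = 1/(2π√(LC)) and Q = ωL/R to the quoted inductance and Q ranges. Several things here are assumptions rather than facts from the paper: a single lumped capacitance per inductor, a pure series-RL winding model, and the use of 3.3GHz as an illustrative target resonance.

```python
import math


def resonant_frequency(inductance_h, capacitance_f):
    """Natural frequency of an ideal LC tank, in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def resonant_capacitance(inductance_h, freq_hz):
    """Capacitance that resonates with the given inductance at freq_hz."""
    return 1.0 / ((2.0 * math.pi * freq_hz) ** 2 * inductance_h)


def series_resistance(inductance_h, freq_hz, q):
    """Series resistance implied by Q = omega * L / R (series-RL assumption)."""
    return 2.0 * math.pi * freq_hz * inductance_h / q


if __name__ == "__main__":
    target_hz = 3.3e9  # illustrative target, not a stated design value
    for l_nh in (0.6, 1.3):  # ends of the quoted inductance range
        l_h = l_nh * 1e-9
        c_pf = resonant_capacitance(l_h, target_hz) * 1e12
        r_lo = series_resistance(l_h, 4.0e9, 3.8)
        r_hi = series_resistance(l_h, 4.0e9, 3.5)
        print(f"L={l_nh} nH: C~{c_pf:.2f} pF at 3.3 GHz, "
              f"R~{r_lo:.1f}-{r_hi:.1f} ohm at 4 GHz")
```

Under these assumptions a larger inductor pairs with a smaller local load, which is consistent with selecting among the five inductor values based on local clock loading.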

Figure 3.7.4 shows the structures required to support rclk (MSw, inductor, TankCap) that are tiled across the HCK tree. MSw connects the inductor to the clock grid through the Driver-MSw shorting-bar. Skew was controlled by using an LP formulation to perform inductor allocation on the grid, and through interleaved driver/inductor placement. For each inductor, MSw size was tuned to trade off reduced switch resistance against the increased switch parasitic capacitance that results from larger switches. For efficient rclk operation, a large, low-ESR TankCap is required within a limited allocated area. To that end, a capacitor structure of approximately six times the average clock load was implemented using both metal and gate structures.
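One way to see why a large TankCap matters is to treat it as sitting in series with the inductor and the clock load in the resonant loop, so that a finite TankCap lowers the effective capacitance and raises the natural frequency. The sketch below works through that arithmetic for the six-times ratio quoted above. The series-loop model is an assumption made for illustration (lossless, lumped elements), and the function names are hypothetical.

```python
import math


def effective_capacitance(c_clk, c_tank):
    """Series combination of clock load and TankCap (assumed loop model)."""
    return c_clk * c_tank / (c_clk + c_tank)


def resonance_shift(tank_ratio):
    """Ratio of f0 with a finite TankCap to f0 with an ideal (infinite) one.

    tank_ratio is c_tank / c_clk. Since f0 scales as 1/sqrt(C), the shift
    is sqrt(c_clk / c_eff) = sqrt((1 + tank_ratio) / tank_ratio).
    """
    if tank_ratio <= 0.0:
        raise ValueError("tank_ratio must be positive")
    return math.sqrt((1.0 + tank_ratio) / tank_ratio)


if __name__ == "__main__":
    for ratio in (1.0, 3.0, 6.0):
        c_eff = effective_capacitance(1.0, ratio)  # normalized to c_clk = 1
        print(f"TankCap = {ratio:.0f}x load: C_eff = {c_eff:.3f} x load, "
              f"f0 shift = {resonance_shift(ratio):.3f}x")
```

In this simplified picture a six-times TankCap leaves about 6/7 of the clock load as effective capacitance, so the resonance moves by roughly 8% relative to an ideal tank.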

Figure 3.7.5 shows measured Cac (defined as Cac = Pdynamic / (V²·f)) savings and efficiency numbers, based on power dissipation in the clock drivers and grid, in cclk and rclk modes. A high-switching-activity test pattern was used for the clock power measurement. Efficiency increases up to 3.3GHz, and declines more gradually at higher frequencies. The inherent asymmetry in energy efficiency on either side of the resonant frequency is increased due to a voltage-dependent Q (from the series-connected MSw) and a stringent clock slew criterion that requires a stronger drive at lower frequencies. Full-chip simulation analysis showed a 1ps increase in rclk skew compared to cclk.
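The Cac metric defined above normalizes dynamic power by V² and f, so that two clock modes can be compared as an equivalent switched capacitance. A minimal sketch of that calculation follows; the power values in the example are made-up placeholders, not measurements from the paper, and the helper names are hypothetical.

```python
def cac_farads(p_dynamic_w, vdd_v, freq_hz):
    """Equivalent switched capacitance: Cac = P_dynamic / (V^2 * f)."""
    if vdd_v <= 0.0 or freq_hz <= 0.0:
        raise ValueError("voltage and frequency must be positive")
    return p_dynamic_w / (vdd_v ** 2 * freq_hz)


def cac_savings_fraction(cac_cclk, cac_rclk):
    """Fractional Cac reduction of rclk relative to cclk."""
    return 1.0 - cac_rclk / cac_cclk


if __name__ == "__main__":
    vdd_v, freq_hz = 1.2, 4.0e9
    # Placeholder clock powers for illustration only.
    cac_cclk = cac_farads(1.0, vdd_v, freq_hz)
    cac_rclk = cac_farads(0.8, vdd_v, freq_hz)
    print(f"cclk Cac = {cac_cclk * 1e12:.1f} pF")
    print(f"rclk Cac = {cac_rclk * 1e12:.1f} pF")
    print(f"savings  = {cac_savings_fraction(cac_cclk, cac_rclk):.0%}")
```

Because V and f cancel when both modes are measured at the same operating point, the Cac savings fraction equals the power savings fraction at that point.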

Figure 3.7.6 shows cclk and rclk waveforms with different drive strength configurations from a full-chip clock simulation at 1.2V, 4.25GHz. The rclk_3/8 mode uses clock drivers that are 3/8 of the clock driver strength in cclk mode. Reducing clock driver strength in rclk enables greater Cac savings at the expense of reduced clock slew rates. These reduced slews result in increased cross-over current in the clock receivers. Measurements, however, indicate a negligible change in efficiency for high-activity workloads as compared to idle workloads, indicating that this effect is small. Reduced slew also causes a push-out in the 50% arrival time of the clock, potentially affecting both gater-enable paths and cross-clock domain communication. Static timing analysis with degraded slews was run on the core, and the resulting paths were fixed.

Figure 3.7.7 shows the microphotograph of the Piledriver core. Over the frequency range 3.2-to-4.4GHz, the power savings from rclk enable either a frequency increase of about 100 MHz for the same power, or a power reduction of 5-10% for the same frequency.

Acknowledgements:
The authors thank Tom Meneghini, Kyle Viau, Manivannan Bhoopathy, Joohee Kim, Jerry Kao, Fred Brauchler, Alan Arakawa, Syed Obaidulla, Kevin Hurd, Vasant Palisetti, and Denny Renfrow for their valuable contribution to this work.

References:
[1] A.J. Drake, et al., “Resonant Clocking using Distributed Parallel Capacitance,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1520-1528, 2004.
[2] V.S. Sathe, et al., “Resonant Clock Latch-Based Design,” IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 864-873, 2008.
[3] S.C. Chan, et al., “A Resonant Global Clock Distribution for the Cell Broadband Engine Processor,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 64-72, 2009.
[4] A. Ishii, et al., “A Resonant Clock 200MHz ARM926EJ-S™ Microcontroller,” European Solid-State Circuits Conf., pp. 356-359, 2009.
[5] H. McIntyre, et al., “Design of the Two-core x86-64 AMD ‘Bulldozer’ Module in 32 nm SOI CMOS,” IEEE J. Solid-State Circuits, 2012.

978-1-4673-0377-4/12/$31.00 ©2012 IEEE


ISSCC 2012 / February 20, 2012 / 4:45 PM

Figure 3.7.1: Simplified model of AMD’s “Piledriver” dual-mode global clock network.

Figure 3.7.2: Global-clock organization and distribution. A folded clock tree (VCK tree) drives 5 horizontal folded clock trees (HCK trees).

Figure 3.7.3: Inductor design on the top two metal layers with cut-aways to accommodate power straps and global signal routes.

Figure 3.7.4: Relative placement of rclk components within a repeated HCK tree section.

Figure 3.7.5: Measured Cac (pF) savings and clock efficiency vs. frequency. Peak efficiency is observed at 3.3GHz in square-mode.

Figure 3.7.6: Simulated cclk and rclk waveforms at 1.2V, 4.25GHz.


ISSCC 2012 PAPER CONTINUATIONS

Figure 3.7.7: Chip microphotograph of the 32nm AMD “Piledriver” core.

