
182 • 2005 IEEE International Solid-State Circuits Conference 0-7803-8904-2/05/$20.00 ©2005 IEEE.

ISSCC 2005 / SESSION 10 / MICROPROCESSORS AND SIGNAL PROCESSING / 10.1

10.1 The Implementation of a 2-core Multi-Threaded Itanium®-Family Processor

Samuel Naffziger, Blaine Stackhouse, Tom Grutkowski

Intel, Fort Collins, CO

The next generation in the Itanium® processor family, codenamed Montecito, is introduced. The processor has two dual-threaded cores integrated on die with 26.5MB of cache in a 90nm process with 7 layers of copper interconnect. The die is 21.5mm by 27.7mm and includes 1.72B transistors. With both cores at full frequency, it consumes 100W. The micro-architecture and circuit methodologies are leveraged from the prior Itanium2 processors[2]. Improvements include the integration of 2 cores on-die, each with a dedicated 12MB 3rd-level cache, a 1MB 2nd-level cache and dual-threading. Susceptibility to soft errors is also reduced and power efficiency improved through low-power techniques and active power management.

Multi-threading addresses the memory latency issue in large SMP systems. The Montecito style of multi-threading is referred to as temporal multi-threading (TMT), where the thread switch is triggered by long-latency events such as a 3rd-level cache miss. Duplicating all architectural state and some micro-architectural state creates the two logical processors that then share all the execution resources and the large caches. To the operating system, each Montecito die looks like 4 processors. The thread-switch logic and state are implemented without a cycle-time impact. At any given time, only one thread is active in the pipeline of each core and consumes all the physical execution resources while the caches continue to service both threads. The performance benefits of TMT on database workloads are estimated to range from 15 to 35% depending on memory latency. The integer and floating-point register files implement a duplicate set of registers in an area-efficient manner with a 15% area impact[2]. Other architectural state that must be duplicated requires dual-threaded latches (Fig. 10.1.5). The changes for TMT result in less than a 2% area growth in the core, providing a very power- and area-efficient performance increase for SMP systems.
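The switch-on-event behavior described above can be illustrated with a toy model. The sketch below is not the Montecito thread-switch logic. It assumes that every thread alternates a fixed compute burst with a fixed long-latency miss, and the function name, burst length, miss latency and switch penalty are hypothetical parameters chosen only for illustration.

```python
def core_utilization(num_threads, burst, miss_latency, switch_penalty, total_cycles):
    """Fraction of cycles in which one core does useful work (toy model).

    Each thread runs `burst` cycles, then waits `miss_latency` cycles for a
    long-latency miss. Only one thread occupies the pipeline at a time.
    """
    ready_at = [0] * num_threads  # cycle at which each thread's pending miss returns
    active, cycle, busy = 0, 0, 0
    while cycle < total_cycles:
        if ready_at[active] > cycle:
            # Active thread is stalled: look for another thread that is ready.
            others = [t for t in range(num_threads)
                      if t != active and ready_at[t] <= cycle]
            if others:
                active = others[0]
                cycle += switch_penalty  # hypothetical cost of a thread switch
            else:
                cycle += 1               # no thread is ready, the core idles
            continue
        # Run one compute burst (clipped at the end of the simulation),
        # then record when the miss that ends the burst will return.
        run = min(burst, total_cycles - cycle)
        busy += run
        cycle += run
        ready_at[active] = cycle + miss_latency
    return busy / total_cycles


single = core_utilization(num_threads=1, burst=100, miss_latency=100,
                          switch_penalty=0, total_cycles=10_000)
dual = core_utilization(num_threads=2, burst=100, miss_latency=100,
                        switch_penalty=15, total_cycles=10_000)
```

In this model the second thread fills cycles that a single thread would spend waiting on memory, which is the effect TMT relies on. The size of the gain depends entirely on the assumed burst and latency values and should not be read as a prediction of the 15 to 35% range quoted above.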

Soft errors caused by high-energy particles negatively affect system reliability, resulting in customer downtime, which is not acceptable for servers. Montecito provides several key enhancements over the previous Itanium processors. First, the parity protection of the 2nd-level data-cache tag structures is enhanced with ECC protection. This feature allows the transparent fixing of 86% of bits that would otherwise cause the processor to crash on a detected error. Second, the large L3 caches provide support for automatic disabling of individual cache lines that suffer from repeated errors. This is achieved by the combination of hardware and firmware support. Third, parity protection is added for the large register file structures (2 sets of 128 dual-threaded registers), with a low-overhead approach [2].
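The difference between parity (detect only) and ECC (detect and correct) can be shown on a small word. The paper does not state which code protects the cache tags, so the sketch below uses the textbook Hamming(7,4) single-error-correcting code purely as an illustration; all function names are hypothetical.

```python
def even_parity(bits):
    """Parity bit that makes the total number of 1s even. Detects a single flip."""
    p = 0
    for b in bits:
        p ^= b
    return p


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 when no bit flipped."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome equals the 1-based flipped position
    if position:
        c[position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position
```

With parity alone, a flipped bit is noticed but cannot be located, so the hardware can only signal an error. With the syndrome above, a single flipped bit is located and repaired transparently, which is the kind of behavior the ECC enhancement provides.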

Power reduction is targeted as a primary design priority. At issue is the fact that the last 3 generations of Itanium processors all had the same power consumption (130W) for a single-core processor despite process scaling. For Montecito to provide competitive database performance, 2 processor cores are integrated on-die and the cache is grown. Scaling indicates that this would consume nearly 300W in 90nm at the 2.0+GHz target frequency, which is clearly unacceptable. To reduce power while also providing a new level of power manageability, several capabilities are needed:

Power in should be managed in the processor, not just temperature (~Power out). Power in is a major driver of system and data center operating costs. Power out must be managed also, as it affects heat-sink design and form-factor issues, but it is only half the problem.

The processor should be able to manage its power to a dynamically adjustable limit. This provides flexibility in system design and data-center management. The processor should also adapt its operating point for each power setting to extract the maximum performance/watt possible. This means maximizing frequency as a function of voltage.

In order to reduce power and provide the above capabilities, the Montecito processor applies the full range of traditional CMOS power-reduction techniques to the core design inherited from the previous generation. These include reducing leakage through long-L and high-Vt device insertion, gating off clocks, and reducing FET width in over-designed circuits. The result is a power reduction of 28% for integer applications.

Montecito also targets the three main sources of power variability for removal: the difference between average application power and the maximum (1.4X at the chip level for Itanium2); the variability in leakage due to silicon processing, which is at least 2X of a nominal 25% power component; and voltage variability due to regulator and package variations and transients, which are ~10% of nominal.

Measuring power in required the invention of several new technologies. An ammeter is integrated into the chip and package to continuously monitor the power consumed by the die[3]. This integration not only ensures good accuracy (~3%), but also dovetails with existing manufacturing test flows and eliminates the dependence on external components. Next, an autonomous microcontroller is implemented on-die to enable the processor to flexibly manage the voltage of the chip in response to changing power consumption and thermal conditions[3]. Finally, the core design adapts its frequency to continuously varying voltage, which in turn requires several capabilities. First, an asynchronous interface between cores and system bus, where one cycle of meta-stability resolution time is achieved using a high-gain resolver latch (Fig. 10.1.6). Second, a frequency synthesizer that is repeatable (manufacturing / test needs) by being as digital as possible, has a very low frequency-change latency (1 cycle) for adapting to voltage transients, enables fine-grained frequency changes (1/64th of Tmin), and has good range and low jitter[5]. The final capability is a core that robustly operates across a broad range of frequencies and voltages. This required hold-time analysis at two voltage points and checks for low-voltage circuit operation, as well as an active clock deskew system[4] that continuously maintains clock alignment within 10ps across each core as voltage and temperature vary.
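The interaction between power measurement, voltage management and voltage-tracking frequency can be sketched as a simple control loop. This is only an illustration under stated assumptions, not the on-die microcontroller's algorithm: the power model stands in for the ammeter reading, and the linear F(V) relation, the model coefficients, the voltage step and all names are hypothetical. Only the 0.8 to 1.2V core voltage range and the 100W limit come from the paper.

```python
V_MIN, V_MAX = 0.8, 1.2   # core voltage range given in the chip statistics
V_STEP = 0.0125           # hypothetical regulator step in volts


def frequency_ghz(v):
    """Hypothetical linear F(V): 2.0 GHz at 1.1 V."""
    return 2.0 * v / 1.1


def chip_power_w(v, activity):
    """Hypothetical stand-in for the measured power.

    Dynamic power scales as activity * V^2 * F, plus a fixed leakage term.
    """
    dynamic = 47.5 * activity * v * v * frequency_ghz(v)
    leakage = 25.0
    return dynamic + leakage


def settle_voltage(power_limit_w, activity, v=1.1, iterations=100):
    """Step the voltage until power sits just under the limit."""
    for _ in range(iterations):
        if chip_power_w(v, activity) > power_limit_w:
            if v - V_STEP >= V_MIN:
                v -= V_STEP   # over the limit: lower voltage, frequency follows
        elif (v + V_STEP <= V_MAX
              and chip_power_w(v + V_STEP, activity) <= power_limit_w):
            v += V_STEP       # headroom available: raise voltage and frequency
    return v


v_high_activity = settle_voltage(100.0, activity=1.0)
v_low_activity = settle_voltage(100.0, activity=0.6)
```

Under these assumptions a low-activity workload settles at a higher voltage, and therefore a higher frequency, than a high-activity one while both stay within the same power limit.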

Acknowledgements:
Many thanks to a talented and dedicated team from Intel for making this ambitious design a reality.

References:
[1] S. Naffziger et al., “The Implementation of the Itanium2 Microprocessor,” IEEE J. of Solid-State Circuits, vol. 37, no. 11, pp. 1448-1460, Nov. 2002.
[2] E. Fetzer et al., “The Multi-Threaded, Parity-Protected, 128-Word Register Files on a Dual-Core Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 20.5, pp. 382-383, Feb. 2005.
[3] C. Poirier et al., “Power and Temperature Control on a 90nm Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.7, pp. 304-305, Feb. 2005.
[4] E. Fetzer et al., “Clock Distribution on a Dual-core, Multi-threaded Itanium® Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.1, pp. 292-293, Feb. 2005.
[5] T. Fischer et al., “A 90nm Variable-Frequency Clock System for a Power-Managed Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.2, pp. 294-295, Feb. 2005.


ISSCC 2005 / February 8, 2005 / Salon 8 / 8:30 AM

Figure 10.1.1: Chip power consumption.

Figure 10.1.2: Chip statistics at nominal operating point.

Figure 10.1.3: Power reduction.

Figure 10.1.4: Power management.

Figure 10.1.5: Dual threaded latch.

Figure 10.1.6: Clock domain crossing resolver latch.

Power-domain diagram labels: Core 0 and Core 1, frequency tracks voltage, 40W each; two 12MB L3 caches, asynchronous operation, 2.5W each; Foxton control, 1GHz fixed frequency, 250mW; bus arbiter, fixed frequency, 2.5W; bus I/O, fixed frequency, 8W total.

Domain | Voltage | Frequency | Power
Cores | 0.8-1.2V | Tracks voltage | 80W total
Cache | 0.9-1V | Self-timed | 5W total
Fixed | 1.15V | Fixed | 6W total
IO term | 1.2V | Bus | 8W total

Block | FET count | Low Vt
Core logic | 57M | 1.7%
Core caches | 106.5M | 0
L3 cache | 1550M | 0
Bus logic & IO | 6.7M | 0.3%
Total | 1.72G | 0.06%

Chip Power Breakdown: core power 81%, L3 cache 5%, bus/IO logic 6%, IO termination 6%, variation/GB 2%.

Core Power Breakdown: logic switching 50%, leakage (avg.) 25%, final clock buffer 20%, clock distribution 5%.

Legacy Chip Frequency vs. Adaptive Chip Frequency (plot of Vcc(t), F(V(t)), Favg(t) and Fmin(t); annotation: 5-10%). Today: minimum Vcc(t) determines maximum (and avg.) frequency. Montecito: average Vcc(t) determines average frequency. F(V(t)) is produced by voltage detectors in conjunction with digital frequency synthesizers to provide frequency-change response to a voltage transient in 1.5 cycles on average.
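The contrast between a frequency set by the minimum Vcc(t) and one that follows the average Vcc(t) can be made concrete with a short numerical sketch. The voltage trace (a sinusoidal ripple of about 5% around 1.1V), the linear F(V) relation and all names below are hypothetical; they are chosen only to show the shape of the argument.

```python
import math


def f_of_v(v, f_nominal_ghz=2.0, v_nominal=1.1):
    """Hypothetical linear F(V): frequency proportional to supply voltage."""
    return f_nominal_ghz * v / v_nominal


def legacy_and_adaptive(vcc_trace):
    """Return (legacy, adaptive) average frequency in GHz for a voltage trace."""
    # Legacy: one fixed frequency that must be safe at the lowest voltage seen.
    f_legacy = f_of_v(min(vcc_trace))
    # Adaptive: frequency follows the voltage sample by sample.
    f_adaptive = sum(f_of_v(v) for v in vcc_trace) / len(vcc_trace)
    return f_legacy, f_adaptive


# Hypothetical supply: 1.1 V with a +/-0.055 V ripple, 20 full periods.
trace = [1.1 + 0.055 * math.sin(2 * math.pi * i / 50) for i in range(1000)]
f_legacy, f_adaptive = legacy_and_adaptive(trace)
gain = f_adaptive / f_legacy - 1.0
```

With a linear F(V), the adaptive average equals the frequency at the average voltage, so the gain is set by how far the voltage minimum sits below the mean.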

Power Consumption Profile (plot; axes: power in watts, 0 to 200; activity factor, 0.4 to 1.0; GHz = F(V), 1.6 to 2.4). Chip power consumption vs. application activity factor and frequency, assuming the max frequency possible for a given voltage. Profile varies with temperature and process. Applications vary in activity factor from 0.4 to 1.0; average integer applications are <= 0.60. With power measurement and fine-grained frequency change as a function of voltage, max power can be reduced from 140W to 100W with no performance impact to most applications, or equivalently, frequency can be increased ~12%.
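The equivalence between a 140W to 100W power reduction and a ~12% frequency increase can be checked with the cubic relation P ~ F^3 quoted with the power-reduction figure. The sketch below assumes that relation holds exactly, which is an idealization; the variable names are illustrative.

```python
p_max_w = 140.0     # max power without power management, from the figure text
p_limit_w = 100.0   # managed power limit, from the figure text

# If P ~ F^3, a power budget ratio r buys a frequency ratio of r^(1/3).
frequency_gain = (p_max_w / p_limit_w) ** (1.0 / 3.0) - 1.0
```

The result is a little under 0.12, consistent with the ~12% stated in the figure.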

Power-reduction bar chart (axis 0 to 350; bars: Simple Port of Madison, Montecito Final; stacked components: Cache Switch, Cache Igate, Cache Ioff, Core Switch, Core Igate, Core Ioff, IO bias, DCAP lkg).

Feature | Frequency gain @ 100W
Manage junction temperature to the minimum possible | 3%
Adapt Vcc to optimal value for each part | 5%
Optimize circuits for low switching and low leakage | 11%
Adapt frequency to Vcc to operate at frequency(Vavg) vs. frequency(Vmin) | 7%
Manage to application power vs. max power | 12%
Cache power optimization (low V, asynch. etc.) | 5%

With $P \sim F^3$: $P = P_{orig}/(1+0.12+0.05+0.03+0.07+0.11+0.05)^3 = P_{orig}/2.92$
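The arithmetic behind the P ~ F^3 line in the power-reduction figure can be reproduced directly. The frequency gains are the six values listed in the figure. Using 300W as the starting point is an assumption here, taken from the "nearly 300W" scaling estimate in the text.

```python
# Frequency gains at 100W, in the order they appear in the figure's formula.
gains = [0.12, 0.05, 0.03, 0.07, 0.11, 0.05]

frequency_ratio = 1.0 + sum(gains)    # combined frequency headroom
power_divisor = frequency_ratio ** 3  # P ~ F^3, so power scales with the cube

scaled_estimate_w = 300.0             # assumed starting point (see lead-in)
managed_power_w = scaled_estimate_w / power_divisor
```

The divisor comes out close to the 2.92 quoted in the figure, and dividing the assumed 300W by it lands near the 100W limit.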

Latch schematic labels: thread switch (switch_pck: fires on thread switch); inactive thread storage; pulse clock (pck); scan; delay elements; signals in, ina, inb, in1, q, output, fb, ns, nsb, sin, sout, shift, pck, npck.

Resolver latch schematic labels: input clock domain (frequency = f1) and output clock domain (frequency = f2), each with a standard pulse clock (pck_in, pck_out); Din, input, output; small devices and large devices. Resolve tau = 5ps. Resolve time: $T_{resolve} = 1/f_1 - T_{hold1} - T_{su2}$
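The resolve-time expression from the resolver-latch figure can be combined with the commonly used textbook synchronizer model, MTBF = exp(T_resolve / tau) / (T_w * f_clk * f_data), to see why a 5ps tau makes one cycle of resolution time sufficient. The MTBF model is not taken from the paper, and the hold time, setup time, metastability window and data rate below are hypothetical values; only tau = 5ps and the resolve-time formula come from the figure.

```python
import math


def resolve_time_s(f1_hz, t_hold_s, t_setup_s):
    """Resolve time = 1/f1 - Thold1 - Tsu2, as given in the figure."""
    return 1.0 / f1_hz - t_hold_s - t_setup_s


def mtbf_s(t_resolve_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Textbook synchronizer mean time between failures (not from the paper)."""
    return math.exp(t_resolve_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


# Hypothetical operating point: 2 GHz clock, 50 ps hold and setup times.
t_res = resolve_time_s(f1_hz=2.0e9, t_hold_s=50e-12, t_setup_s=50e-12)

# tau = 5 ps is from the figure; the 20 ps window and 1 GHz data rate are assumed.
mtbf = mtbf_s(t_res, tau_s=5e-12, window_s=20e-12, f_clk_hz=2.0e9, f_data_hz=1.0e9)
```

Because the resolve time enters through an exponential, a tau that is small relative to the cycle time drives the failure rate to a negligible level under these assumptions.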


Figure 10.1.7: Die micrograph.

Die dimensions: 21.5 mm x 27.72 mm. Labeled blocks: Pipeline Control, Instr. Fetch, Branch Unit, Integer Datapath, Floating Point, ALAT, HPW, Bus Logic, 16KB L0I Cache, 16KB L0D Cache, 1MB L1I Cache, 256KB L1D Cache, L2 Tag, two 12MB L2 Caches, Bus Arbiter, Bus I/O, Foxton power management, Fuse, CLK. The micrograph numbers cache levels from L0, so its 12MB L2 cache corresponds to the 12MB 3rd-level cache in the text.

• Technology: 90nm bulk, 7 layers Cu
• 1.72B transistors
• 596mm²
• 2.0+GHz operation at self-selected voltage
• 100W electrical and thermal power limit
• Two 11-issue, 2-way TMT EPIC cores
• 3-level on-chip cache per core: 16K L1I, 16K L1D, 1MB L2I, 256K L2D, 12MB unified L3
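The headline figures in this summary are mutually consistent, which a short check makes visible. The sketch only recombines numbers stated in the paper (die dimensions from the micrograph, per-core cache sizes from the list above); the variable names are illustrative.

```python
# Die area from the micrograph dimensions, in mm^2.
die_area_mm2 = 21.5 * 27.72

# Per-core cache in KB: 16K L1I + 16K L1D + 1MB L2I + 256K L2D + 12MB L3.
per_core_kb = 16 + 16 + 1024 + 256 + 12 * 1024

# Two cores on the die, converted to MB.
total_cache_mb = 2 * per_core_kb / 1024
```

The area rounds to the quoted 596mm², and the cache total agrees with the 26.5MB stated in the introduction.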
