
HC17.S8T1. Intel 8xx series and Paxville Xeon-MP Microprocessors


Intel 8xx series and Paxville Xeon-MP Microprocessors

Jonathan Douglas
Intel Corporation

Thanks to: Justin Marquart, James Vogeltanz, Mike Grassi, DEG/BCG package design, Donald Parker & Benson Inkley for help in putting together this presentation.

Outline

- Why the move to multi-core
- Overview of 8xx series Pentium 4
- Challenges in moving CPU infrastructure to multi-core
- Learnings from the 8xx series Pentium 4 design
- Overview of Paxville-MP processor
- Going forward with multi-core designs
- Conclusion

Why rapid move to Dual-Core

- Single core designs hitting power wall.
- Need more power efficient way to manage OS loading.
- Natural extension of software migration to multi-threaded apps.
- More threads in 1 core is complex and taxes core resources heavily.
- Competitive response.

Overview of 8xx series Pentium 4 processor

- Dual-Core/Multi-Threaded Pentium® 4 Processor on 90nm process
- 2 x 1M caches, speeds to 3.2GHz, support for overclocking, up to 4 threads.
- Shared 800MHz quad-pumped FSB.
- Independent bus tuning per agent
- Enhanced auto-halt and 2-state speed step power management
- Independent events supported per core.
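
As a rough illustration of what a shared 800MHz quad-pumped FSB implies for two cores, the sketch below computes peak bus bandwidth. The 64-bit data bus width and the 200MHz base clock (800 divided by 4) are assumptions not stated on the slide, and the even split between cores is only a simplification for the case where both cores are saturating the bus.

```python
def fsb_peak_bandwidth_gbs(base_clock_mhz: float,
                           transfers_per_clock: int,
                           bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a multi-pumped bus."""
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


# Assumed values: 200 MHz base clock, quad-pumped, 64-bit data bus.
peak = fsb_peak_bandwidth_gbs(200.0, 4, 64)

# Simplification: two cores contending equally for the one shared bus.
per_core = peak / 2

print(peak)      # 6.4
print(per_core)  # 3.2
```

Under these assumptions, each core sees at most half of the bus when both are busy, which is the cost of reusing the existing bus rather than adding bandwidth.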

High level block diagram

[Figure: two identical cores on the System Bus, with Core-To-Core Communication between them. Blocks labeled in each core: L2 Cache and Control, FP RF, FMul, FAdd, MMX, SSE, FP move, FP store, L1 D-Cache and D-TLB, Store AGU, Load AGU, Schedulers, Integer RF, four ALUs, Trace Cache, Rename/Alloc, uop Queues, BTB, uCode ROM, Decoder, BTB & I-TLB.]

Zoom in on Front-Side Bus

Why the shared bus design

- Time to market a critical factor
- Leverages existing P4 core
- Uses existing 775-LGA socket
- P4 core already has right feature set
- P4 FSB already 4-way compliant.
- Already architected with thread independent power management.
- Already ‘HT’ so 2 cores = 4 threads
- Gives independent caches
- Plus no extra latency to external memory.

Dual core performance

1) Intel® Pentium® 4 Processor with HT Technology Extreme Edition 3.73GHz (2 MB L2 Cache, 1066 MHz FSB) and Intel® 925XE Express Chipset
2) Intel® Pentium® Processor Extreme Edition 840 (2x1 MB L2 Cache, 3.20 GHz, 800 MHz FSB, HT Technology) and Intel® 955X Express Chipset

Challenges in migrating to multi-core

- Rapid movement from single core design to multi-core design presented many complexities
- Already existing platform hardware
- Factory already populated with manufacturing hardware
- Test database developed for single core
- Tight package dimensions
- Little power headroom left

Package issue

- Package design a huge challenge
- More layers required (just address/data alone is >100 more signals)
- Same package cavity and pinout; couldn’t grow.
- New IHS (Integrated Heat Spreader) required for thicker package
- Power cap placement can’t be centered over both cores
- Existing signals on 4 sides of core cause power bus routing voids.
- No logic outside core. Any needed logic must be in core. Lots of ‘special signal’ headaches like thermal diode, ODT (On-Die Termination).

Power constraints

- Existing platform dictated 1 power plane for both cores
- Penalized for 2X leakage, required architecting a speed-step protocol
- 2 cores powering up & fully active cause large di/dt events
- Required Voltage Regulator mods to grow headroom to 125A plus silver box restrictions
- Required BIOS change to boot to low voltage/frequency on performance parts.
- BIOS initiates speedstep event to all threads after completion

[Figure: 2-core boot to full speed, weak power supply]

Test issues

- Thousands of hours invested in single core coverage database
- Copied core design a plus
- Needed to add ‘core swap & kill’ hardware to reuse database
- Existing single core test can’t expose problems on core->core interaction
- Voltage transients, thermal gradient
- Some explicit dual core content required

Test flow example

1. Sort Die
2. Independent Screen/Speed Grade Core0
3. Switch Cores
4. Screen/Speed Grade Core1
5. Multi-Core Content Run
6. Check Power of 2 cores
7. Check Speed Step voltage
8. Snap to lowest Speed
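
The flow above can be sketched as a small grading function. This is only an illustration under stated assumptions: the `CoreResult` type, the `grade_die` name, the power limit, and all numeric values are hypothetical, and the speed-step voltage check is omitted because the slide gives no detail on it. The one rule taken directly from the slide is the last step: the part is graded at the speed of its slower core.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreResult:
    passed_screen: bool    # did this core pass the single-core screen?
    max_speed_ghz: float   # highest speed grade at which this core passed


def grade_die(core0: CoreResult,
              core1: CoreResult,
              multicore_content_passed: bool,
              total_power_w: float,
              power_limit_w: float) -> Optional[float]:
    """Return the speed grade for the part in GHz, or None if it is rejected."""
    # Each core is screened and speed graded on its own.
    if not (core0.passed_screen and core1.passed_screen):
        return None
    # Explicit dual-core content covers core-to-core interaction.
    if not multicore_content_passed:
        return None
    # Combined power of both cores must fit the (hypothetical) limit.
    if total_power_w > power_limit_w:
        return None
    # Snap to the lowest speed: the slower core sets the grade.
    return min(core0.max_speed_ghz, core1.max_speed_ghz)


# Hypothetical part: core0 grades at 3.2 GHz, core1 only at 3.0 GHz.
print(grade_die(CoreResult(True, 3.2), CoreResult(True, 3.0), True, 120.0, 130.0))  # 3.0
```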

Thermal issue

- Platforms support only 1 ADC for thermal monitoring
- 2 cores can create many different thermal profiles
- Diode temp to junction hot spot delta can vary depending on workload & core utilized
- Required thermal protection to be independent on both cores
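
A minimal sketch of why a single diode reading is not enough and protection has to be per core. The function names, the trip point, and every temperature below are hypothetical values chosen for illustration; they are not taken from the presentation.

```python
from typing import List


def diode_based_trip(diode_temp_c: float, trip_point_c: float) -> bool:
    """Protection decision made from the one platform-visible diode."""
    return diode_temp_c >= trip_point_c


def per_core_trips(core_hotspot_temps_c: List[float], trip_point_c: float) -> List[bool]:
    """Independent protection decision for each core's own hot spot."""
    return [temp >= trip_point_c for temp in core_hotspot_temps_c]


# Hypothetical case: only core 1 is loaded, so the diode-to-hot-spot
# delta is large and the diode under-reports the hottest junction.
diode_temp = 78.0
hotspots = [80.0, 96.0]
trip_point = 95.0

print(diode_based_trip(diode_temp, trip_point))  # False
print(per_core_trips(hotspots, trip_point))      # [False, True]
```

In this made-up case the diode alone would not trigger protection, while the per-core check catches the hot core.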

Thermal gradients

[Figure: two die temperature contour maps, labeled "Core0 only" and "Dual Core", with the thermal diode location marked. Legend bands run from 66.00 to 100.00 on one map and from 72.00 to 94.00 on the other.]

Limitations of shared bus

- 2 loads on bus = less bus speed.
- Plus 1M cache = more bus traffic. Double whammy.
- Difficult package design
- ~2x traces to same number of pins
- Thermal & electrical properties degrade.
- Slow down penalizes both cores.
- Segregated die.
- Test overhead. Slowest die constrains final product.

Overview of Paxville-MP processor

- Dual-Core/Multi-Threaded Xeon Processor on 90nm process
- 2 x 2M caches, 667MHz min FSB, up to 4 threads.
- Platform still 4-P compatible for up to 16 threads per platform
- Dual bus platform, 2 CPU agents per bus
- Only 1 load presented to system by CPU
- Enhanced auto-halt and 2-state speed step power management
- Independent events supported per core.
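
The thread count quoted for the platform follows from simple multiplication. The sketch below spells it out; the function name is hypothetical, and reading "up to 4 threads" on a dual-core part as 2 threads per core is an inference from the slide rather than something it states directly.

```python
def platform_threads(buses: int,
                     cpus_per_bus: int,
                     cores_per_cpu: int,
                     threads_per_core: int) -> int:
    """Total hardware threads across every CPU on the platform."""
    return buses * cpus_per_bus * cores_per_cpu * threads_per_core


# Dual bus platform, 2 CPU agents per bus, dual-core, 2 threads per core.
print(platform_threads(2, 2, 2, 2))  # 16
```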

Advantages of new Paxville design

- Single CPU load on bus. Allows faster bus, less electrical load.
- 8 agents (16 threads) on top end platform
- Larger cache = fewer FSB bottlenecks
- Better package design
- Fewer traces allow better power delivery
- Integrated die (monolithic)
- Consolidated bus logic allows test enhancements

Paxville consolidated bus

Challenges with Paxville design

- Degraded I/O timing with shared bus
- Requires extra logic & routing but must be compatible with existing bus timing.
- Requires circuit tricks for quad pumped bus.
- Enhancements to validation tools
- 8xx series treated as 2 independent CPUs.
- Paxville is integrated, 1 die.
- Additional complexity in test infrastructure.
- New test modes & consolidated bus logic.

Going forward with multi-core

Solving bus bottlenecks.

- Integrate next level cache for less bus traffic.
  - Downside is higher latency on cache misses.
  - Upside is lower pin count & can stay with a flexible bus architecture
  - Cache thrashing by multiple cores an issue if size isn’t large enough; swamps bus again.
- ‘Point-to-point’ busses & memory controllers
  - Upside is no bus traffic collisions
  - Downsides are being locked into memory protocol and a huge pin count increase.

Going forward with multi-core

Solving power issues.

- Need better power state management
- Single voltage plane is an issue; can’t drop leakage on inactive cores
- Need more intelligence in controller
- Segment products with power in mind
- Typically done more now on speed/feature set.
- Can a microprocessor be ‘tuned’ for a power segment?

SpeedStep protocol

[Figure: core activity over time.
Core0: high activity, asleep, low activity.
Core1: high activity, asleep, low activity, asleep, high activity.
Voltage: high, low, high, low.]

Limited opportunities to reduce power, much harder with even more cores
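
A minimal sketch of why a single voltage plane leaves few chances to save power. It assumes a simple policy, not confirmed by the slides, in which the shared plane must sit at high voltage whenever any core is in high activity and may drop to low voltage otherwise. The function names and activity timelines are made up for illustration.

```python
from typing import Sequence


def shared_plane_voltage(core_states: Sequence[str]) -> str:
    """Assumed policy: one plane, so the most demanding core sets the voltage."""
    return "high" if "high" in core_states else "low"


def low_voltage_fraction(timelines: Sequence[Sequence[str]]) -> float:
    """Fraction of time steps in which the shared plane can run at low voltage."""
    steps = list(zip(*timelines))  # one tuple of per-core states per time step
    low_steps = sum(1 for states in steps if shared_plane_voltage(states) == "low")
    return low_steps / len(steps)


# Hypothetical activity per time step: "high", "low", or "asleep".
core0 = ["high", "asleep", "low", "asleep"]
core1 = ["high", "low", "asleep", "high"]
core2 = ["low", "high", "asleep", "low"]

print(low_voltage_fraction([core0, core1]))         # 0.5
print(low_voltage_fraction([core0, core1, core2]))  # 0.25
```

With these made-up timelines, adding a third core halves the time the plane can spend at low voltage, since any one busy core holds the voltage up for all of them.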

Going forward with multi-core

- Core counts will continue to increase.
- Higher threaded applications give opportunity to have better power / performance.
- Power is wasted when a core that isn’t working on a thread is alive, but performance is wasted if OS has to continually swap out threads.
- Expect that logic to ‘glue’ cores together will become as critical as the core
- Need lots of sophistication to take full advantage of a high core count
- Need busses capable of handling the high traffic to memory

