Page 1: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

The AMD Opteron™ CMP NorthBridge Architecture: Now and in the Future

Pat Conway & Bill Hughes

August, 2006

Page 2: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future

AMD Opteron™ – The Industry’s First Native Dual-Core 64-bit x86 Processor

Integration:
• Two 64-bit CPU cores
• 2MB L2 cache
• On-chip Router & Memory Controller

Bandwidth:
• Dual channel DDR (128-bit) memory bus
• 3 HyperTransport™ (HT) links (16-bit each x 2 GT/sec x 2)

Usability and Scalability:
• Socket compatible: Platform and TDP!
• Glueless SMP up to 8 sockets
• Memory capacity & BW scale w/ CPUs

Power Efficiency:
• AMD PowerNow!™ Technology with optimized power management
• Industry-leading system level power efficiency

Page 3: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...


AMD Opteron™ – The Industry’s First Native Dual-Core 64-bit x86 Processor

Page 4: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

A Clean Break with the Past

[Figure: side-by-side block diagrams of a legacy front-side-bus system and a Direct Connect system. Recoverable labels: MCP, Memory Controller Hub, XMB, PCIe™ Bridge, PCI-E Bridge, I/O Hub, USB, PCI, Chip X, SRQ, Crossbar, HT, Mem. Ctrlr, 8 GB/s links]

Legacy x86 Architecture
• 20-year old traditional front-side bus (FSB) architecture
• CPUs, Memory, I/O all share a bus
• Major bottleneck to performance
• Faster CPUs or more cores ≠ performance

AMD64’s Direct Connect Architecture
• Industry-standard technology
• Direct Connect Architecture reduces FSB bottlenecks
• HyperTransport™ interconnect offers scalable high bandwidth and low latency
• 4 memory controllers – increases memory capacity and bandwidth

Page 5: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...


4P System — Board Layout

Page 6: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

System Overview

[Figure: four-node system. Each node shows Core 0, Core 1, SRI, XBAR, MCT, HT links, and an HT host bridge (HT-HB), with attached DRAM and I/O. The on-chip blocks are labelled “NorthBridge”. Link labels: cHT (coherent HyperTransport) and ncHT (non-coherent HyperTransport).]

DRAM interface: 12.8 GB/s, 128-bit
HT link: 4.0GB/s per direction @ 2GT/s Data Rate
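The two bandwidth figures on this slide follow from bus width times transfer rate. A minimal sketch of that arithmetic; the function name is illustrative, and the 0.8 GT/s memory transfer rate is an assumption inferred from the 12.8 GB/s figure rather than something the slide states:

```python
def peak_bandwidth_gb_s(width_bits: int, gigatransfers_per_s: float) -> float:
    """Peak bandwidth in GB/s for a bus of the given width and transfer rate."""
    return width_bits / 8 * gigatransfers_per_s


# HyperTransport link: 16 bits wide at 2 GT/s, in each direction.
ht_per_direction = peak_bandwidth_gb_s(16, 2.0)
ht_both_directions = 2 * ht_per_direction

# Memory interface: 128 bits wide.
# ASSUMPTION: 0.8 GT/s, chosen because it reproduces the slide's 12.8 GB/s.
dram = peak_bandwidth_gb_s(128, 0.8)

print(f"HT per direction:   {ht_per_direction:.1f} GB/s")
print(f"HT both directions: {ht_both_directions:.1f} GB/s")
print(f"DRAM interface:     {dram:.1f} GB/s")
```

The both-directions HT figure matches the 8 GB/s link labels used in the system diagrams elsewhere in the deck.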

Page 7: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Northbridge Microarchitecture Overview

[Figure: northbridge block diagram. Blocks: Core0 with L2, Core1 with L2, System Request Interface (SRI), APIC (interrupts), Crossbar (XBAR), Memory Controller (MCT), DRAM Controller with 2 DDR2 channels, and links HT0, HT1, HT2]

Virtual Channel   Use                                            Data
Request           Read, Non-posted Write, Cache Block Commands   Y
Posted Request    Posted Writes                                  Y
Probe             Broadcast probes                               N
Response          Read Response, Probe response, Completion      Y

Page 8: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Northbridge Command Flow

[Figure: command path through the northbridge. Inputs: Core 0, Core 1, HT0 Input, HT1 Input, HT2 Input. Outputs: to Core, to DCT, HT0 Output, HT1 Output, HT2 Output.]

Core-side buffers: Victim Buffer (8-entry), Write Buffer (4-entry), Instruction MAB (2-entry), Data MAB (8-entry)
System Request Queue: 24-entry, with Address MAP & GART
XBAR: routers with 10-entry, 16-entry, 16-entry, 16-entry, and 12-entry buffers
Memory Command Queue: 20-entry

All buffers are 64-bit command/address

Page 9: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Northbridge Data Flow

[Figure: data path through the northbridge. Inputs: Core 0, Core 1, from Host Bridge, from DCT, HT0 input, HT1 input, HT2 input. Outputs: to Core, to Host Bridge, to DCT, HT0 output, HT1 output, HT2 output.]

Core-side buffers: Victim Buffer (8-entry), Write Buffer (4-entry)
XBAR: 5-entry, 8-entry, 8-entry, 8-entry, and 8-entry buffers
System Request Data Queue: 12-entry
Memory Data Queue: 8-entry

All buffers are 64-byte cache lines

Page 10: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Lessons Learned #1
Allocation of XBAR Command buffer across Virtual Channels can have big impact on performance

[Figure: Opteron read transaction in a four-node system (P0 to P3, each with an L2 and Memory 0 to 3). Numbered message labels recoverable from the figure: 1: RD, 2: RD, 3: RD, 3: PI0, 3: PI1, 3: PI2, 4: PI1, 4: RP0, 4: RP2, 4: D, 5: RP0, 5: RP1, 5: D, 6: D, 7: SD, 8: SD]

Legend: Request (2 visits), Probe (3 visits), Response (8 visits)

MP traffic analysis gives the best allocation, e.g. Opteron Read Transaction
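The slide gives the visit counts but not the allocation that was chosen. As a rough illustration only, the sketch below divides a command buffer across virtual channels in proportion to those visit counts. The 16-entry size, the one-entry minimum per channel, the largest-remainder rounding, and the function name are all assumptions made for the example, not AMD's method:

```python
def allocate_buffers(visits: dict[str, int], total_entries: int) -> dict[str, int]:
    """Split total_entries across channels in proportion to visit counts.

    Every channel first gets one entry; the rest is shared proportionally,
    with leftovers assigned by the largest-remainder method.
    """
    if total_entries < len(visits):
        raise ValueError("need at least one entry per channel")
    total_visits = sum(visits.values())
    if total_visits == 0:
        raise ValueError("visit counts must not all be zero")

    spare = total_entries - len(visits)  # entries left after the minimum
    shares = {ch: spare * v / total_visits for ch, v in visits.items()}
    alloc = {ch: 1 + int(share) for ch, share in shares.items()}

    # Hand any remaining entries to the channels with the largest fractions.
    leftover = total_entries - sum(alloc.values())
    by_fraction = sorted(
        shares, key=lambda ch: shares[ch] - int(shares[ch]), reverse=True
    )
    for ch in by_fraction[:leftover]:
        alloc[ch] += 1
    return alloc


# Visit counts for one read transaction, taken from the slide.
read_visits = {"Request": 2, "Probe": 3, "Response": 8}
allocation = allocate_buffers(read_visits, 16)
print(allocation)
```

A real design would weight several transaction types by their frequency in the workload, which is what the slide's "MP traffic analysis" suggests.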

Page 11: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Lessons Learned #2
Memory Latency is the Key to Application Performance!

[Figure: Performance vs Average Memory Latency (single 2.8GHz core, 400MHz DDR2 PC3200, 2GT/s HT with 1MB cache in MP system). X axis: 1N, 2N, 4N (SQ), 8N (TL), 8N (L). Y axes: System Performance (0 to 7) and Processor Performance (0% to 120%). Series: OLTP1, OLTP2, SW99, SSL, JBB. Topology insets: 1 Node, 2 Node, 4 Node Square, 8N Twisted Ladder, 8N Ladder.]

Topology   AvgD       Latency
1N         0 hops     x + 0ns
2N         0.5 hops   x + 17ns (47 cpuclk)
4N (SQ)    1 hop      x + 44ns (124 cpuclk)
8N (TL)    1.5 hops   x + 76ns (214 cpuclk)
8N (L)     1.8 hops   x + 105ns (234 cpuclk)
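The slide quotes each added latency both in nanoseconds and in core clocks. A small sketch of the conversion, assuming the 2.8GHz core clock from the figure caption; the names are illustrative:

```python
CORE_GHZ = 2.8  # core clock from the figure caption

# Added latency over the local-memory latency x, in ns, as given on the slide.
added_latency_ns = {
    "1N": 0,
    "2N": 17,
    "4N (SQ)": 44,
    "8N (TL)": 76,
    "8N (L)": 105,
}


def ns_to_core_clocks(ns: float, ghz: float = CORE_GHZ) -> float:
    """Convert nanoseconds to core clock cycles (1 GHz is 1 cycle per ns)."""
    return ns * ghz


clocks = {topo: ns_to_core_clocks(ns) for topo, ns in added_latency_ns.items()}
for topo, clk in clocks.items():
    print(f"{topo:8s} +{added_latency_ns[topo]:3d} ns = {clk:6.1f} core clocks")
```

The 2N, 4N (SQ), and 8N (TL) results land within a clock or two of the slide's cpuclk figures. For 8N (L), 105ns works out to about 294 clocks rather than the 234 shown, so one of those two numbers in the transcript may be a typo; the transcript does not say which.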

Page 12: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Looking Forward

Page 13: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

HyperTransport™-based Accelerators
Imagine it, Build it

• Open platform for system builders (“Torrenza”)
  – 3rd Party Accelerators
  – Media
  – FLOPs
  – XML
  – SOA
• AMD Opteron™ Socket or HTX slot
• HyperTransport interface is an open standard, see hypertransport.org
• Coherent HyperTransport interface available if the accelerator caches system memory (under license)

[Figure: two native dual-core processors (SRQ, Crossbar, HT, Mem. Ctrlr) with 8 GB/s links to a FLOPs Accelerator, SOA Accelerator, XML Accelerator, Other Optimized Silicon, PCI-E Bridges, and an I/O Hub (USB, PCI)]

Page 14: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

AMD’s Next Generation Processor Technology

Native quad core die
Ideal for 65nm SOI and beyond

IPC enhanced CPU cores
• 32B instruction fetch
• Improved branch prediction
• Out-of-order load execution
• Up to 4 DP FLOPS/cycle
• Dual 128-bit SSE dataflow
• Dual 128-bit loads per cycle
• Bit Manipulation extensions (LZCNT/POPCNT)
• SSE extensions (EXTRQ/INSERTQ, MOVNTSD/MOVNTSS)

Enhanced Direct Connect Architecture and Northbridge
• Four ungangable x16 HyperTransport™ links (up to 5.2GT/sec)
• Enhanced crossbar
• Next-generation memory support
• FBDIMM when appropriate
• Enhanced power management and RAS

Expandable shared L3 cache

Page 15: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Balanced, Highly Efficient Cache Structure

Efficient memory handling reduces the need for “brute force” cache sizes

[Figure: cache hierarchy for Core 1 to Core 4. Each core has its own L1 (64KB) and L2 (512KB); Cache Control and a shared L3 (2+MB)]

Dedicated L1
• Locality keeps most critical data in the L1 cache
• Low latency
• 2 128 bit data paths
• 2 loads per cycle

Page 16: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Balanced, Highly Efficient Cache Structure

Efficient memory handling reduces the need for “brute force” cache sizes

[Figure: cache hierarchy for Core 1 to Core 4. Each core has its own L1 (64KB) and L2 (512KB); Cache Control and a shared L3 (2+MB)]

Dedicated L2
• Sized to accommodate the majority of working sets today
• Dedicated to help eliminate conflicts common in shared caches

Page 17: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Balanced, Highly Efficient Cache Structure

Efficient memory handling reduces the need for “brute force” cache sizes

[Figure: cache hierarchy for Core 1 to Core 4. Each core has its own L1 (64KB) and L2 (512KB); Cache Control and a shared L3 (2+MB)]

Shared L3 – Coming Soon
• Allocation policy which optimizes movement, placement and replication of data for multi-core
• Ready for expansion

Page 18: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Additional HyperTransport™ Ports

• Enable Fully Connected 4 Node (four x16 HT) and 8 Node (eight x8 HT)
• Reduced network diameter – Fewer hops to memory
• Increased Coherent Bandwidth
  – more links
  – cHT packets visit fewer links
  – HyperTransport3
• Benefits
  – Low latency because of lower diameter
  – Evenly balanced utilization of HyperTransport links
  – Low queuing delays

Low latency under load

[Figure: link configuration options, Four x16 HT OR Eight x8 HT]
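One way to see why the two link configurations pair with 4-node and 8-node systems is to count ports. In a fully connected system each socket needs a link to every other socket. The sketch below assumes each x16 link splits into exactly two x8 links and that whatever is left over is available for I/O; the function names are illustrative:

```python
X16_LINKS_PER_SOCKET = 4  # from the slides: four ungangable x16 links


def ports_available(unganged: bool) -> int:
    """Number of HT ports per socket.

    ASSUMPTION: unganging splits each x16 link into two x8 links.
    """
    return X16_LINKS_PER_SOCKET * (2 if unganged else 1)


def spare_ports_when_fully_connected(nodes: int, unganged: bool) -> int:
    """Ports left over after linking one socket to every other socket.

    A negative result means full connectivity is not possible.
    """
    return ports_available(unganged) - (nodes - 1)


print(spare_ports_when_fully_connected(4, unganged=False))  # 4 nodes, x16 links
print(spare_ports_when_fully_connected(8, unganged=True))   # 8 nodes, x8 links
print(spare_ports_when_fully_connected(8, unganged=False))  # 8 nodes, x16 links
```

Under these assumptions both the 4-node x16 and the 8-node x8 configurations leave one port per socket free, while eight nodes cannot be fully connected with ganged x16 links.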

Page 19: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

4 Node Performance

Configuration                      Diam   Avg Diam   XFIRE BW
4N SQ (2GT/s HyperTransport)       2      1.00       14.9GB/s
4N FC (2GT/s HyperTransport)       1      0.75       29.9GB/s (2X)
4N FC (4.4GT/s HyperTransport3)    1      0.75       65.8GB/s (4X)

4N SQ + 2 EXTRA LINKS gives 4N FC (2X); 4N FC W/ HYPERTRANSPORT3 gives (4X).

XFIRE (“crossfire”) BW is the link-limited all-to-all communication bandwidth (data only)

[Figure: 4 Node Square and 4 Node fully connected topologies, P0 to P3 with I/O, links labelled 16]
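The Diam and Avg Diam columns can be reproduced with a breadth-first search over each topology. In this sketch the average is taken over every source and destination pair, counting a node's access to its own memory as 0 hops, which is the convention that matches the slide's numbers. The exact pairing of nodes in the square is an assumption; any 4-node ring gives the same result.

```python
from collections import deque
from itertools import combinations


def from_links(nodes, links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def hop_counts(adjacency, source):
    """Breadth-first search: hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter_and_average(adjacency):
    """Return (diameter, average hops) for a connected topology.

    The average includes each node's 0-hop distance to itself.
    """
    all_hops = [
        hops for src in adjacency for hops in hop_counts(adjacency, src).values()
    ]
    return max(all_hops), sum(all_hops) / len(all_hops)


nodes = range(4)
# ASSUMPTION: the square links P0-P1, P1-P3, P3-P2, P2-P0.
square = from_links(nodes, [(0, 1), (1, 3), (3, 2), (2, 0)])
fully_connected = from_links(nodes, combinations(nodes, 2))

print(diameter_and_average(square))           # (2, 1.0)
print(diameter_and_average(fully_connected))  # (1, 0.75)
```

The same function applied to eight fully connected nodes gives an average of 0.875, which rounds to the 0.88 quoted for 8N FC.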

Page 20: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

8 Node Performance

Configuration                       Diam   Avg Diam   XFIRE BW
8N TL (2GT/s HyperTransport)        3      1.62       15.2GB/s
8N 2x4 (4.4GT/s HyperTransport3)    2      1.12       72.2GB/s (5X)
8N FC (4.4GT/s HyperTransport3)     1      0.88       94.4GB/s (6X)

[Figure: three topologies for P0 to P7 with I/O: 8N Twisted Ladder (links labelled 16), 8 Node 6HT 2x4, OR 8 Node Fully Connected (links labelled 8)]

Page 21: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Why Quad-Core?

[Figure: bar chart of Performance and Power relative to baseline, y axis 0 to 2]

Max Frequency: 1.00x

Baseline is 2 Node x 2 Core blade running OLTP

Page 22: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Increasing Frequency

[Figure: bar chart of Performance and Power relative to baseline, y axis 0 to 2]

Increased Freq (+16%): Performance 1.14x, Power 1.51x
Max Frequency: 1.00x

Baseline is 2 Node x 2 Core blade running OLTP

Page 23: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Decreasing Frequency

[Figure: bar chart of Performance and Power relative to baseline, y axis 0 to 2]

Increased Freq (+16%): Performance 1.14x, Power 1.51x
Max Frequency: 1.00x
Decreased Freq (-16%): Performance 0.87x, Power 0.49x

Baseline is 2 Node x 2 Core blade running OLTP

Page 24: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Quad-Core
Higher Performance within a Fixed Power Budget

[Figure: bar chart of Performance and Power relative to baseline, y axis 0 to 2]

Increased Freq (+16%): Performance 1.14x, Power 1.51x
Max Frequency: 1.00x
Decreased Freq (-16%): Performance 0.87x, Power 0.49x
Quad-Core: Performance 1.73x, Power 0.99x

Baseline is 2 Node x 2 Core blade running OLTP
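The chart's argument is easier to compare as performance per unit of power. The sketch below uses only the relative figures from the slide. The last step, doubling the decreased-frequency point to approximate twice the cores at the lower clock, is a naive reading added here for illustration and is not stated in the deck:

```python
# (performance, power) relative to the baseline at max frequency, from the slide.
configs = {
    "Max Frequency": (1.00, 1.00),
    "Increased Freq (+16%)": (1.14, 1.51),
    "Decreased Freq (-16%)": (0.87, 0.49),
    "Quad-Core": (1.73, 0.99),
}


def perf_per_power(perf: float, power: float) -> float:
    """Relative performance delivered per unit of relative power."""
    return perf / power


efficiency = {name: perf_per_power(*pp) for name, pp in configs.items()}
for name, value in efficiency.items():
    print(f"{name:22s} {value:.2f}x performance per unit power")

# ASSUMPTION: twice the cores at the reduced frequency, with perfect scaling
# of both performance and power. This is a rough model, not AMD's.
dec_perf, dec_power = configs["Decreased Freq (-16%)"]
doubled = (2 * dec_perf, 2 * dec_power)
print(f"Doubled low-frequency point: {doubled[0]:.2f}x perf, {doubled[1]:.2f}x power")
```

Under that rough model the doubled point comes out at 1.74x performance and 0.98x power, close to the quad-core bar of 1.73x and 0.99x.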

Page 25: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Clock and Power Planes

[Figure: clock and power plane diagram. Blocks: Core 0, Core 1, Core 2, Core 3 (each with its own PLL), Northbridge, HyperTransport Links, DIMMs, Misc, additional PLLs, VRM, SVI. Supply labels: VDD, VDDNB, VDDA, VHT, VDDIO, VTT]

Page 26: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

DICE: Dynamic Independent Core Engagement

Ability to dynamically and individually adjust core frequencies to improve power efficiency

[Figure: four cores at 100% Workload, 100% Workload, 100% Workload, 100% Workload; 100% Power State]

Page 27: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

DICE: Dynamic Independent Core Engagement

Ability to dynamically and individually adjust core frequencies to improve power efficiency

[Figure: four cores at 100% Workload, 33% Workload, 33% Workload, 33% Workload; 60% Power State]

Page 28: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

DICE: Dynamic Independent Core Engagement

Ability to dynamically and individually adjust core frequencies for improved power efficiency

[Figure: four cores at 100% Workload, 50% Workload, Halted, Halted; 45% Power State]

Page 29: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...

Enjoy the rest of the conference!

www.amd.com/power

Page 30: HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...


Trademark Attribution

AMD, the AMD Arrow, AMD Opteron, AMD PowerNow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport and HTX are trademarks of the HyperTransport Consortium. PCIe is a trademark of the PCI-SIG. Other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

