Bobcat hotchips final 8 2 10

Page 1: Bobcat hotchips final 8 2 10

| Bobcat | Hot Chips 2010 1

“Bobcat”: AMD’s New Low Power x86 Core Architecture

Brad Burgess, AMD Fellow, Chief Architect / Bobcat Core

August 24, 2010

Page 2: Bobcat hotchips final 8 2 10

Two x86 Cores Tuned for Target Markets

“Bulldozer”: Performance & Scalability

Mainstream Client and Server Markets

“Bobcat”: Flexible, Low Power & Small

Low Power Markets

Small Die Area

Cloud Clients Optimized

Page 3: Bobcat hotchips final 8 2 10

Bobcat Design Goals

A small, efficient, low power x86 core

Excellent performance

Synthesizable with small number of custom arrays

Easily Portable across process technologies

Page 4: Bobcat hotchips final 8 2 10

Feature Set

64-bit AMD64 x86 ISA

SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A

Virtualization

Support for misaligned 128-bit data types

Instruction Based Sampling (for dynamic optimization)

C6 (with integrated power gating)

Page 5: Bobcat hotchips final 8 2 10

Micro-architecture Overview

Dual x86 instruction decode

Out-of-Order instruction execution

Dual COP retirement

Complex microOPs

State of the art branch prediction

Aggressive OOO load/store engine w/ hazard prediction

Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration

Low power C6 state w/ core level power gating and state save acceleration

Page 6: Bobcat hotchips final 8 2 10

Bobcat Micro-Architecture

[Figure: core block diagram. Front end: 32KB ICACHE, ITLB, Branch Predictor (Branch Locator, Condition Predict, Return Stack, Dynamic Target), Fetch Queue, Instr Queue, Dual x86 Decoder, uCode, ROB. Integer: Int Rename, two Schedulers, Int PRF, two ALUs, Mul, LAGU, SAGU. Floating point: FP Decode, FP Rename, FP Sched, FP PRF, FPAdd, FPMul, two MMX ALUs, two FP Logical units, St Conv, IntMul. Memory: LdSt Unit, 32KB DCACHE, DTLB, Table Walker, Prefetch, BU, 512KB L2 CACHE, to/from Northbridge.]

Page 7: Bobcat hotchips final 8 2 10

Instruction Cache:

32Kbyte

2-way set associative

64-byte line

Parity Protected

512/8 entry ITLB (4k/2m)

Fetch up to 32-bytes/cycle

Page 8: Bobcat hotchips final 8 2 10

Branch Predictor:

Predicts up to two branches per cycle

Remembers branch instruction locations

Return Stack Address Predictor

Indirect Dynamic Address Predictor

State of the Art condition Predictor

Only necessary structures are clocked

Page 9: Bobcat hotchips final 8 2 10

Dual x86 Decoder:

Scans up to 22 bytes

Decodes up to two x86 instructions per cycle

The decoder can directly map 89% of x86 instructions to a single microOp, an additional 10% to a pair of microOps, and more complicated x86 instructions (<1%) are microcoded. (Dynamic Instruction Counts)
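The decode mix above implies an average microOp count per x86 instruction. The following is a minimal sketch of that arithmetic, not a description of the decoder itself. The slide only says "<1%" for microcoded instructions, so the 1% fraction is used as an upper bound, and the microcode routine length (`microcode_len`) is a purely hypothetical parameter.

```python
def avg_microops(single=0.89, double=0.10, microcoded=0.01, microcode_len=4.0):
    """Expected microOps per x86 instruction for a given dynamic decode mix.

    single, double: fractions from the slide (89% and 10%).
    microcoded:     assumed 1% (the slide only states "<1%").
    microcode_len:  hypothetical average microOps per microcoded instruction.
    """
    total = single + double + microcoded
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return single * 1 + double * 2 + microcoded * microcode_len


if __name__ == "__main__":
    print(f"average microOps per instruction: {avg_microops():.2f}")
```

Under these assumptions the average stays close to one microOp per instruction; the result is only as good as the assumed microcode length.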

Page 10: Bobcat hotchips final 8 2 10

Integer Execution:

A dual port integer scheduler feeds two ALUs

A dual port address scheduler feeds a load address unit, and a store address unit.

Physical Register File uses maps and pointers to reduce power by minimizing data copying/movement.

Page 11: Bobcat hotchips final 8 2 10

Floating Point Unit:

A centralized FP scheduler feeds two 64-bit FP execution stacks

MMX and Logical units are replicated in both stacks

The FP Mul Unit can perform two SP multiplies per cycle

The FP Add Unit can perform two SP additions per cycle

A physical register file is used to reduce power
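The per-cycle rates above translate into a peak single-precision throughput figure. This sketch only does the arithmetic; the clock frequency is a hypothetical input (the slides do not state one), and the pass count for a packed operation is simple lane arithmetic rather than a statement about how the hardware issues 128-bit operations.

```python
import math

SP_MULS_PER_CYCLE = 2  # from the slide: FP Mul Unit, two SP multiplies per cycle
SP_ADDS_PER_CYCLE = 2  # from the slide: FP Add Unit, two SP additions per cycle


def peak_sp_gflops(clock_ghz):
    """Peak single-precision GFLOPS if both units are busy every cycle."""
    return (SP_MULS_PER_CYCLE + SP_ADDS_PER_CYCLE) * clock_ghz


def passes_for_packed_op(lanes=4, lanes_per_cycle=2):
    """Cycles of unit occupancy for a packed op with `lanes` SP elements."""
    return math.ceil(lanes / lanes_per_cycle)


if __name__ == "__main__":
    clock_ghz = 1.6  # hypothetical clock, chosen only for illustration
    print(f"peak SP throughput at {clock_ghz} GHz: {peak_sp_gflops(clock_ghz):.1f} GFLOPS")
    print(f"passes for a 4-lane packed SP op: {passes_for_packed_op()}")
```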

Page 12: Bobcat hotchips final 8 2 10

Data Cache:

32-Kbyte

8-way set associative

64-byte line

Parity Protected

Copyback

40/8 entry L1DTLB (4k/2m)

512/64 entry L2DTLB (4k/2m)

Advanced 8-stream prefetcher
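The cache and TLB parameters above determine the set count, the address bit split, and how much memory each TLB level can map. The sketch below derives those numbers; the function names are illustrative, and it assumes power-of-two sizes and a conventional set-associative layout, which the slide does not spell out.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_bytes


KB, MB = 1024, 1024 * 1024

if __name__ == "__main__":
    sets, offset_bits, index_bits = cache_geometry(32 * KB, ways=8, line_bytes=64)
    print(f"L1 data cache: {sets} sets, {offset_bits} offset bits, {index_bits} index bits")
    print(f"L1 DTLB reach (4K pages): {tlb_reach(40, 4 * KB) // KB} KB")
    print(f"L2 DTLB reach (4K pages): {tlb_reach(512, 4 * KB) // MB} MB")
    print(f"L1 DTLB reach (2M pages): {tlb_reach(8, 2 * MB) // MB} MB")
    print(f"L2 DTLB reach (2M pages): {tlb_reach(64, 2 * MB) // MB} MB")
```

For the data cache, the offset and index bits together span 12 bits, the same width as a 4K page offset. That is an observation from the arithmetic; the slides do not say whether the design relies on it.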

Page 13: Bobcat hotchips final 8 2 10

Out-of-Order Load Store Unit:

Loads bypassing loads

Loads bypassing stores

Stores bypassing loads

Bypass tracking and dependency correction

Hazard predictor

Fast store forwarding

Fast critical word fill forwarding
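Store forwarding, listed above, can be illustrated with a small software model: a load first checks older stores that have not yet reached the cache and takes the youngest matching one. This is a generic textbook sketch, not Bobcat's implementation; the `Store` record, the queue layout, and the restriction to same-size aligned accesses are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Store:
    addr: int
    value: int


def load(addr, store_queue, memory):
    """Value seen by a load that is younger than every store in store_queue.

    store_queue is ordered oldest to youngest; memory is a dict standing in
    for the cache. Partial overlaps and differing access sizes are ignored.
    """
    for st in reversed(store_queue):  # search youngest store first
        if st.addr == addr:
            return st.value  # forwarded from the store queue
    return memory.get(addr, 0)  # no match: read from memory


if __name__ == "__main__":
    memory = {0x10: 1}
    pending = [Store(0x10, 5), Store(0x20, 7), Store(0x10, 9)]
    print(load(0x10, pending, memory))  # forwarded from the youngest store to 0x10
    print(load(0x30, pending, memory))  # no pending store, falls through to memory
```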

Page 14: Bobcat hotchips final 8 2 10

L2 Cache:

512Kbyte

16-way set associative

64 byte lines

ECC Protected

Half speed clocking for power reduction

Page 15: Bobcat hotchips final 8 2 10

Bus Unit:

8-outstanding data accesses

2-outstanding fetch accesses

Eviction Buffers

Fill Buffers

Write combining buffers

Coherency management
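The limit of 8 outstanding data accesses bounds sustainable miss bandwidth through Little's law: bytes in flight divided by latency. The sketch below applies that relation. It assumes each outstanding access moves one 64-byte line, and the memory latency is a hypothetical value, since the slides give neither.

```python
LINE_BYTES = 64  # cache line size from the cache slides


def max_miss_bandwidth_gbps(outstanding, latency_ns, line_bytes=LINE_BYTES):
    """Upper bound on miss bandwidth in GB/s (bytes per nanosecond).

    Little's law: throughput = items in flight / latency.
    """
    return outstanding * line_bytes / latency_ns


if __name__ == "__main__":
    latency_ns = 100.0  # hypothetical round-trip latency, for illustration only
    bw = max_miss_bandwidth_gbps(outstanding=8, latency_ns=latency_ns)
    print(f"bound with 8 outstanding accesses at {latency_ns} ns: {bw:.2f} GB/s")
```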

Page 16: Bobcat hotchips final 8 2 10

Bobcat Pipeline

[Figure: pipeline stage diagram, cycles 0 to 12. Stage labels as they appear in the diagram:]

Fetch0 Fetch1 Fetch2 Fetch3 Fetch4 Fetch5

Dec0 Dec1 Dec2 Pack FDec Dispatch Schedule RegRead ALU Writeback

uCode ROM, MDec

Transit FpDec RegRen Schedule RegRead EXE EXE EXE

AGU DC1 DC2 Writeback

Transit L2Tag L2Data

Branch Mispredict Latency: 13 cycles

Load Use Latency, L1 hit: 3 cycles

Load Use Latency, L2 hit: 17 cycles
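The three latencies on this slide can be plugged into standard first-order performance formulas. The sketch below computes an average load-use latency and the CPI cost of branch mispredicts. Only the 3, 17, and 13 cycle figures come from the slide; the hit rates, memory latency, branch fraction, and mispredict rate are hypothetical inputs.

```python
L1_HIT_CYCLES = 3               # from the slide
L2_HIT_CYCLES = 17              # from the slide (total load-use latency on an L2 hit)
BRANCH_MISPREDICT_CYCLES = 13   # from the slide


def avg_load_latency(l1_hit_rate, l2_hit_rate, mem_cycles):
    """Average load-use latency in cycles for the given (assumed) hit rates."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    beyond_l1 = l2_hit_rate * L2_HIT_CYCLES + l2_miss * mem_cycles
    return l1_hit_rate * L1_HIT_CYCLES + l1_miss * beyond_l1


def mispredict_cpi(branch_fraction, mispredict_rate):
    """Cycles per instruction lost to branch mispredicts."""
    return branch_fraction * mispredict_rate * BRANCH_MISPREDICT_CYCLES


if __name__ == "__main__":
    # All inputs below are illustrative assumptions, not measured values.
    print(f"average load latency: {avg_load_latency(0.95, 0.80, 150):.2f} cycles")
    print(f"mispredict CPI cost:  {mispredict_cpi(0.20, 0.05):.2f}")
```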

Page 17: Bobcat hotchips final 8 2 10

Core Floor Plan

[Figure: core floor plan. Labeled regions: Inst TLB/Tag, Instruction Cache, Branch Predict, Ucode ROM, Test/Debug, X86 Decode, Floating Point Unit, Data L2 TLB, Bus Unit, L2 Sub Array, L2 TAG, ROB, Integer Unit, Load Store Unit, Data Tag/TLB, Data Cache]

Page 18: Bobcat hotchips final 8 2 10

Power Reduction

Use of physical Register files

Extensive use of non-shifting queues with pointers

Fine grain clock gating

Integrated Core Power Gating

Only needed arrays are clocked, e.g.:

– Dtag hit before Dcache read

– Predicting the type of branch, then clocking the appropriate predictor(s)

Elimination of instruction marker bits in the Icache

Finding the knee of the curve (scrutinize performance gains against power costs)

Polishing speed paths to raise the Vt mix and reduce leakage
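The "non-shifting queues with pointers" item above refers to a familiar structure: entries stay in the slot where they were written, and only head and tail pointers advance. The following is a software analogy under that reading, not a model of any specific Bobcat queue; the class name and interface are illustrative.

```python
class PointerQueue:
    """Fixed-size FIFO in which entries never move; only pointers advance."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # next slot to read
        self.tail = 0   # next slot to write
        self.count = 0

    def push(self, item):
        if self.count == len(self.slots):
            raise OverflowError("queue full")
        self.slots[self.tail] = item                    # single write, no shifting
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)   # pointer moves, data stays
        self.count -= 1
        return item


if __name__ == "__main__":
    q = PointerQueue(2)
    q.push("a")
    q.push("b")
    print(q.pop())   # oldest entry
    q.push("c")      # reuses the freed slot; "b" is not moved
    print(q.slots)
```

In a shifting queue every dequeue rewrites all remaining entries; here each entry is written once and read once, which is the data-movement saving the slide points to.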

Page 19: Bobcat hotchips final 8 2 10

Bobcat Core Overview

Advanced Micro-architecture:

Dual x86 Decode

Advanced Branch Predictor

Full OOO instruction execution

Full OOO load/store engine

High Performance Floating Point

AMD64 64-bit ISA

SSE1,2,3, SSSE3 ISA

Secure Virtualization

32KB L1s, 512KB L2

Low Power Design:

Power Optimized Execution

Micro-architecture that minimizes data movement and unnecessary reads

Clock gating, Power gating

System Low Power States

Small Core:

Area efficient balance of high performance and low power

[Figure: Bobcat Low Power Core block diagram. Labels: Fetch, Decode, ICACHE, DCACHE, Integer Scheduler, Address Scheduler, FP Scheduler, I Pipe (two), Load Pipe, Store Pipe, A Pipe, M Pipe, BU, L2]

Page 20: Bobcat hotchips final 8 2 10

Summary

Estimated 90% of the performance of today’s mainstream notebook CPU in half the area*

Sub-one watt capable

Highly portable across designs and manufacturing technologies

*Based on internal AMD modeling using benchmark simulations
