| Bobcat | Hot Chips 2010 1
“Bobcat”: AMD’s New Low Power x86 Core Architecture
Brad Burgess, AMD Fellow
Chief Architect / Bobcat Core
August 24, 2010
Two x86 Cores Tuned for Target Markets
“Bulldozer”: Performance & Scalability
Mainstream Client and Server Markets
“Bobcat”: Flexible, Low Power & Small
Low Power Markets
Small Die Area
Cloud Clients Optimized
Bobcat Design Goals
A small, efficient, low power x86 core
Excellent performance
Synthesizable with small number of custom arrays
Easily Portable across process technologies
Feature Set
64-bit AMD64 x86 ISA
SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A
Virtualization
Support for misaligned 128-bit data types
Instruction Based Sampling (for dynamic optimization)
C6 (with integrated power gating)
Micro-architecture Overview
Dual x86 instruction decode
Out-of-Order instruction execution
Dual COP retirement
Complex microOPs
State of the art branch prediction
Aggressive OOO load/store engine w/ hazard prediction
Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration
Low power C6 state w/ core level power gating and state save acceleration
Bobcat Micro-Architecture
[Block diagram. Labeled blocks: 32KB ICACHE, ITLB, Branch Predictor (Branch Locator, Condition Predict, Return Stack or Dynamic Target), Fetch Queue, Instr Queue, Dual x86 Decoder, uCode, Int Rename, ROB, two Schedulers, Int PRF, two ALUs, Mul, LAGU, SAGU, FP Decode, FP Rename, FP Sched, FP PRF, two MMX ALUs, two FP Logical units, FPAdd, FPMul, St Conv, IntMul, LdSt Unit, 32KB DCACHE, DTLB, Table Walker, Prefetch, BU, 512KB L2 CACHE, To/from Northbridge]
Bobcat Micro-Architecture
Icache:
32 Kbyte
2-way set associative
64-byte line
Parity Protected
512/8 entry ITLB (4k/2m)
Fetch up to 32 bytes/cycle
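The Icache figures above fix how many sets the cache has and how many address bits select a byte and a set. A minimal sketch of that arithmetic follows; the function name `cache_geometry` is hypothetical, and it assumes a conventional set-associative cache indexed by the low address bits, which the slides do not spell out.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

# Icache figures from the slide: 32 Kbyte, 2-way, 64-byte line
icache = cache_geometry(32 * 1024, 2, 64)
print(icache)
```

Under these assumptions the Icache has 256 sets, with 6 offset bits and 8 index bits. The same function can be applied to the data cache and L2 parameters given on their own slides.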
Bobcat Micro-Architecture
Branch Predictor:
Predicts up to two branches per cycle
Remembers branch instruction locations
Return Stack Address Predictor
Indirect Dynamic Address Predictor
State of the art condition predictor
Only necessary structures are clocked
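The return stack address predictor listed on the Branch Predictor slide can be illustrated with a toy model. This is a generic textbook return-address stack, not a description of Bobcat's actual structure; the class name, the overflow policy, and the depth of 8 are all assumptions made for the example.

```python
class ReturnStack:
    """Toy return-address stack: push on call, pop on return."""

    def __init__(self, depth=8):  # depth is an assumed value, not from the slides
        self.depth = depth
        self.entries = []

    def on_call(self, return_addr):
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: drop the oldest entry
        self.entries.append(return_addr)

    def predict_return(self):
        # An empty stack means no prediction is available
        return self.entries.pop() if self.entries else None

ras = ReturnStack()
ras.on_call(0x1004)  # outer call
ras.on_call(0x2008)  # nested call
inner = ras.predict_return()
outer = ras.predict_return()
```

Returns are predicted in last-in, first-out order, which matches how nested calls unwind.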
Bobcat Micro-Architecture
Dual x86 Decoder:
Scans up to 22 bytes
Decodes up to two x86 instructions per cycle
The decoder directly maps 89% of x86 instructions to a single microOp and a further 10% to a pair of microOps; more complicated x86 instructions (<1%) are microcoded. (Percentages are dynamic instruction counts.)
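The decode mix quoted on the Dual x86 Decoder slide gives a quick estimate of microOp expansion. The sketch below counts only the directly decoded instructions; the slides do not say how many microOps a microcoded instruction produces, so that term is left out and the result is a lower bound. Variable names are illustrative.

```python
# Dynamic instruction mix from the slide
single_fraction = 0.89  # decoded to one microOp
double_fraction = 0.10  # decoded to a pair of microOps

# MicroOps per x86 instruction contributed by directly decoded instructions.
# Microcoded instructions (<1%) add to this, by an amount the slides do not give.
fastpath_uops = single_fraction * 1 + double_fraction * 2
print(f"at least {fastpath_uops:.2f} microOps per x86 instruction")
```

The figure comes out at about 1.09, so the microOp stream is only slightly longer than the x86 instruction stream.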
Bobcat Micro-Architecture
Integer Execution:
A dual port integer scheduler feeds two ALUs
A dual port address scheduler feeds a load address unit and a store address unit
The physical register file uses maps and pointers to reduce power by minimizing data copying and movement
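The point about maps and pointers on the Integer Execution slide can be sketched with a toy rename map. This is generic physical-register-file renaming, not AMD's implementation; the class and method names are hypothetical. Each result is written to the register file once, and later bookkeeping changes only an index.

```python
class RenameMap:
    """Toy register renaming over a physical register file (PRF)."""

    def __init__(self, arch_regs, phys_regs):
        self.prf = [0] * phys_regs  # result data lives here, written once
        self.map = {r: i for i, r in enumerate(arch_regs)}  # arch name -> PRF index
        self.free = list(range(len(arch_regs), phys_regs))

    def write(self, arch_reg, value):
        new = self.free.pop(0)    # allocate a fresh physical register
        self.prf[new] = value     # the only data write for this result
        old = self.map[arch_reg]
        self.map[arch_reg] = new  # update the pointer, no data copy
        return old                # caller frees this once the write retires

    def read(self, arch_reg):
        return self.prf[self.map[arch_reg]]

    def release(self, phys):
        self.free.append(phys)

rm = RenameMap(["rax", "rbx"], phys_regs=4)
old = rm.write("rax", 42)
rm.release(old)
```

Retiring the write to `rax` returns index 0 to the free list; the value 42 never moves from the slot it was first written to.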
Bobcat Micro-Architecture
Floating Point Unit:
A centralized FP scheduler feeds two 64-bit FP execution stacks
MMX and Logical units are replicated in both stacks
The FP Mul Unit can perform two SP multiplies per cycle
The FP Add Unit can perform two SP additions per cycle
A physical register file is used to reduce power
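The Floating Point Unit figures translate into a peak single-precision rate per cycle. A rough sketch follows; the clock frequency is a caller-supplied assumption because the slides give none, and the two-pass handling of 128-bit operations is an inference from the 64-bit stack width rather than something the slide states.

```python
PIPE_WIDTH_BITS = 64   # each FP execution stack, from the slide
SP_BITS = 32
SP_MULS_PER_CYCLE = 2  # from the slide
SP_ADDS_PER_CYCLE = 2  # from the slide

# Two 32-bit lanes fit in a 64-bit stack, consistent with two SP ops per unit
sp_lanes_per_pipe = PIPE_WIDTH_BITS // SP_BITS

peak_sp_flops_per_cycle = SP_MULS_PER_CYCLE + SP_ADDS_PER_CYCLE

# A 128-bit packed operation would need two passes through a 64-bit stack
# (inference, not stated on the slide)
passes_per_128bit_op = 128 // PIPE_WIDTH_BITS

def peak_gflops(clock_ghz):
    """Peak SP GFLOPS at a given clock; the clock is an assumption."""
    return peak_sp_flops_per_cycle * clock_ghz
```

For example, `peak_gflops(1.0)` gives 4.0 for a hypothetical 1 GHz clock.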
Bobcat Micro-Architecture
Data Cache:
32 Kbyte
8-way set associative
64-byte line
Parity Protected
Copyback
40/8 entry L1 DTLB (4k/2m)
512/64 entry L2 DTLB (4k/2m)
Advanced 8-stream prefetcher
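The DTLB entry counts on the Data Cache slide determine how much memory can be mapped without a table walk. The sketch below computes that reach for each level and page size; the function name `tlb_reach` is illustrative, and "4k/2m" is read as 4 KB and 2 MB pages.

```python
KB, MB = 1024, 1024 * 1024

# (entries for 4k pages, entries for 2m pages), from the slide
l1_dtlb = (40, 8)
l2_dtlb = (512, 64)

def tlb_reach(entries_4k, entries_2m):
    """Bytes of address space covered by the entries for each page size."""
    return entries_4k * 4 * KB, entries_2m * 2 * MB

l1_reach = tlb_reach(*l1_dtlb)
l2_reach = tlb_reach(*l2_dtlb)
print(l1_reach[0] // KB, "KB and", l1_reach[1] // MB, "MB")
print(l2_reach[0] // MB, "MB and", l2_reach[1] // MB, "MB")
```

With 4 KB pages the L1 DTLB covers 160 KB and the L2 DTLB covers 2 MB; with 2 MB pages the figures are 16 MB and 128 MB.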
Bobcat Micro-Architecture
Out-of-Order Load Store Unit:
Loads bypassing loads
Loads bypassing stores
Stores bypassing loads
Bypass tracking and dependency correction
Hazard predictor
Fast store forwarding
Fast critical word fill forwarding
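Store forwarding, listed on the Load Store Unit slide, can be illustrated with a toy store queue. This is a generic model, not Bobcat's mechanism: it assumes same-size aligned accesses and that every older store address is already known, which is exactly the case a hazard predictor exists to handle when it does not hold.

```python
def load(addr, store_queue, memory):
    """Return (value, forwarded) for a load.

    store_queue holds older stores as (addr, value), ordered oldest to youngest.
    """
    for st_addr, st_value in reversed(store_queue):
        if st_addr == addr:
            return st_value, True  # forwarded from the youngest matching store
    return memory.get(addr, 0), False  # no match: read the cache/memory model

memory = {0x100: 1}
store_queue = [(0x100, 7), (0x200, 9), (0x100, 8)]
result = load(0x100, store_queue, memory)
```

Here the load from 0x100 receives 8 from the youngest matching store rather than the stale 1 held in memory.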
Bobcat Micro-Architecture
L2 Cache:
512 Kbyte
16-way set associative
64-byte lines
ECC Protected
Half speed clocking for power reduction
Bobcat Micro-Architecture
Bus Unit:
8 outstanding data accesses
2 outstanding fetch accesses
Eviction Buffers
Fill Buffers
Write combining buffers
Coherency management
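The limit of 8 outstanding data accesses bounds how much miss traffic the core can sustain, by Little's law: bytes in flight divided by latency. A small sketch follows; the 100-cycle latency is a hypothetical value chosen only for illustration, since the slides give no memory latency.

```python
LINE_BYTES = 64       # line size from the cache slides
OUTSTANDING_DATA = 8  # from the Bus Unit slide

def max_bytes_per_cycle(latency_cycles, outstanding=OUTSTANDING_DATA):
    """Upper bound on sustained miss bandwidth for a given miss latency."""
    return outstanding * LINE_BYTES / latency_cycles

# 100 cycles is a hypothetical memory latency, not a figure from the slides
bound = max_bytes_per_cycle(100)
print(bound)
```

Halving the latency doubles the bound, which is why the count of outstanding accesses matters most when memory is slow.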
Bobcat Pipeline
[Pipeline diagram, cycle axis 0 to 12]
Fetch: Fetch0, Fetch1, Fetch2, Fetch3, Fetch4, Fetch5
Decode: Dec0, Dec1, Dec2, Pack, FDec, Dispatch (with uCode ROM and MDec)
Integer: Schedule, RegRead, ALU, Writeback
Load: AGU, DC1, DC2, Writeback
Floating point: Transit, FpDec, RegRen, Schedule, RegRead, EXE stages
L2: Transit, L2Tag, L2Data
Branch Mispredict Latency: 13 cycles
Load Use Latency, L1 hit: 3 cycles
Load Use Latency, L2 hit: 17 cycles
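The three latencies on the pipeline slide can be plugged into simple first-order cost estimates. The sketch below is illustrative only: the workload numbers (branch fraction, mispredict rate, L1 hit rate) are hypothetical, and the load model assumes every L1 miss hits in the L2.

```python
MISPREDICT_CYCLES = 13  # from the slide
L1_HIT_CYCLES = 3       # from the slide
L2_HIT_CYCLES = 17      # from the slide

def mispredict_cpi(branch_fraction, mispredict_rate):
    """Cycles per instruction lost to mispredicted branches."""
    return branch_fraction * mispredict_rate * MISPREDICT_CYCLES

def avg_load_use(l1_hit_rate):
    """Average load-use latency, assuming every L1 miss hits in the L2."""
    return l1_hit_rate * L1_HIT_CYCLES + (1 - l1_hit_rate) * L2_HIT_CYCLES

# Hypothetical workload: 20% branches, 5% mispredicted, 95% L1 hit rate
cpi_loss = mispredict_cpi(0.20, 0.05)
load_latency = avg_load_use(0.95)
```

With these made-up inputs the mispredict cost is about 0.13 cycles per instruction and the average load-use latency is about 3.7 cycles.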
Core Floor Plan
[Floor plan figure. Labeled regions: Inst TLB/Tag, Instruction Cache, Branch Predict, Ucode ROM, Test/Debug, X86 Decode, Floating Point Unit, Data L2 TLB, Bus Unit, L2 Sub Array, L2 TAG, ROB, Integer Unit, Load Store Unit, Data Tag/TLB, Data Cache]
Power Reduction
Use of physical register files
Extensive use of non-shifting queues with pointers
Fine grain clock gating
Integrated Core Power Gating
Only needed arrays are clocked, for example:
– Dtag hit is determined before the Dcache is read
– The type of branch is predicted first, then only the appropriate predictor(s) are clocked
Elimination of instruction marker bits in the Icache
Finding the knee of the curve (scrutinize performance gains against power costs)
Polishing speed paths to raise the Vt mix and reduce leakage
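The "Dtag hit before Dcache read" item can be made concrete by counting data-array reads. The sketch compares a tag-first lookup with a parallel lookup that reads every way; the parallel design is an assumed baseline for comparison, not something the slides describe, and the access count and hit rate are hypothetical.

```python
WAYS = 8  # data cache associativity, from the Data Cache slide

def data_way_reads(accesses, hit_rate, tag_first):
    """Count data-array way reads for a number of cache accesses."""
    if tag_first:
        # Tag lookup finishes first: one way is read on a hit, none on a miss
        return accesses * hit_rate
    # Parallel lookup reads every way and discards all but one
    return accesses * WAYS

# Hypothetical workload: 1000 accesses at a 95% hit rate
parallel = data_way_reads(1000, 0.95, tag_first=False)
serial = data_way_reads(1000, 0.95, tag_first=True)
```

Under these assumptions the tag-first scheme performs about 950 way reads where the parallel baseline performs 8000.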
Bobcat Core Overview
Advanced Micro-architecture:
Dual x86 Decode
Advanced Branch Predictor
Full OOO instruction execution
Full OOO load/store engine
High Performance Floating Point
AMD64 64-bit ISA
SSE1,2,3, SSSE3 ISA
Secure Virtualization
32KB L1s, 512KB L2
Low Power Design:
Power Optimized Execution
Micro-architecture that minimizes data movement and unnecessary reads
Clock gating, Power gating
System Low Power States
Small Core:
Area efficient balance of high performance and low power
[Simplified core diagram, "Bobcat Low Power Core". Labeled blocks: Fetch, ICACHE, Decode, Integer Scheduler, Address Scheduler, FP Scheduler, two I Pipes, Load Pipe, Store Pipe, A Pipe, M Pipe, DCACHE, BU, L2]
Summary
Estimated 90% of the performance of today’s mainstream notebook CPU in half the area*
Sub-one watt capable
Highly portable across designs and manufacturing technologies
*Based on internal AMD modeling using benchmark simulations