Intel® PDX, Copyright © 2002 Intel Corporation.
The Intel® Pentium® 4 Processor
Doug Carmean, Principal Architect, Intel Architecture Group
Spring 2002
Agenda

- Review
- Pipeline Depth
- Execution Trace Cache
- Data Speculation
- Spec Performance
- Summary
Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design.
Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries.
Copyright © 2001 Intel Corporation.
Intel® Netburst™ Micro-architecture vs P6
Basic P6 Pipeline (10 stages):
1-2 Fetch | 3-5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic Pentium® 4 Processor Pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

Deeper pipelines enable higher frequency and performance.
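The frequency claim can be illustrated with a toy timing model (the model and all numbers here are illustrative, not Intel data): the logic delay of one instruction is split across more stages, while each stage pays a fixed latch overhead, so cycle time shrinks and frequency rises as the pipeline deepens.

```python
# Toy model (illustrative numbers, not Intel data): splitting a fixed
# logic delay across more pipeline stages shortens the per-stage
# critical path, so the clock can run faster; per-stage latch overhead
# caps the benefit.

def frequency_ghz(depth, total_logic_ns=10.0, latch_overhead_ns=0.05):
    """Achievable clock frequency for a pipeline of `depth` stages."""
    cycle_time_ns = total_logic_ns / depth + latch_overhead_ns
    return 1.0 / cycle_time_ns

for depth in (5, 10, 20):  # P5-, P6-, and Netburst-like depths
    print(f"{depth:2d} stages -> {frequency_ghz(depth):.2f} GHz")
```

Under this model, doubling the depth roughly doubles the achievable frequency until the latch overhead dominates, which is why the slide pairs deeper pipelines with higher frequency.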
[Chart: frequency vs. time from introduction]
- P5 micro-architecture (5 pipeline stages): introduced at 60 MHz, scaled to 233 MHz
- P6 micro-architecture (10 stages): introduced at 166 MHz, scaled to 1.2 GHz
- Netburst micro-architecture (20 stages, Hyper Pipelined Technology): introduced at 1.5 GHz, then 1.8 GHz, and 2.2 GHz today
Deeper Pipelines are Better

[Chart: performance improvement (0% to 120%) vs. pipeline depth (10 to 30 stages), one curve per cache size: 256 KB, 512 KB, 1 MB, 2 MB]

Source: average of 2000 application segments from performance simulations.
Why not deeper pipelines?

Increases complexity:
- Harder to balance
- More challenges to architect around
- More algorithms
- Greater validation effort
- Need to pipeline the wires

Overall engineering effort increases quickly as pipeline depth increases.
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency

High Bandwidth Front End
Higher frequency increases requirements of the front end

Branch prediction is more important:
- So we improved it

Need greater uop bandwidth:
- Branches constantly change the flow
- Need to decode more instructions in parallel
Block Diagram

[Diagram: Pentium 4 processor block diagram]
- System bus: quad pumped, 64-bit wide, 3.2 GB/s, via the Bus Interface Unit
- L2 Cache: 256-bit wide path to the core, 64 GB/s
- Front end: Instruction TLB, Dynamic BTB (4K entries), Instruction Decoder, Execution Trace Cache (12K µops), Trace Cache BTB (512 entries), Microinstruction Sequencer
- Out-of-order machinery: Allocator / Register Renamer; Memory µop queue; Integer/Floating Point µop queue; Memory, Slow Int, 2x Fast Int, FP Move, and FP General schedulers
- Execution: Integer Register File / Bypass Network; FP Register / Bypass; 2x AGU (load/store address unit); 2x ALUs for simple instructions; Slow ALU for complex instructions; FP Move; FMul/FAdd units
- L1 Data Cache: 8 KB, 4-way

Highlighted: Execution Trace Cache (12K uops)
Execution Trace Cache

Program order in memory (taken branches skip unused code):

  1 cmp
  2 br -> T1
  ... (unused code)
  T1: 3 sub
  4 br -> T2
  ... (unused code)
  T2: 5 mov
  6 sub
  7 br -> T3
  ... (unused code)
  T3: 8 add
  9 sub
  10 mul
  11 cmp
  12 br -> T4

Trace Cache delivery (taken-branch paths stitched into sequential lines):

  1 cmp    2 br T1    3 T1: sub
  4 br T2  5 mov      6 sub
  7 br T3  8 T3: add  9 sub
  10 mul   11 cmp     12 br T4
Execution Trace Cache

P6 microarchitecture delivery:

  1 cmp      2 br T1
  3 T1: sub  4 br T2
  5 mov      6 sub      7 br T3
  8 T3: add  9 sub      10 mul
  11 cmp     12 br T4

  BW = 1.5 uops/ns

Trace Cache delivery:

  1 cmp    2 br T1    3 T1: sub
  4 br T2  5 mov      6 sub
  7 br T3  8 T3: add  9 sub
  10 mul   11 cmp     12 br T4

  BW = 6 uops/ns
Inside the Execution Trace Cache

[Diagram: the instruction pointer (e.g. 0x0900) indexes a 4-way set-associative array (Way 0 to Way 3, Set 0 to Set 4); a trace is a linked sequence of lines: head, body 1, body 2, tail]

Example trace, six uops per line:
  head:   cmp, br T1, T1: sub, br T2, mov, sub
  body 1: br T3, T3: add, sub, mul, cmp, br T4
  body 2: T4: add, sub, mov, add, add, mov
  tail:   add, sub, mov, add, add, mov
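The head/body/tail packing above can be mimicked with a small sketch (the function and constant names are mine, and this is not the real array format): a decoded uop stream, with taken branches already followed inline, is chopped into six-uop trace lines.

```python
# Illustrative sketch (not the real hardware format): pack a dynamic
# uop sequence, with taken branches already followed inline, into
# six-uop trace-cache lines like the head / body lines shown above.

TRACE_LINE_UOPS = 6

def build_trace_lines(uops):
    """Split a dynamic uop sequence into fixed-size trace lines."""
    return [uops[i:i + TRACE_LINE_UOPS]
            for i in range(0, len(uops), TRACE_LINE_UOPS)]

uops = ["cmp", "br T1", "T1: sub", "br T2", "mov", "sub",
        "br T3", "T3: add", "sub", "mul", "cmp", "br T4"]
head, body1 = build_trace_lines(uops)
print(head)   # first trace line spans two taken branches
print(body1)
```

The key property this models is that one trace line can deliver uops from both sides of a taken branch in a single fetch, which is where the bandwidth advantage over a decode-limited front end comes from.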
Self Modifying Code

- Programs that modify the instruction stream that is being executed
- Very common in Java* code from JITs
- Requires hardware mechanisms to maintain consistency
*Other names and brands may be claimed as the property of others.
Self Modifying Code

The hardware needs to handle two basic cases:
- Stores that write to instructions in the Trace Cache
- Instruction fetches that hit pending stores
  - Speculative
  - Committed
Case 1: Stores to cached instructions

[Diagram: the store's physical address from the execution core, translated by the data TLB, is checked against the instruction TLB (128 entries); "in use" bits on its entries mark pages whose instructions are held in the Trace Cache]
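The case-1 mechanism can be sketched roughly as follows (the class and method names are mine, and the page-granularity policy is an assumption drawn from the "in use" bits in the diagram): pages holding trace-cached instructions are marked, and a store that hits a marked page forces the stale traces to be flushed.

```python
# Rough sketch of case 1 (names and page granularity are assumptions):
# pages whose instructions are held in the trace cache are marked with
# an "in use" bit; a store whose physical address lands on a marked
# page triggers a trace-cache flush so stale uops are never executed.

PAGE_SHIFT = 12  # assume 4 KB pages

class TraceCacheSMC:
    def __init__(self):
        self.in_use_pages = set()  # pages with trace-cached instructions
        self.flushed = False

    def on_instructions_cached(self, phys_addr):
        self.in_use_pages.add(phys_addr >> PAGE_SHIFT)

    def on_store(self, phys_addr):
        """Return True if the store forced a trace-cache flush."""
        if (phys_addr >> PAGE_SHIFT) in self.in_use_pages:
            self.flushed = True          # self-modifying code detected
            self.in_use_pages.clear()    # cached traces are now gone
            return True
        return False

smc = TraceCacheSMC()
smc.on_instructions_cached(0x0900)
print(smc.on_store(0x0904))  # same page -> flush
print(smc.on_store(0x5000))  # untracked page -> no flush
```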
Case 2: Fetches to pending stores

[Diagram: the instruction pointer's address, translated by the instruction TLB (128 entries), is checked against pending stores. A hit on a speculative store in the store buffer requests a re-fetch; a hit on a committed store in the write-combining buffer requests a pipeline flush and a re-fetch]
Execution Trace Cache

- Provides higher bandwidth for the higher frequency core
- Reduces fetch latency
- Requires fundamentally new algorithms
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency

Low Latency Core
Data Speculation

Use data before we are sure it is valid:
- Lowers effective load latency
- Fast ALUs in the Pentium 4 want fast loads
- The ratio of load latency to add latency is important if 1 in 5 uops is a load

As pipelines get deeper, data speculation gets more important:
- The number of cycles saved with data speculation increases as pipeline depth increases
L1 Data Cache

[Diagram: the processor block diagram from earlier, with the L1 Data Cache (8 KB, 4-way) highlighted]
L1 Cache is >3x Faster

P6:
- 3 clocks @ 1 GHz = 3 ns

P4P:
- 2 clocks @ 2 GHz = 1 ns

Lower latency is higher performance.
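The arithmetic behind the comparison is just clock count divided by clock rate; a one-liner makes the point that the absolute latency can drop even as the clock runs faster:

```python
# Wall-clock cache latency is the clock count divided by the clock rate.

def latency_ns(clocks, freq_ghz):
    return clocks / freq_ghz

print(latency_ns(3, 1.0))  # P6:  3 clocks @ 1 GHz -> 3.0 ns
print(latency_ns(2, 2.0))  # P4P: 2 clocks @ 2 GHz -> 1.0 ns
```

Even though the Pentium 4 pipeline is deeper, the L1 load-use latency shrinks in absolute time because the clock count drops while the frequency doubles.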
L1 Data Cache

[Diagram: L1 data cache pipeline stages at the 2x clock. The low virtual address bits (VA 15:0) start the data array access early and probe the fast store-forwarding buffer; the upper bits (VA 31:16) complete the full address (VA 31:0), which is checked later against the tag array, TLB, and slow store buffer, with a replay issued on a mismatch]
L1 Data Cache

[Diagram: the same pipeline with the way predictor shown. Virtual address bits 15:11 and 10:6 index the way predictor (tag array) to produce an early way select for the data array; the remaining bits (19:16) are checked later, and a wrong way prediction is handled as a replay on the hit signal]
A Digression on Stores

Two components to a store:
- STA: address computation
- STD: data piece

Hybrid uOP:
- Single uOP in the front and back ends
- Two uOPs in the middle
Memory Disambiguation

  store: STA AddrB, STD DataB
  Ld EAX <- AddrA

If the store is older:
- And AddrA = AddrB
- Then the load must get DataB

Dependencies cannot be resolved until execution.
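The forwarding condition above can be written out directly (a sketch of the rule, not of the hardware): among the stores older than the load, the youngest one with a matching address supplies the data; otherwise the load reads memory.

```python
# Sketch of the disambiguation rule above (not the hardware): the load
# must take its data from the youngest older store to the same address;
# this can only be checked once both addresses are known at execution.

def load_value(load_addr, load_age, stores, memory):
    """stores: list of (age, addr, data); a larger age means younger."""
    match = None
    for age, addr, data in stores:
        if age < load_age and addr == load_addr:  # older, same address
            if match is None or age > match[0]:   # keep the youngest
                match = (age, data)
    return match[1] if match else memory[load_addr]

memory = {0xA0: "old value"}
stores = [(1, 0xA0, "DataB")]               # STA AddrB / STD DataB
print(load_value(0xA0, 2, stores, memory))  # older store matches: DataB
print(load_value(0xA0, 0, stores, memory))  # store is younger: old value
```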
Memory Disambiguation

Option 1 (K7?):
- Loads wait for all older STAs AND all older STDs
- No recovery

Option 2 (P4):
- Loads wait for all older STAs
- STD recovery

Option 3 (EV8):
- A predictor picks the specific older STAs and specific older STDs to wait for
- Complex recovery
Example

[Dataflow graph: LD1 feeds an XOR, whose result is stored (STA, STD); LD2 reloads the value and feeds a branch (JMP if ML = 0); ML changes from 0 to 1]

- The store writes a new value
- LD2 forwards from the store
- The branch resolves based on the new value
Example

[Same dataflow graph: LD1 -> XOR -> STA, STD -> LD2 -> BR (JMP if ML = 0)]

The mis-speculation case:
- LD1 misses, replays
- STA is dependent, replays
- LD2 hits, gets the old value
- BR mispredicts
Example

- If LD2 depends on the STA, they are usually part of the same dataflow graph
- If the STA replays, the LD usually has an address dependence
Cautious Mode

- Normally, aggressively schedule
- If a large number of problems occur, enter Cautious Mode
- In Cautious Mode, branches wait for data to be non-speculative
  - Increases branch misprediction latency
  - Completely eliminates the problems
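A state machine of this flavor might look like the following sketch (the counter, threshold, and decay policy are invented for illustration; the slide only says "a large number of problems"):

```python
# Illustrative sketch of a cautious-mode state machine (the threshold
# and decay policy are invented): count branch mispredictions caused
# by speculative data; past a threshold, make branches wait for
# non-speculative data; decay back to aggressive scheduling.

class CautiousMode:
    ENTER_THRESHOLD = 4

    def __init__(self):
        self.problem_count = 0
        self.cautious = False

    def on_branch_resolved(self, mispredicted_on_speculative_data):
        if mispredicted_on_speculative_data:
            self.problem_count += 1
            if self.problem_count >= self.ENTER_THRESHOLD:
                self.cautious = True   # branches wait for valid data
        elif self.problem_count > 0:
            self.problem_count -= 1
            if self.problem_count == 0:
                self.cautious = False  # back to aggressive scheduling

fsm = CautiousMode()
for _ in range(4):
    fsm.on_branch_resolved(True)
print(fsm.cautious)  # too many data-speculation mispredicts
```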
Cautious Mode: Implementation

A simple state machine cleans up the outliers:
- Out of 2200 traces, 3 traces speed up by more than 20%
- The other traces are unaffected
- Average performance improvement < 0.1%
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency

Lower Memory Latency
Reducing Latency

As frequency increases, it is important to improve the performance of the memory subsystem.

Data Prefetch Logic:
- Watches processor memory traffic
- Looks for patterns
- Initiates accesses
Data Prefetch Logic

[Diagram: the processor block diagram from earlier, with the Data Prefetch Logic highlighted]
Data Prefetch Logic

[Diagram: the Data Prefetch Logic sits beside the L2 Advanced Transfer Cache, between the L1 data cache / instruction fetch paths (64-bit and 256-bit) and the bus queue]

Prefetch logic first checks the L2 cache and then fetches lines from memory that miss the L2 cache.
Data Prefetch Logic

Watches for streaming memory access patterns:
- Can track 8 independent streams
- Loads, stores, or instructions
- Forward or backward

Analysis is done at 32-byte cache line granularity.

Looks for "mostly" complete streams:
- Access to cache lines 1, 2, 3, 4, 5, 6 will prefetch
- Access to cache lines 1, 2, 4, 5, 6 will prefetch
- 1, _, 3, _, _, 6, _, _, 9 will not prefetch
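The "mostly complete" rule can be sketched as a simple density test (the exact hardware heuristic is not published; the one-missing-line tolerance below is an assumption chosen to reproduce the three examples above):

```python
# Sketch of the "mostly complete" stream test; the gap tolerance is an
# assumption tuned to reproduce the slide's three examples.

def is_prefetch_stream(lines, max_gaps=1):
    """lines: ascending cache-line numbers seen for a candidate stream."""
    if len(lines) < 2:
        return False
    span = lines[-1] - lines[0] + 1   # lines the stream should cover
    missing = span - len(lines)       # holes in the access pattern
    return missing <= max_gaps

print(is_prefetch_stream([1, 2, 3, 4, 5, 6]))  # True: complete
print(is_prefetch_stream([1, 2, 4, 5, 6]))     # True: one gap
print(is_prefetch_stream([1, 3, 6, 9]))        # False: too sparse
```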
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency