Intel® PDX, Copyright © 2002 Intel Corporation.
The Intel® Pentium® 4 Processor
Doug Carmean, Principal Architect, Intel Architecture Group
Spring 2002
Agenda

- Review
- Pipeline Depth
- Execution Trace Cache
- Data Speculation
- Spec Performance
- Summary
Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design.
Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries.
Copyright © 2001 Intel Corporation.
Intel® Netburst™ Micro-architecture vs P6
Basic P6 Pipeline (10 stages):
1-2 Fetch | 3-5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic Pentium® 4 Processor Pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

Deeper pipelines enable higher frequency and performance.
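The frequency claim can be illustrated with a toy timing model (the model and all numbers here are illustrative, not Intel data): the logic delay of one instruction is split across more stages, while each stage pays a fixed latch overhead, so cycle time shrinks and frequency rises as the pipeline deepens.

```python
# Toy model (illustrative numbers, not Intel data): splitting a fixed
# logic delay across more pipeline stages shortens the per-stage
# critical path, so the clock can run faster; per-stage latch overhead
# caps the benefit.

def frequency_ghz(depth, total_logic_ns=10.0, latch_overhead_ns=0.05):
    """Achievable clock frequency for a pipeline of `depth` stages."""
    cycle_time_ns = total_logic_ns / depth + latch_overhead_ns
    return 1.0 / cycle_time_ns

for depth in (5, 10, 20):  # P5-, P6-, and Netburst-like depths
    print(f"{depth:2d} stages -> {frequency_ghz(depth):.2f} GHz")
```

Under this model, doubling the depth roughly doubles the achievable frequency until the latch overhead dominates, which is why the slide pairs deeper pipelines with higher frequency.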
[Chart: frequency vs. time from introduction]
- P5 micro-architecture (5 pipeline stages): introduced at 60 MHz, scaled to 233 MHz
- P6 micro-architecture (10 stages): introduced at 166 MHz, scaled to 1.2 GHz
- Netburst micro-architecture (20 stages, Hyper Pipelined Technology): introduced at 1.5 GHz, then 1.8 GHz, and 2.2 GHz today
Deeper Pipelines are Better

[Chart: performance improvement (0% to 120%) vs. pipeline depth (10 to 30 stages), one curve per cache size: 256 KB, 512 KB, 1 MB, 2 MB]

Source: average of 2000 application segments from performance simulations.
Why not deeper pipelines?

Increases complexity:
- Harder to balance
- More challenges to architect around
- More algorithms
- Greater validation effort
- Need to pipeline the wires

Overall engineering effort increases quickly as pipeline depth increases.
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency

High Bandwidth Front End
Higher frequency increases requirements of the front end

Branch prediction is more important:
- So we improved it

Need greater uop bandwidth:
- Branches constantly change the flow
- Need to decode more instructions in parallel
Block Diagram

[Diagram: Pentium 4 processor block diagram]
- System bus: quad pumped, 64-bit wide, 3.2 GB/s, via the Bus Interface Unit
- L2 Cache: 256-bit wide path to the core, 64 GB/s
- Front end: Instruction TLB, Dynamic BTB (4K entries), Instruction Decoder, Execution Trace Cache (12K µops), Trace Cache BTB (512 entries), Microinstruction Sequencer
- Out-of-order machinery: Allocator / Register Renamer; Memory µop queue; Integer/Floating Point µop queue; Memory, Slow Int, 2x Fast Int, FP Move, and FP General schedulers
- Execution: Integer Register File / Bypass Network; FP Register / Bypass; 2x AGU (load/store address unit); 2x ALUs for simple instructions; Slow ALU for complex instructions; FP Move; FMul/FAdd units
- L1 Data Cache: 8 KB, 4-way

Highlighted: Execution Trace Cache (12K uops)
Execution Trace Cache

Program order in memory (taken branches skip unused code):

  1 cmp
  2 br -> T1
  ... (unused code)
  T1: 3 sub
  4 br -> T2
  ... (unused code)
  T2: 5 mov
  6 sub
  7 br -> T3
  ... (unused code)
  T3: 8 add
  9 sub
  10 mul
  11 cmp
  12 br -> T4

Trace Cache delivery (taken-branch paths stitched into sequential lines):

  1 cmp    2 br T1    3 T1: sub
  4 br T2  5 mov      6 sub
  7 br T3  8 T3: add  9 sub
  10 mul   11 cmp     12 br T4
Execution Trace Cache

P6 microarchitecture delivery:

  1 cmp      2 br T1
  3 T1: sub  4 br T2
  5 mov      6 sub      7 br T3
  8 T3: add  9 sub      10 mul
  11 cmp     12 br T4

  BW = 1.5 uops/ns

Trace Cache delivery:

  1 cmp    2 br T1    3 T1: sub
  4 br T2  5 mov      6 sub
  7 br T3  8 T3: add  9 sub
  10 mul   11 cmp     12 br T4

  BW = 6 uops/ns
Inside the Execution Trace Cache

[Diagram: the instruction pointer (e.g. 0x0900) indexes a 4-way set-associative array (Way 0 to Way 3, Set 0 to Set 4); a trace is a linked sequence of lines: head, body 1, body 2, tail]

Example trace, six uops per line:
  head:   cmp, br T1, T1: sub, br T2, mov, sub
  body 1: br T3, T3: add, sub, mul, cmp, br T4
  body 2: T4: add, sub, mov, add, add, mov
  tail:   add, sub, mov, add, add, mov
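The head/body/tail packing above can be mimicked with a small sketch (the function and constant names are mine, and this is not the real array format): a decoded uop stream, with taken branches already followed inline, is chopped into six-uop trace lines.

```python
# Illustrative sketch (not the real hardware format): pack a dynamic
# uop sequence, with taken branches already followed inline, into
# six-uop trace-cache lines like the head / body lines shown above.

TRACE_LINE_UOPS = 6

def build_trace_lines(uops):
    """Split a dynamic uop sequence into fixed-size trace lines."""
    return [uops[i:i + TRACE_LINE_UOPS]
            for i in range(0, len(uops), TRACE_LINE_UOPS)]

uops = ["cmp", "br T1", "T1: sub", "br T2", "mov", "sub",
        "br T3", "T3: add", "sub", "mul", "cmp", "br T4"]
head, body1 = build_trace_lines(uops)
print(head)   # first trace line spans two taken branches
print(body1)
```

The key property this models is that one trace line can deliver uops from both sides of a taken branch in a single fetch, which is where the bandwidth advantage over a decode-limited front end comes from.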
Self Modifying Code

- Programs that modify the instruction stream that is being executed
- Very common in Java* code from JITs
- Requires hardware mechanisms to maintain consistency
*Other names and brands may be claimed as the property of others.
Self Modifying Code

The hardware needs to handle two basic cases:
- Stores that write to instructions in the Trace Cache
- Instruction fetches that hit pending stores
  - Speculative
  - Committed
Case 1: Stores to cached instructions

[Diagram: the store's physical address from the execution core, translated by the data TLB, is checked against the instruction TLB (128 entries); "in use" bits on its entries mark pages whose instructions are held in the Trace Cache]
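The case-1 mechanism can be sketched roughly as follows (the class and method names are mine, and the page-granularity policy is an assumption drawn from the "in use" bits in the diagram): pages holding trace-cached instructions are marked, and a store that hits a marked page forces the stale traces to be flushed.

```python
# Rough sketch of case 1 (names and page granularity are assumptions):
# pages whose instructions are held in the trace cache are marked with
# an "in use" bit; a store whose physical address lands on a marked
# page triggers a trace-cache flush so stale uops are never executed.

PAGE_SHIFT = 12  # assume 4 KB pages

class TraceCacheSMC:
    def __init__(self):
        self.in_use_pages = set()  # pages with trace-cached instructions
        self.flushed = False

    def on_instructions_cached(self, phys_addr):
        self.in_use_pages.add(phys_addr >> PAGE_SHIFT)

    def on_store(self, phys_addr):
        """Return True if the store forced a trace-cache flush."""
        if (phys_addr >> PAGE_SHIFT) in self.in_use_pages:
            self.flushed = True          # self-modifying code detected
            self.in_use_pages.clear()    # cached traces are now gone
            return True
        return False

smc = TraceCacheSMC()
smc.on_instructions_cached(0x0900)
print(smc.on_store(0x0904))  # same page -> flush
print(smc.on_store(0x5000))  # untracked page -> no flush
```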
Case 2: Fetches to pending stores

[Diagram: the instruction pointer's address, translated by the instruction TLB (128 entries), is checked against pending stores. A hit on a speculative store in the store buffer requests a re-fetch; a hit on a committed store in the write-combining buffer requests a pipeline flush and a re-fetch]
Execution Trace Cache

- Provides higher bandwidth for the higher frequency core
- Reduces fetch latency
- Requires fundamentally new algorithms
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency

Low Latency Core
Data Speculation

Use data before we are sure it is valid:
- Lowers effective load latency
- Fast ALUs in the Pentium 4 want fast loads
- The ratio of load latency to add latency is important if 1 in 5 uops is a load

As pipelines get deeper, data speculation gets more important:
- The number of cycles saved with data speculation increases as pipeline depth increases
L1 Data Cache

[Diagram: the processor block diagram from earlier, with the L1 Data Cache (8 KB, 4-way) highlighted]
L1 Cache is >3x Faster

P6:
- 3 clocks @ 1 GHz = 3 ns

P4P:
- 2 clocks @ 2 GHz = 1 ns

Lower latency is higher performance.
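The arithmetic behind the comparison is just clock count divided by clock rate; a one-liner makes the point that the absolute latency can drop even as the clock runs faster:

```python
# Wall-clock cache latency is the clock count divided by the clock rate.

def latency_ns(clocks, freq_ghz):
    return clocks / freq_ghz

print(latency_ns(3, 1.0))  # P6:  3 clocks @ 1 GHz -> 3.0 ns
print(latency_ns(2, 2.0))  # P4P: 2 clocks @ 2 GHz -> 1.0 ns
```

Even though the Pentium 4 pipeline is deeper, the L1 load-use latency shrinks in absolute time because the clock count drops while the frequency doubles.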
L1 Data Cache

[Diagram: L1 data cache pipeline stages at the 2x clock. The low virtual address bits (VA 15:0) start the data array access early and probe the fast store-forwarding buffer; the upper bits (VA 31:16) complete the full address (VA 31:0), which is checked later against the tag array, TLB, and slow store buffer, with a replay issued on a mismatch]
L1 Data Cache

[Diagram: the same pipeline with the way predictor shown. Virtual address bits 15:11 and 10:6 index the way predictor (tag array) to produce an early way select for the data array; the remaining bits (19:16) are checked later, and a wrong way prediction is handled as a replay on the hit signal]
A Digression on Stores

Two components to a store:
- STA: address computation
- STD: data piece

Hybrid uOP:
- Single uOP in the front and back ends
- Two uOPs in the middle
Memory Disambiguation

  store: STA AddrB, STD DataB
  Ld EAX <- AddrA

If the store is older:
- And AddrA = AddrB
- Then the load must get DataB

Dependencies cannot be resolved until execution.
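The forwarding condition above can be written out directly (a sketch of the rule, not of the hardware): among the stores older than the load, the youngest one with a matching address supplies the data; otherwise the load reads memory.

```python
# Sketch of the disambiguation rule above (not the hardware): the load
# must take its data from the youngest older store to the same address;
# this can only be checked once both addresses are known at execution.

def load_value(load_addr, load_age, stores, memory):
    """stores: list of (age, addr, data); a larger age means younger."""
    match = None
    for age, addr, data in stores:
        if age < load_age and addr == load_addr:  # older, same address
            if match is None or age > match[0]:   # keep the youngest
                match = (age, data)
    return match[1] if match else memory[load_addr]

memory = {0xA0: "old value"}
stores = [(1, 0xA0, "DataB")]               # STA AddrB / STD DataB
print(load_value(0xA0, 2, stores, memory))  # older store matches: DataB
print(load_value(0xA0, 0, stores, memory))  # store is younger: old value
```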
Memory Disambiguation

Option 1 (K7?):
- Loads wait for all older STAs AND all older STDs
- No recovery

Option 2 (P4):
- Loads wait for all older STAs
- STD recovery

Option 3 (EV8):
- A predictor picks the specific older STAs and specific older STDs to wait for
- Complex recovery
Example

[Dataflow graph: LD1 feeds an XOR, whose result is stored (STA, STD); LD2 reloads the value and feeds a branch (JMP if ML = 0); ML changes from 0 to 1]

- The store writes a new value
- LD2 forwards from the store
- The branch resolves based on the new value
Example

[Same dataflow graph: LD1 -> XOR -> STA, STD -> LD2 -> BR (JMP if ML = 0)]

The mis-speculation case:
- LD1 misses, replays
- STA is dependent, replays
- LD2 hits, gets the old value
- BR mispredicts
Example

- If LD2 depends on the STA, they are usually part of the same dataflow graph
- If the STA replays, the LD usually has an address dependence
Cautious Mode

- Normally, aggressively schedule
- If a large number of problems occur, enter Cautious Mode
- In Cautious Mode, branches wait for data to be non-speculative
  - Increases branch misprediction latency
  - Completely eliminates the problems
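A state machine of this flavor might look like the following sketch (the counter, threshold, and decay policy are invented for illustration; the slide only says "a large number of problems"):

```python
# Illustrative sketch of a cautious-mode state machine (the threshold
# and decay policy are invented): count branch mispredictions caused
# by speculative data; past a threshold, make branches wait for
# non-speculative data; decay back to aggressive scheduling.

class CautiousMode:
    ENTER_THRESHOLD = 4

    def __init__(self):
        self.problem_count = 0
        self.cautious = False

    def on_branch_resolved(self, mispredicted_on_speculative_data):
        if mispredicted_on_speculative_data:
            self.problem_count += 1
            if self.problem_count >= self.ENTER_THRESHOLD:
                self.cautious = True   # branches wait for valid data
        elif self.problem_count > 0:
            self.problem_count -= 1
            if self.problem_count == 0:
                self.cautious = False  # back to aggressive scheduling

fsm = CautiousMode()
for _ in range(4):
    fsm.on_branch_resolved(True)
print(fsm.cautious)  # too many data-speculation mispredicts
```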
Cautious Mode: Implementation

A simple state machine cleans up the outliers:
- Out of 2200 traces, 3 traces speed up by more than 20%
- The other traces are unaffected
- Average performance improvement < 0.1%
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency

Lower Memory Latency
Reducing Latency

As frequency increases, it is important to improve the performance of the memory subsystem.

Data Prefetch Logic:
- Watches processor memory traffic
- Looks for patterns
- Initiates accesses
Data Prefetch Logic

[Diagram: the processor block diagram from earlier, with the Data Prefetch Logic highlighted]
Data Prefetch Logic

[Diagram: the Data Prefetch Logic sits beside the L2 Advanced Transfer Cache, between the L1 data cache / instruction fetch paths (64-bit and 256-bit) and the bus queue]

Prefetch logic first checks the L2 cache and then fetches lines from memory that miss the L2 cache.
Data Prefetch Logic

Watches for streaming memory access patterns:
- Can track 8 independent streams
- Loads, stores, or instructions
- Forward or backward

Analysis is done at 32-byte cache line granularity.

Looks for "mostly" complete streams:
- Access to cache lines 1, 2, 3, 4, 5, 6 will prefetch
- Access to cache lines 1, 2, 4, 5, 6 will prefetch
- 1, _, 3, _, _, 6, _, _, 9 will not prefetch
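The "mostly complete" rule can be sketched as a simple density test (the exact hardware heuristic is not published; the one-missing-line tolerance below is an assumption chosen to reproduce the three examples above):

```python
# Sketch of the "mostly complete" stream test; the gap tolerance is an
# assumption tuned to reproduce the slide's three examples.

def is_prefetch_stream(lines, max_gaps=1):
    """lines: ascending cache-line numbers seen for a candidate stream."""
    if len(lines) < 2:
        return False
    span = lines[-1] - lines[0] + 1   # lines the stream should cover
    missing = span - len(lines)       # holes in the access pattern
    return missing <= max_gaps

print(is_prefetch_stream([1, 2, 3, 4, 5, 6]))  # True: complete
print(is_prefetch_stream([1, 2, 4, 5, 6]))     # True: one gap
print(is_prefetch_stream([1, 3, 6, 9]))        # False: too sparse
```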
Performance

- High bandwidth front end
- Low latency core
- Lower memory latency