Microprocessors (A): From the 386 to the Pentium 4
Dr. Martin Land, Hadassah College, Spring 2004
Intel Processors
from 386 to Pentium 4
386
Intel 80386 Microprocessor
Simplified 386 Microprocessor
[Figure: block diagram of the simplified 386. Blocks: Bus Interface Unit (Address, Data, Control), Paging Unit, Segmentation Unit, Shadow Registers, Instruction Prefetch, Instruction Decoder, Decode and Sequencing, ALU, Registers. Signals: Physical Address, Linear Address, Effective Address (Offset), Code Stream (linear byte sequence from CS), Code Address, Displacements, MicroCode, Status Flags, ALU (Data) Bus.]
Prefetch loads instruction bytes whenever there are no data accesses.
Decoder identifies instruction boundaries and sends displacements to Address Management.
Decode/Sequence generates microcode for instruction execution.
ALU sends Effective Address to Address Management for data access.
Address Management handles segmentation and paging.
Registers are updated in the last step.
Problems with Pipelining the 386
[Figure: simplified 386 block diagram, repeated from the previous slide.]
No Internal Data Cache
  All data accesses are external (slow)
  Unified memory access causes structural hazard on data accesses
Instruction dependencies
  Load+ALU operations stall during load of data operands
  Conditional branches read status flags set by ALU instructions
  Load+ALU operations use register-based pointers (depend on previous write-backs)
  Branches cause a flush of the Instruction Prefetch queue.
486
Upgrade of 80386
New Features in 486:
• Pipelines 386 instruction execution
• Floating-Point Unit (FPU) integrated on-chip
• 8 or 16 KB L1 data cache on chip
• Support for external L2 data cache
• Multiprocessor support
• Support for battery operated notebook PC
Pipeline Organization
Each pipeline stage executes in one clock cycle
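[Figure: 486 pipeline. Stages: Instruction Fetch, Stage-1 Decode, Stage-2 Decode, Execute, Write Back. Instruction Memory supplies the fetch stage (Address, Instruction); Data Memory serves the execute stage (Address, Data); a Forwarding path feeds results back.]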
Five Pipelined Stages
Instruction Prefetch (PF)
Stage-1 Decode (D1)
  Instruction Identification
  Identify source operands:
    Identify Register source
    Calculate Effective Address for Data Memory (cache) source
Stage-2 Decode (D2)
  Complete complex Effective Addresses
  Generation of Microcode
Execution (EX)
  Integer ALU
  FP ALU
  Data memory writes
    Write to fast memory buffer
    Buffer updates cache
Register Writeback (WB)
486 Internal Organization
[Figure: 486 internal organization. Blocks: Bus Interface Unit (BIU), Instruction Prefetch, Cache, Decoder, MMU, ALU, FPU.]
Intel Architecture Floating Point Unit (FPU)
8087 numeric processor
  Separate 8086 integer CPU and 8087 FPU
387 DX and SX math coprocessors
  Implement the final IEEE STD 754
  Added new trigonometric instructions
486 processor FPU
  On-chip equivalent of the Intel 387 DX math coprocessor
  IEEE STD 754
Pentium FPU
  Completely redesigned FPU
  Conformance to both the IEEE STD 754 and 854
  Algorithms with three times the performance of 486
  Shortcut cost Intel a lot of money to correct (the FDIV divide bug)
Support for Battery Operated Notebook PC
System Management Mode (SMM)
  Special purpose interrupt
  Address space for storing processor state
  Transparent to OS and applications software
Stop Clock States
  Initiated by external signal (hardware control)
  “Fast Wake-Up” Stop Grant state
    Stops processor I/O operations
  “Slow Wake-Up” Stop Clock state
    CLK frequency → 0 MHz
Auto Halt Power Down
  Similar to Stop Clock
  Initiated by HALT instruction (software control)
Dynamic Local Power Management
  Subsystems switch themselves off when not needed
Pentium
Superscalar Architecture
Two integer instruction pipelines
  “U” pipe can execute all integer instructions
  “V” pipe can execute “simple integer” instructions
Floating Point Unit integrated with integer pipelines
Each pipeline can issue most instructions in one clock cycle
  Instruction issue: instruction enters the execution stage (after fetch and decode have completed)
Instruction Pairing
Process of issuing two instructions in parallel
When instructions are paired:
  First instruction issued to the U-pipe
  Next sequential instruction issued to V-pipe
Pairing not possible if:
  The instructions have dependencies
  Either instruction is complex
[Figure: instructions I1 to I4 issued in pairs to the U-pipe and V-pipe, each passing through pipeline stages 1 to 5.]
Pipeline Stages for Pentium (without MMX)
PF  Prefetch
D1  Instruction Decode
D2  Address Generate
EX  Execution
WB  Write Back
[Figure: a shared PF stage (fed from RAM) supplies two parallel pipes, U and V, each with its own D1, D2, EX, and WB stages.]
Pipeline stages are very similar to 486 stages (not identical)
Pentium Block Diagram
Integer Instruction Pairing Rules
Pairing: two instructions issued on the same clock cycle (one to U-pipe and one to V-pipe)
Pairing requires the following conditions:
1. Both instructions must be “simple”
2. No RAW or WAW register dependencies between instructions
3. Register dependencies include pointers and flags
4. Neither instruction contains both displacement and immediate
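The four conditions above can be sketched as a simple check. This is only an illustration: the `Instr` record and its fields are hypothetical, the caller must supply which registers and flags each instruction reads and writes, and the real Pentium has additional special cases that are ignored here.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    text: str
    reads: set = field(default_factory=set)    # registers/flags read, including pointer registers
    writes: set = field(default_factory=set)   # registers/flags written
    simple: bool = True                        # rule 1: "simple" instruction
    has_displacement: bool = False
    has_immediate: bool = False


def can_pair(u: Instr, v: Instr) -> bool:
    """Return True if u (U-pipe) and v (V-pipe) satisfy the four listed rules."""
    if not (u.simple and v.simple):            # rule 1
        return False
    if u.writes & v.reads:                     # rules 2 and 3: RAW (pointers and flags included)
        return False
    if u.writes & v.writes:                    # rules 2 and 3: WAW
        return False
    for instr in (u, v):                       # rule 4
        if instr.has_displacement and instr.has_immediate:
            return False
    return True


mov_a = Instr("MOV EAX, EBX", reads={"EBX"}, writes={"EAX"})
mov_b = Instr("MOV ECX, EDX", reads={"EDX"}, writes={"ECX"})
mov_raw = Instr("MOV ECX, EAX", reads={"EAX"}, writes={"ECX"})
mov_ptr_set = Instr("MOV ESI, EDI", reads={"EDI"}, writes={"ESI"})
mov_ptr_use = Instr("MOV EAX, [ESI]", reads={"ESI"}, writes={"EAX"})
mov_disp_imm = Instr("MOV [EBP-02], 0001", reads={"EBP"},
                     has_displacement=True, has_immediate=True)

print(can_pair(mov_a, mov_b))              # independent: pairs
print(can_pair(mov_a, mov_raw))            # RAW on EAX: no pairing
print(can_pair(mov_ptr_set, mov_ptr_use))  # pointer dependency on ESI: no pairing
print(can_pair(mov_b, mov_disp_imm))       # displacement plus immediate: no pairing
```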
Branch Prediction ⎯ 1
Branch Target Buffer (BTB) is a special cache that stores information about branch instructions:
  Source address (identifies particular branch instruction)
  Target address (“jump to” address)
  2 History bits provide 4 states:
    (11) strongly taken
    (10) weakly taken
    (01) weakly not taken
    (00) strongly not taken
On a branch instruction, BTB makes predictions about branches:
  Branch Taken or Branch Not Taken (by high order history bit)
  Target address (if Taken)
D1 decoder (Stage 2) reads prediction from the BTB
Instructions are fetched according to prediction
Branches that “miss” in BTB are treated as not taken
Branch Prediction ⎯ 2
Branch Predictions are verified in EX or WB
On first verification of a branch instruction:
If Not Taken, no BTB entry is made
If Taken, the BTB creates a new entry:
  Instruction address of branch instruction
  Branch target address
  Prediction that branch is strongly taken
Branch Prediction ⎯ 3
On subsequent executions of the same branch instruction:
When branch instruction enters D1,
  D1 decoder reads the prediction from the BTB
  On a Not Taken prediction, the next instruction in the Sequential Prefetch Buffer is sent to D1
  On a Taken prediction, the Prediction Prefetch Buffer prefetches and sends instructions to D1
When branch instruction enters EX,
  The branch is verified as Taken or Not Taken
  On correct prediction,
    U-pipe and V-pipe continue
    BTB entry is updated (history bits adjusted up or down)
  On misprediction
    Both pipelines are flushed
    BTB entry is updated (history bits and branch target)
Branch Prediction ⎯ 4
For typical loops, branches are mispredicted:
  On the first run (BTB miss ⇒ mispredicted as not taken)
  On the last run (mispredicted as taken)
Example:
        MOV [EBP-02], 0001
  FOO:  INC [EBP-02]
        CMP [EBP-02], 0400
        JLE FOO
  NEXT: ADD EAX, EBX
Loop runs 400h = 1024 (decimal) times
On first run of JLE FOO
  BTB miss ⇒ mispredicted as not taken
  3 stall cycles for pipeline flush
  BTB entry for JLE FOO as strongly taken
On next 1022 runs of JLE FOO
  BTB correctly predicts as taken with no stall cycles
On last run of JLE FOO
  BTB hit ⇒ mispredicted as taken
  3 stall cycles for pipeline flush
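The loop example can be checked with a small simulation of one BTB entry. This is a sketch under the rules stated on these slides (miss predicts not taken, a new entry starts as strongly taken, the high-order history bit gives the prediction, 3 stall cycles per flush); the function name and the outcome-list encoding are assumptions made for illustration.

```python
def simulate_branch(outcomes, flush_penalty=3):
    """Return (mispredictions, stall_cycles) for one branch instruction.

    outcomes: sequence of booleans, True when the branch is actually taken.
    state: None means no BTB entry yet; otherwise a 2-bit counter 0..3
    (00 strongly not taken ... 11 strongly taken).
    """
    state = None
    mispredictions = 0
    for taken in outcomes:
        # A BTB miss is treated as not taken; a hit predicts by the high-order bit.
        predicted_taken = state is not None and state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        if state is None:
            if taken:
                state = 3                  # new entry: strongly taken
        elif taken:
            state = min(state + 1, 3)      # adjust history bits up
        else:
            state = max(state - 1, 0)      # adjust history bits down
    return mispredictions, mispredictions * flush_penalty


# JLE FOO executes 1024 times: taken 1023 times, then falls through once.
loop_outcomes = [True] * 1023 + [False]
print(simulate_branch(loop_outcomes))      # (2, 6)
```

The two mispredictions are the first and last executions of JLE FOO, for 6 stall cycles in total.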
Branch Prediction ⎯ 5
Example with nested loops:
        MOV [EBP-02], +01
  FOO:  MOV [EBP-04], +00
  BAR:  INC [EBP-04]
        CMP [EBP-04], +03
        JLE BAR
        INC [EBP-02]
        CMP [EBP-02], +03
        JLE FOO
  NEXT: ADD EAX, EBX
        SUB EDX, ECX
        ADD EAX, EBX
On first run, JLE BAR misses in BTB
  Mispredicted as not taken
  New BTB entry as strongly taken
  3 stall clocks
On following runs, JLE BAR predicted as taken
  Correctly predicted, until end of loop
  No stall clocks
At end of inner loop,
  Flushed with 3 stall clocks
  Marked weakly taken in BTB
On next FOO loop, JLE BAR predicted as (weakly) taken
  Correctly predicted, until end of loop
At end of inner loop,
  Flushed with 3 stall clocks
  Marked weakly taken in BTB
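The behaviour of the JLE BAR entry across passes of the outer loop can be traced with a short sketch. It assumes the same 2-bit counter rules as the earlier slides; the function name is hypothetical, and the default arguments encode this example (the inner branch is taken three times and then falls through, and the outer loop makes three passes).

```python
def trace_inner_branch(outer_passes=3, inner_outcomes=(True, True, True, False)):
    """Return one (stall_clocks, final_state) pair per pass of the outer loop.

    final_state is the 2-bit history value after the inner loop exits:
    3 = strongly taken, 2 = weakly taken, 1 = weakly not taken, 0 = strongly not taken.
    """
    state = None                       # no BTB entry before the first execution
    history = []
    for _ in range(outer_passes):
        stalls = 0
        for taken in inner_outcomes:
            predicted_taken = state is not None and state >= 2
            if predicted_taken != taken:
                stalls += 3            # pipeline flush
            if state is None:
                state = 3 if taken else None
            elif taken:
                state = min(state + 1, 3)
            else:
                state = max(state - 1, 0)
        history.append((stalls, state))
    return history


print(trace_inner_branch())            # [(6, 2), (3, 2), (3, 2)]
```

The first pass pays for the BTB miss and the loop exit; every later pass pays only for the loop exit and leaves the entry weakly taken.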
Integrated On-Chip Split Cache — 1
Separate code and data caches integrated on-chip
Each cache is 8 Kbytes in size
32-byte line (block) size
2-way set associative
Each cache has a dedicated TLB
  TLB = translation look-aside buffer, which caches linear address to physical address translations
Integrated On-Chip Split Cache — 2
Data cache has two ports (one for each pipe)
The cache tags are triple ported
  Allow three simultaneous inquire cycles: u-pipe, v-pipe and I/O unit
Code cache closely integrated with
  Branch prediction hardware
  Prefetch buffers
Not all main memory must be (can be) cached
  Instructions can be fetched from code cache or directly from main memory
MMX
MMX™ Technology Programming Environment
MMX = Multimedia Extensions
Vector extensions to 32-bit Intel Architecture (IA32)
SIMD execution model
  Single-instruction
  Multiple-data
MMX ALU integrated into Pentium pipeline
MMX Instructions added to ISA
  No new mode or operating system visible state
  All existing software runs as before
Single Instruction, Multiple Data (SIMD) Execution Model
Similar to Very Long Instruction Word (VLIW) machine
Add Packed
PADDB: Add Packed Signed Bytes with Wraparound
PADDSB: Add Packed Signed Bytes with Signed Saturation
PADDUSB: Add Packed Unsigned Bytes with Unsigned Saturation
[Figure: 64-bit SRC and DEST operands, each divided into eight byte lanes (bits 63..56, 55..48, 47..40, 39..32, 31..24, 23..16, 15..8, 7..0). Each SRC byte is added to the corresponding DEST byte and the sum is written back to DEST.]
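The difference between wraparound and saturation is easiest to see lane by lane. The following is a sketch that models the three byte-wise adds on 64-bit Python integers; the helper names are made up for illustration and the code is not a description of the hardware.

```python
def _lanes(value):
    """Split a 64-bit value into eight unsigned byte lanes, lane 0 = bits 7..0."""
    return [(value >> (8 * i)) & 0xFF for i in range(8)]


def _pack(lanes):
    """Pack eight byte lanes back into a 64-bit value (two's complement per lane)."""
    result = 0
    for i, lane in enumerate(lanes):
        result |= (lane & 0xFF) << (8 * i)
    return result


def _signed(byte):
    """Interpret an unsigned byte as a signed 8-bit integer."""
    return byte - 256 if byte >= 128 else byte


def paddb(dest, src):
    """Wraparound: each lane keeps only the low 8 bits of the sum."""
    return _pack([d + s for d, s in zip(_lanes(dest), _lanes(src))])


def paddsb(dest, src):
    """Signed saturation: each lane is clamped to -128..127."""
    return _pack([max(-128, min(127, _signed(d) + _signed(s)))
                  for d, s in zip(_lanes(dest), _lanes(src))])


def paddusb(dest, src):
    """Unsigned saturation: each lane is clamped to 0..255."""
    return _pack([min(255, d + s) for d, s in zip(_lanes(dest), _lanes(src))])


dest = 0x7F80FF      # lanes 2..0 hold 0x7F, 0x80, 0xFF; the upper lanes are zero
src = 0x010101       # add 1 to each of the three low lanes
print(hex(paddb(dest, src)))    # 0x808100
print(hex(paddsb(dest, src)))   # 0x7f8100
print(hex(paddusb(dest, src)))  # 0x8081ff
```

Lane 0 (0xFF + 1) wraps to 0x00 with PADDB but stays at 0xFF under unsigned saturation; lane 2 (0x7F + 1) stays at 0x7F under signed saturation.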
Packed Multiply and Add (pmadd)
DEST[31..0] ← (DEST[15..0] × SRC[15..0]) + (DEST[31..16] × SRC[31..16])
DEST[63..32] ← (DEST[47..32] × SRC[47..32]) + (DEST[63..48] × SRC[63..48])
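The two formulas above can be modelled directly on 64-bit integers. This sketch assumes the 16-bit words are signed and that each 32-bit sum is stored modulo 2^32 (as in the MMX PMADDWD instruction); the function names are chosen for illustration.

```python
def _s16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pmadd(dest, src):
    """Multiply four pairs of 16-bit words and add adjacent products.

    DEST[31..0]  = DEST[15..0]*SRC[15..0]   + DEST[31..16]*SRC[31..16]
    DEST[63..32] = DEST[47..32]*SRC[47..32] + DEST[63..48]*SRC[63..48]
    """
    result = 0
    for half in range(2):
        base = 32 * half
        low = _s16(dest >> base) * _s16(src >> base)
        high = _s16(dest >> (base + 16)) * _s16(src >> (base + 16))
        result |= ((low + high) & 0xFFFFFFFF) << base
    return result


# Words (from bit 0 upward): dest = 1, 2, 3, 4 and src = 5, 6, 7, 8
dest = 0x0004000300020001
src = 0x0008000700060005
out = pmadd(dest, src)
print(out & 0xFFFFFFFF, out >> 32)   # 17 53
```

The low result is 1×5 + 2×6 = 17 and the high result is 3×7 + 4×8 = 53, which is the core step of a dot product.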
P6 Architecture for Pentium II, III, 4
P6 Architecture
New hardware architecture
Supports IA32 Instruction Set Architecture (ISA)
  Instructions, registers, data types, addressing modes, etc.
  From outside, P6 looks like any other IA32 machine
Internal operations in a RISC core machine
First introduced in the Pentium Pro (1995)
Architectural basis for Pentium II, III, and 4
Main P6 Architecture Features
Internal RISC core machine
  IA32 instructions recompiled to RISC ISA
Greater ILP than in Pentium
  ILP = Instruction Level Parallelism
Deeper branch prediction than in Pentium
  Larger branch cache in BTB
Out-of-order instruction execution
  Instructions run through pipeline
    In most convenient order
    Not in the program listing order
P6 Architecture Subsystems
P6 Instruction Fetch
[Figure: P6 instruction fetch. Labels: I/O Operations, Memory Access, Fetch, Instruction Cache Updates, Data Cache Updates.]
P6 Instruction Pool
Pooling
  Pool of micro-operations which can be executed in any convenient order
P6 Instruction Execution
Find an Instruction Ready to Execute
Return Executed Instruction to Pool
Data Reads
P6 Retirement
Retire Finished Instructions in Original Program Order
Data Writes
Register Updates
P6 Subsystems
Memory Subsystem
Processing Units
  Fetch/Decode Unit
  Dispatch/Execute Unit
  Retire Unit
  Instruction Pool
IA 32 Registers
Memory Subsystem
System Bus
  External computer memory bus
  Connection to main RAM
  36-bit address bus (physical address space of 64 GBytes)
L2 cache ⎯ unified 256 KB
Bus Interface Unit (BIU) ⎯ controls L1 access to L2 and RAM
L1 cache
  8 KB data
  8 KB instruction
IA32 Register File
IA32 instruction set defines familiar registers
  Standard register set since 386
  Using IA32 registers is required for instruction set compatibility
IA32 registers are used as P6 source and destination operands
Internal calculations use a larger RISC-type register set (not visible to programmer)
Processing Units
Fetch/Decode Unit
  Fetches IA 32 instructions
  Converts each IA32 instruction to one or more (RISC-type) micro-ops
  Places independent micro-ops into Instruction Pool
Dispatch/Execute Unit executes micro-ops
  Identifies dependencies
  Performs branch prediction
  Chooses instructions which are ready to execute
  Returns results to Instruction Pool
Retire Unit
  Converts micro-op results back to IA 32 format
  Preserves original program order
  Updates IA 32 registers
Micro-Operations (Micro-Ops)
Independent RISC-like primitive instructions
Triadic instructions
  Two logical sources and one logical destination
  “logical source” = not visible to programmer
Each simple IA32 instruction is converted into one micro-op (example: MOV AX,BX)
Complex instructions are decoded into 2 to 4 micro-ops (example: MOV AX,[BX+SI+78])
Very complex instructions are decoded into preprogrammed micro-op sequences (example: SQRT)
Register Alias Table (RAT)
Last stage in decoding process
Aliases IA 32 register references to GP registers
Adds status bits to micro-ops to aid scheduling
Passes micro-ops to the Instruction Pool
No instruction reordering yet
Micro-op stream is a RISC-equivalent of the decoded IA32 instruction stream
Dynamic Execution
Micro-ops not executed in original program order
Micro-ops are executed when ready
  All source operands are available
Requires three conceptual ingredients:
• Deep Branch Prediction
• Dynamic Data Flow Analysis
• Speculative Execution
Deep Branch Prediction
Extends Pentium branch prediction:
  Predicts branches to several nested levels
  Requires larger statistical record than Pentium
Implemented in instruction fetch/decode unit
Includes branches, calls, and interrupts
Dynamic Data Flow Analysis
Monitors micro-ops
Looks for data and register dependencies
Locates any micro-ops ready for execution (Ready = all source operands are available)
Enables out-of-order execution
Keeps the execution units busy
Speculative Execution
Execute instructions ahead of the program counter
  Execute instructions before “normal fetch” time
  Branch Prediction determines most likely instructions for execution
Store results in temporary registers
  Some executed instructions will never be used
Commit the result of each instruction
  Only if the speculation is a correct prediction
  In the original program order
Register Dependencies
IA 32 has 8 “general purpose” registers
Small register set can cause data hazard stalls
  MOV BX, [SI+1234]
  ADD BX, [BX]
Decoding to micro-ops aliases IA-32 registers
  40 general purpose 32-bit registers in RISC core
  Decoder assigns a RISC register to an IA-32 register
  Can assign multiple GP registers to one IA-32 register
  Can prevent dependencies
  Handle integers and floating point data
Pentium II, III, 4
Pentium II
Pentium Pro with MMX™ Technology
Pipeline Sections Renamed
  Fetch/Decode Unit → In-Order Issue Front-end
  Dispatch/Execute Unit → Out-of-Order Core
  Retire Unit → In-Order Retirement unit
Pentium III
Maintains the P6 architecture
Supports all IA-32 features up to Pentium II
Pentium II with Streaming SIMD Extensions (SSE)
  Floating Point version of MMX
  Single Instruction Multiple Data (vector) FPU
Note:
• All RISC-type processors perform integer instructions very efficiently.
• As multimedia programming became more important in the 1990s, the measure of processor speed shifted to Floating Point efficiency.
Pentium 4
Maintains the P6 architecture
Supports all IA-32 features up to Pentium III
Redesign of the P6 pipeline model
Netburst Micro-Architecture
  Superpipelining
  Deeper Branch Prediction
  Front End Pipeline Cache Subsystem
  Quad-Pumped I/O Bus
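Superpipelining Technique
Divide each stage into 2 stages:
• Each stage does half the work
• Each stage requires half the time
• Double the clock rate (divide the clock cycle time): τ → τ/2

Normal pipeline (5 stages, cycle time τ):
Cycle  1   2   3   4   5   6   7
I1     S1  S2  S3  S4  S5
I2         S1  S2  S3  S4  S5
I3             S1  S2  S3  S4  S5

$$\mathrm{CPI}^{\text{normal}}_{\text{ideal}} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 4}{IC} \;\xrightarrow{\;IC\ \text{large}\;}\; 1$$
$$\text{Run-Time}^{\text{normal}}_{\text{ideal}} = \mathrm{CPI}^{\text{normal}}_{\text{ideal}} \times IC \times \tau \;\xrightarrow{\;IC\ \text{large}\;}\; IC \times \tau$$

Superpipeline (10 stages, cycle time τ/2):
Cycle  1   2   3   4   5   6   7   8   9   10  11  12
I1     S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
I2         S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
I3             S1  S2  S3  S4  S5  S6  S7  S8  S9  S10

$$\mathrm{CPI}^{\text{super}}_{\text{ideal}} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 9}{IC} \;\xrightarrow{\;IC\ \text{large}\;}\; 1$$
$$\text{Run-Time}^{\text{super}}_{\text{ideal}} = \mathrm{CPI}^{\text{super}}_{\text{ideal}} \times IC \times \frac{\tau}{2} \;\xrightarrow{\;IC\ \text{large}\;}\; IC \times \frac{\tau}{2} = \frac{1}{2}\,\text{Run-Time}^{\text{normal}}_{\text{ideal}}$$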
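The ideal run times can be checked numerically. This is a sketch that assumes no stalls and one instruction entering the pipeline per cycle, so a program of IC instructions on an n-stage pipeline needs IC + n − 1 cycles; the function names are chosen for illustration.

```python
def pipeline_cycles(instruction_count, stages):
    """Cycles for an ideal pipeline: fill time (stages - 1) plus one cycle per instruction."""
    return instruction_count + stages - 1


def run_time(instruction_count, stages, cycle_time):
    """Ideal run time = cycles x cycle time."""
    return pipeline_cycles(instruction_count, stages) * cycle_time


tau = 1.0                 # cycle time of the 5-stage pipeline (arbitrary units)
ic = 1_000_000            # a "large" instruction count

normal = run_time(ic, stages=5, cycle_time=tau)
superpipelined = run_time(ic, stages=10, cycle_time=tau / 2)

print(pipeline_cycles(3, 5), pipeline_cycles(3, 10))   # 7 12, as in the two diagrams
print(superpipelined / normal)                          # close to 0.5 for large IC
```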
Superpipelining in Pentium 4
Rapid Execution Engine
  Higher clock speed
Hyper Pipelined Technology
  20 stage pipeline (double the Pentium III pipeline length)
  Each stage does less processing work
Typical instruction requires same processing time
  Half the time in each stage
  Double the number of stages
Ideally, doubles number of instructions finished per second
  Finish one instruction per cycle
  Twice the cycles per second
Deeper Branch Prediction
Expanded Branch Target Buffer (BTB)
  4 K-entries
  Was 256 in Pentium
Expanded Instruction Pool
  126 instructions in various stages of execution
  Was 40 in Pentium Pro
Improved branch prediction algorithm
New Instruction Cache Subsystem
Called Front End Pipeline
IA-32 Instruction Cache is extended to 128 byte line-size
  Was 32 bytes in Pentium II/III
Caching for Micro-ops
  Trace = micro-op sequence for one IA-32 instruction
  A Trace Cache stores decoded micro-op traces
  Loops re-use cached traces
  Skips additional decode of same IA-32 instructions
Quad-Pumped I/O Bus
New organization of I/O bus
Bus cycles determined by 100 MHz clock
Can make 4 transfers per bus cycle
  4 transfers/cycle × 100 MHz = 400 M-Transfers per second
Data bus width of 8 bytes (64 bits)
  8 bytes/transfer × 400 M-Transfers per second = 3200 MB/second = 3.2 GB/second
Superpipelining
Rapid Execution Engine
  Higher clock speed
  ALU operations take ½ clock cycle
Hyper Pipelined Technology
  20 stage pipeline (double the Pentium III pipeline length)
  Each stage does less processing work
Typical instruction requires same processing time
  Half the time in each stage
  Double the number of stages
Ideally, doubles number of instructions finished per second
  Finish one instruction per cycle
  Twice the cycles per second
Pentium 4 Performance Issues
Advertising
  Pentium 4 processors have very high clock speeds
  Range from 1.4 GHz to 4 GHz
Reality
  Higher clock speeds result from superpipelining
  Clock speed has a different meaning than in P II/III
How should we compare clock speeds?
Expectation
  1.5 GHz processor is 50% faster than same processor at 1.0 GHz
Measurement
  1.5 GHz Pentium 4 is 20% faster than 1.0 GHz Pentium III
Problems With Superpipelining
Not all operations can be divided into smaller stages
  PUSH/POP can easily be superpipelined: Split single PUSH stage into
    1. SP-- stage
    2. [SP] ← value stage
  IMUL/DIV/CMP may be harder to split
Some stalls depend on clock cycles and on real time
  Pipeline flush:
    Clock runs at twice the speed
    Superpipeline
      Twice as many instructions in pipeline
      Twice as many wasted pipeline cycles were run and cancelled
    Pipeline flush penalty does not scale with clock frequency
  Cache penalties depend on reaction time of memory
CPIstall and Effective Clock Rate
Suppose that, on average, every 2nd instruction will stall for 1 cycle
  CPIstall (Pentium-4) ≈ 0.5 cycles/instruction (plus Pentium-III stalls)
  CPItotal (Pentium-4) ≈ 1.5 cycles/instruction (plus Pentium-III stalls)
  Pentium-4 effective clock speed ≈ (Pentium-4 clock speed)/1.5
Pentium-4 with 1.5 GHz clock has effective clock of a Pentium-III with 1.0 GHz clock

With clock rate R = 1/τ:
$$\text{Run-Time}^{\text{super}} = \mathrm{CPI}^{\text{super}}_{\text{total}} \times IC \times \tau = \mathrm{CPI}^{\text{super}}_{\text{total}} \times IC \times \frac{1}{R} = \left(1 + \mathrm{CPI}^{\text{super}}_{\text{stall}}\right) \times IC \times \frac{1}{R} = IC \times \frac{1}{R_{\text{effective}}}$$
$$R_{\text{effective}} = \frac{R}{1 + \mathrm{CPI}^{\text{super}}_{\text{stall}}} \quad \text{(Effective Clock Rate)}$$
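The effective clock rate can be worked through with the numbers from this slide. This is a sketch only: it assumes an ideal CPI of 1 plus the average stall cycles per instruction, and the function names are chosen for illustration.

```python
def effective_clock_rate(clock_rate_ghz, cpi_stall):
    """R_effective = R / (1 + CPI_stall)."""
    return clock_rate_ghz / (1.0 + cpi_stall)


def run_time_seconds(instruction_count, clock_rate_ghz, cpi_stall):
    """Run time = CPI_total x IC x (1 / R), with CPI_total = 1 + CPI_stall."""
    cycles = instruction_count * (1.0 + cpi_stall)
    return cycles / (clock_rate_ghz * 1e9)


# Every 2nd instruction stalls for 1 cycle: CPI_stall = 0.5
print(effective_clock_rate(1.5, 0.5))            # 1.0 (GHz)

# Same run time as a 1.0 GHz machine with no extra stalls
print(run_time_seconds(1e9, 1.5, 0.5))           # 1.0 (seconds)
print(run_time_seconds(1e9, 1.0, 0.0))           # 1.0 (seconds)
```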
Fair Comparison
Accounting for the different meanings of clock speed:
Compared to the 1.0 GHz Pentium-III,
  1.5 GHz Pentium-4 is 20% faster on SPECint2000
  1.5 GHz Pentium-4 is 75% faster on SPECFP2000
Speed-up is result of the architectural enhancements
  A very reasonable performance improvement
Code compiled with Pentium-4 optimization is faster than older code
Intel Itanium
IA-64
Itanium Overview
Intel's 64-bit architectural plan*
Goals of the Itanium architecture
  Support 64-bit addresses
  IA-32 backward compatibility
  Increase instruction level parallelism (ILP)
  Improve branch handling
  Reduce hardware burden using compile-time information
  Improve floating point performance
New Methodology
  Explicitly Parallel Instruction Computing (EPIC)
    Compiler identifies instruction dependencies
    Compiler reschedules instructions for optimized execution
    Compiler groups instructions for parallel issue to Execution Units
  Hard work done once by compiler (not each time by hardware)
Operating Environments
Data Types
Pointers: 8 bytes
Integer: 1, 2, 4, and 8 bytes (byte, word, doubleword, quadword)
Floating Point: single, double and double-extended formats
New Features in Itanium Instruction Set
RISC-like syntax
  Load-Store architecture
  Uniform instruction length (41 bits)
Explicit instruction parallelism
  Compiler chooses instructions to run in parallel
  Compiler provides hints to the processor
  Predication replaces branching
More flexible use of registers
  128 integer and floating-point registers
  Register renaming replaces “spill and fill” on calls
  Register rotation allows parallelization of loops
Instruction Format
General Syntax:
  [(qp)] mnemonic[.comp1][.comp2] dests = srcs

(qp)            qualifying predicate register
mnemonic        name identifying the instruction
[comp1][comp2]  completers indicate optional variations on basic mnemonic
dests, srcs     source operands are registers or immediates; destination is typically a register
Instruction Format Examples
Simple Instruction
  add r1 = r2, r3
  r1 ← r2 + r3
Instruction with Immediate
  add r1 = r2, r3, 1
  r1 ← r2 + r3 + 1
Instruction with Completer
  cmp.eq p3 = r2, r4
  if (r2 eq r4) then p3 ← 1
Predicated Instruction
  (p4) add r1 = r2, r3
  if (p4=1) then {r1 ← r2 + r3}
  if (p4=0) then {NOP}
Predication Replaces Conditional Branches
Conditional execution of predicated instructions
Example: if (p5) r1 = r2 + r3
  Executes ADD if p5 = 1
  Executes NOP if p5 = 0
Predicate registers
  64 predicate registers: pr0 ⎯ pr63
  Set/Clear by compare instructions
Advantages over conditional branch
  Eliminate misprediction penalties
  Allow larger parallel instruction blocks (no dependencies)
Predication Example
High level code
  if (a > b) c = c + 1
  else d = d * e + f
Predicated code
  pT, pF = compare(a > b)
  if (pT) c = c + 1
  if (pF) d = d * e + f
Compare sets pT or pF
Compiler schedules the two if instructions in parallel
  No conditional branch
  No misprediction penalty on either outcome
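The equivalence of the two forms can be checked with a small software model. This is a sketch, not IA-64 code: both operations are computed unconditionally and a predicate decides which result is committed, with Python conditional expressions standing in for the hardware's predicated write. The function names are chosen for illustration.

```python
def branching(a, b, c, d, e, f):
    """High level form: only one side of the if/else executes."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def predicated(a, b, c, d, e, f):
    """Predicated form: both operations issue; predicates gate the commit."""
    p_true = a > b                 # compare sets pT ...
    p_false = not p_true           # ... and its complement pF
    c_result = c + 1               # (pT) c = c + 1, computed regardless
    d_result = d * e + f           # (pF) d = d * e + f, computed regardless
    c = c_result if p_true else c  # committed only when pT = 1, otherwise acts as NOP
    d = d_result if p_false else d # committed only when pF = 1, otherwise acts as NOP
    return c, d


print(predicated(5, 3, 10, 2, 4, 1))   # (11, 2): a > b, so only c changes
print(predicated(1, 3, 10, 2, 4, 1))   # (10, 9): a <= b, so only d changes
```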
Explicitly Parallel Instruction Computing (EPIC)
Very Long Instruction Word (VLIW) format
  Instruction Bundle: 3 instructions in a VLIW
  Instruction Group: 1 or more instruction bundles
Instruction Group:
  No data dependencies among instructions
  May be executed in parallel (according to program logic)
At compile time, compiler
  Identifies data dependencies
  Forms Instruction Bundles
  Marks Instruction Groups
  Determines ordering of instruction execution
Instruction Bundles
Instruction Bundle
  Three Instructions and a Template Field
  16 byte length: 3 × 41 bits + 5 bits = 128 bits
  Aligned at 16-byte boundaries in memory
  Contain no RAW or WAW dependencies
Template Field
  Maps each instruction to Execution Unit type
  Integer (I), Float (F), Memory (M), Branch (B)
Processor executes the three instructions in parallel
Instruction Groups
Instruction Group
  Sequence of Instruction bundles
  Instructions without RAW or WAW dependencies
  At least one instruction
  Number of instructions is not limited
Instruction groups end at
  Branch instructions
  Cycle Breaks (;;)
    Inserted by compiler to indicate data hazards (dependencies)
Processor seeks to execute all bundles in a group in parallel
Instruction Groups ⎯ Example
r1 = r2 + r3
r4 = r5 + r6
r7 = r8 + r9 ;;
r10 = r4 + r11
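Where the cycle break falls in this example can be reproduced with a short sketch of what the compiler does. It uses a simple greedy rule, assumed here for illustration: start a new group whenever an instruction reads (RAW) or rewrites (WAW) a register written earlier in the current group. The tuple encoding of instructions and the function name are hypothetical.

```python
def split_into_groups(instructions):
    """Split (dest, sources) tuples into instruction groups.

    A new group begins, as if a ';;' had been placed before it, when an
    instruction has a RAW or WAW dependency on the current group.
    """
    groups = [[]]
    written = set()                    # registers written in the current group
    for dest, sources in instructions:
        raw = any(src in written for src in sources)
        waw = dest in written
        if raw or waw:
            groups.append([])          # cycle break
            written = set()
        groups[-1].append((dest, sources))
        written.add(dest)
    return groups


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r8", "r9")),
    ("r10", ("r4", "r11")),            # reads r4, written in the first group
]

for number, group in enumerate(split_into_groups(program), start=1):
    print(number, [dest for dest, _ in group])
# 1 ['r1', 'r4', 'r7']
# 2 ['r10']
```

The break lands after the third instruction because the fourth reads r4.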
General Registers
General Registers
  128 registers ⎯ GR0 through GR127
  64-bit width
  NaT (Not a Thing) bit
    Mark deferred speculative exceptions
Two Subsets
  GR0 ⎯ GR31: Static General Registers
    GR0 always holds zero as source
    Write to GR0 causes an Illegal Operation fault
  GR32 ⎯ GR127: Stacked General Registers
    Available to application program
    Acquire by allocating a Register Stack Frame
    Act as Local and Output Registers
“Spilling” and “Filling” Problem
In an IA-32 procedure call
  Calling procedure uses IA-32 registers
  Called procedure needs the same registers
Procedure call causes many memory accesses
  Calling procedure saves register values in memory
  Calling task passes parameters by pushing to stack
  Called task returns by register or by stack
  Calling task restores its previous register state
Stacked General Registers
Stacked General Registers
  Procedure Calls use Temporary Registers
  Avoids “spilling” and “filling”
Register Frame allocated to a nested procedure
  Allocate up to 96 registers from (GR32 … GR127)
  Specify number of required registers for:
    Local ⎯ private registers for use by procedure
    Output ⎯ parameter/return passing
Implementation of Stacked General Registers
On procedure calls and returns
  Allocate temporary physical registers
  Current Frame Marker (CFM)
    Stacked General Registers allocated to called procedure
  Previous Frame Marker (PFM)
    Stacked General Registers allocated to calling procedure
  sof ⎯ size of frame (local + output registers)
  sol ⎯ size of local
Implementation in hardware
  Invisible to application programs
  Rename temporary registers to standard register set
Called procedure always sees:
  32 Static General Registers: GR0 ⎯ GR31
  Stacked General Registers: GR32, GR33, ... , GR(32+sof−1)
Stacked General Registers ⎯ Example
Register Rotation
Modulo loop scheduling
  Execute loop iterations in parallel
  Loop iteration starts before previous iteration finishes
  Traditionally requires loop unrolling
    Write repeated code instances, instead of writing a loop
Register Rotation
  Use multiple physical registers
  Rename multiple registers to same name
  Provide every iteration with its own set of registers
    Each instance of loop sees same register names
  Avoids unrolling
Virtual Addressing by Region
IA-64 address space
  3-bit Virtual Region Number (VRN)
  61-bit address within Virtual Region (VR)
    2^61 byte address space in VR
    Divided into pages by the OS
3-bit Virtual Region Number (VRN)
  VRN is an index into a Region Register Table (RRT)
  RRT defines 8 Virtual Region Identifier (VRI) entries
  24-bit VRI identifies one of 2^24 Virtual Regions
Total address space = 2^(64 – 3 + 24) = 2^85 bytes
Virtual Addressing
Virtual Addressing
64-bit address divided into 3 fields
  3-bit VRN points to 1 of 8 Virtual Regions
  VR has 61-bit address (64-3 = 61), divided into pages by the OS
Supported page sizes
  4KB, 8KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB, 256MB, 4GB
  Page Offset = 12, 13, 14, 16, 18, 20, 22, 24, 26, 28, 32 bits
  VPN = 49, 48, 47, 45, 43, 41, 39, 37, 35, 33, 29 bits
Effective IA-64 address space is 85 bits: 64-3+24 = 85
Address fields (high to low):
  Virtual Region Number (3 bits) | Virtual Page Number (OS dependent) | Page Offset (OS dependent)
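The three-field split can be sketched for any page size. The function name and the example address are made up for illustration; the only inputs taken from the slide are the 3-bit VRN in the top bits and the fact that the VPN and page offset together fill the remaining 61 bits.

```python
def split_virtual_address(address, page_offset_bits):
    """Split a 64-bit virtual address into (VRN, VPN, page offset).

    VRN is bits 63..61, the page offset is the low page_offset_bits bits,
    and the VPN is the 61 - page_offset_bits bits in between.
    """
    vpn_bits = 61 - page_offset_bits
    vrn = (address >> 61) & 0x7
    vpn = (address >> page_offset_bits) & ((1 << vpn_bits) - 1)
    offset = address & ((1 << page_offset_bits) - 1)
    return vrn, vpn, offset


# 16 KB pages: 14-bit page offset, 47-bit VPN
example = (5 << 61) | (0x1234 << 14) | 0x0ABC
vrn, vpn, offset = split_virtual_address(example, page_offset_bits=14)
print(vrn, hex(vpn), hex(offset))      # 5 0x1234 0xabc
```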
Itanium 2
Itanium with enhancements:
• 6 integer execution units (up from 4)
• 2 Load + 2 Store units (up from 2 Load/Store)
• Move L3 cache onto silicon die (on chip)
• I/O clock is 400 MHz (up from 266 MHz)
• 128-bit data I/O (up from 64-bit)
• I/O rate of 6.4 GB/s (up from 2.1 GB/s)