
Microprocessors (A): From the 386 to the Pentium 4

Dr. Martin Land, Hadassah College, Spring 2004

Intel Processors

from 386 to Pentium 4



386



Intel 80386 Microprocessor



[Figure: 80386 block diagram. Bus Interface Unit (address, data, control); Paging Unit (physical address); Segmentation Unit with shadow registers (linear address); Instruction Prefetch; Instruction Decoder; Decode and Sequencing (microcode); ALU; Registers; status flags; ALU (data) bus. The code stream is a linear byte sequence from CS; the decoder supplies code address displacements and the ALU supplies the effective address (offset).]

Simplified 386 Microprocessor

• Prefetch loads instruction bytes whenever there are no data accesses.
• Decoder identifies instruction boundaries and sends displacements to Address Management.
• Decode/Sequence generates microcode for instruction execution.
• ALU sends Effective Address to Address Management for data access.
• Address Management handles segmentation and paging.
• Registers are updated in the last step.




Problems with Pipelining the 386

• No Internal Data Cache
  • All data accesses are external (slow)
  • Unified memory access causes structural hazard on data accesses
• Instruction dependencies
  • Load+ALU operations stall during load of data operands
  • Conditional branches read status flags set by ALU instructions
  • Load+ALU operations use register-based pointers (depend on previous write-backs)
  • Branches cause a flush of the Instruction Prefetch queue



486



Upgrade of 80386

New Features in 486:
• Pipelines 386 instruction execution
• Floating-Point Unit (FPU) integrated on-chip
• 8 or 16 KB L1 data cache on chip
• Support for external L2 data cache
• Multiprocessor support
• Support for battery operated notebook PC



Pipeline Organization

Each pipeline stage executes in one clock cycle

[Figure: 486 pipeline. Instruction Fetch (instruction address to Instruction Memory), Stage-1 Decode, Stage-2 Decode, Execute (data address to Data Memory), Write Back, with a forwarding path.]



Five Pipelined Stages
• Instruction Prefetch (PF)
• Stage-1 Decode (D1)
  • Instruction identification
  • Identify source operands:
    • Identify register source
    • Calculate Effective Address for Data Memory (cache) source
• Stage-2 Decode (D2)
  • Complete complex Effective Addresses
  • Generation of microcode
• Execution (EX)
  • Integer ALU
  • FP ALU
  • Data memory writes
    • Write to fast memory buffer
    • Buffer updates cache
• Register Writeback (WB)



486 Internal Organization

[Figure: 486 internal organization. Bus Interface Unit (BIU), Cache, Instruction Prefetch, Decoder, MMU, ALU, FPU.]



Intel Architecture Floating Point Unit (FPU)

• 8087 numeric processor
  • Separate 8086 integer CPU and 8087 FPU
• 387 DX and SX math coprocessors
  • Implement the final IEEE STD 754
  • Added new trigonometric instructions
• 486 processor FPU
  • On-chip equivalent of the Intel 387 DX math coprocessor
  • IEEE STD 754
• Pentium FPU
  • Completely redesigned FPU
  • Conformance to both the IEEE STD 754 and 854
  • Algorithms with three times the performance of 486
  • Shortcut cost Intel a lot of money to correct



Support for Battery Operated Notebook PC
• System Management Mode (SMM)
  • Special purpose interrupt
  • Address space for storing processor state
  • Transparent to OS and applications software
• Stop Clock States
  • Initiated by external signal (hardware control)
  • “Fast Wake-Up” Stop Grant state: stops processor I/O operations
  • “Slow Wake-Up” Stop Clock state: CLK frequency → 0 MHz
• Auto Halt Power Down
  • Similar to Stop Clock
  • Initiated by HALT instruction (software control)
• Dynamic Local Power Management
  • Subsystems switch themselves off when not needed



Pentium



Superscalar Architecture

• Two integer instruction pipelines
  • “U” pipe can execute all integer instructions
  • “V” pipe can execute “simple integer” instructions
• Floating Point Unit integrated with integer pipelines
• Each pipeline can issue most instructions in one clock cycle
  • Instruction issue: instruction execution stage (after fetch and decode have completed)



Instruction Pairing

• Process of issuing two instructions in parallel
• When instructions are paired:
  • First instruction issued to the U-pipe
  • Next sequential instruction issued to V-pipe
• Pairing not possible if:
  • The instructions have dependencies
  • Either instruction is complex

[Figure: timing over clock cycles 1 to 5 showing instructions I1 to I4 in the U-pipe and V-pipe.]



Pipeline Stages for Pentium (without MMX)

  PF  Prefetch
  D1  Instruction Decode
  D2  Address Generate
  EX  Execution
  WB  Write Back

[Figure: a common PF stage (fed from RAM) followed by separate D1, D2, EX, and WB stages for the U pipe and the V pipe.]

Pipeline stages are very similar to 486 stages (not identical)



Pentium Block Diagram



Integer Instruction Pairing Rules

Pairing: two instructions issued on the same clock cycle (one to U-pipe and one to V-pipe)

Pairing requires the following conditions:
1. Both instructions must be “simple”
2. No RAW or WAW register dependencies between instructions
3. Register dependencies include pointers and flags
4. Neither instruction contains both displacement and immediate
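The four pairing conditions can be sketched as a small predicate. This is only an illustration: the `Instr` record, its field names, and the choice to model the flags as a pseudo-register called `"FLAGS"` are assumptions made here, not part of the slides, and the real Pentium applies further special cases that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical description of one decoded instruction."""
    text: str
    reads: frozenset = frozenset()    # registers read, including pointers and "FLAGS"
    writes: frozenset = frozenset()   # registers written, including "FLAGS"
    simple: bool = True
    has_displacement: bool = False
    has_immediate: bool = False


def can_pair(u: Instr, v: Instr) -> bool:
    """Apply the four slide conditions to a U-pipe/V-pipe candidate pair."""
    if not (u.simple and v.simple):          # rule 1: both simple
        return False
    if u.writes & v.reads:                   # rules 2 and 3: RAW (pointers and flags included)
        return False
    if u.writes & v.writes:                  # rules 2 and 3: WAW
        return False
    if any(i.has_displacement and i.has_immediate for i in (u, v)):
        return False                         # rule 4: displacement plus immediate
    return True


mov = Instr("MOV EAX, EBX", reads=frozenset({"EBX"}), writes=frozenset({"EAX"}))
add = Instr("ADD ECX, EDX", reads=frozenset({"ECX", "EDX"}),
            writes=frozenset({"ECX", "FLAGS"}))
bump = Instr("ADD EBX, 4", reads=frozenset({"EBX"}),
             writes=frozenset({"EBX", "FLAGS"}), has_immediate=True)
load = Instr("MOV EAX, [EBX]", reads=frozenset({"EBX"}), writes=frozenset({"EAX"}))
store = Instr("MOV [EBP-02], 0001", reads=frozenset({"EBP"}),
              has_displacement=True, has_immediate=True)

print(can_pair(mov, add))    # independent simple instructions
print(can_pair(bump, load))  # RAW through the EBX pointer
print(can_pair(store, add))  # displacement and immediate together
```

Treating the flags and pointer registers as ordinary entries in the read and write sets keeps rule 3 from needing separate code.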



Branch Prediction ⎯ 1

• Branch Target Buffer (BTB) is a special cache that stores information about branch instructions:
  • Source address (identifies particular branch instruction)
  • Target address (“jump to” address)
  • 2 history bits provide 4 states:
      (11) strongly taken
      (10) weakly taken
      (01) weakly not taken
      (00) strongly not taken
• On a branch instruction, BTB makes predictions about branches:
  • Branch Taken or Branch Not Taken (by high order history bit)
  • Target address (if Taken)
• D1 decoder (Stage 2) reads prediction from the BTB
• Instructions are fetched according to prediction
• Branches that “miss” in BTB are treated as not taken
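The two history bits behave like a saturating counter. A minimal sketch, assuming the counter moves one step toward 11 on a taken branch and one step toward 00 on a not-taken branch (the function names are illustrative):

```python
STATE_NAMES = {
    0b11: "strongly taken",
    0b10: "weakly taken",
    0b01: "weakly not taken",
    0b00: "strongly not taken",
}


def predict_taken(history: int) -> bool:
    """The prediction is the high-order history bit."""
    return bool(history & 0b10)


def update(history: int, taken: bool) -> int:
    """Move one step up on taken, one step down on not taken, saturating at 11 and 00."""
    return min(history + 1, 0b11) if taken else max(history - 1, 0b00)


state = 0b11
state = update(state, taken=False)
print(STATE_NAMES[state], predict_taken(state))
```

One wrong outcome from a strong state only weakens the prediction; it takes two in a row to reverse it.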



Branch Prediction ⎯ 2

Branch Predictions are verified in EX or WB

On first verification of a branch instruction:

If Not Taken, no BTB entry is made

If Taken, the BTB creates a new entry:
• Instruction address of branch instruction
• Branch target address
• Prediction that branch is strongly taken




On subsequent executions of the same branch instruction:
• When branch instruction enters D1,
  • D1 decoder reads the prediction from the BTB
  • On a Not Taken prediction, the next instruction in the Sequential Prefetch Buffer is sent to D1
  • On a Taken prediction, the Prediction Prefetch Buffer prefetches and sends instructions to D1
• When branch instruction enters EX,
  • The branch is verified as Taken or Not Taken
  • On correct prediction
    • U-pipe and V-pipe continue
    • BTB entry is updated (history bits adjusted up or down)
  • On misprediction
    • Both pipelines are flushed
    • BTB entry is updated (history bits and branch target)



For typical loops, branches are mispredicted:
• On the first run (BTB miss ⇒ mispredicted as not taken)
• On the last run (mispredicted as taken)

Example: loop runs 400h = 1024 (decimal) times
• On first run of JLE FOO
  • BTB miss ⇒ mispredicted as not taken
  • 3 stall cycles for pipeline flush
  • BTB entry for JLE FOO created as strongly taken
• On next 1022 runs of JLE FOO
  • BTB correctly predicts as taken with no stall cycles
• On last run of JLE FOO
  • BTB hit ⇒ mispredicted as taken
  • 3 stall cycles for pipeline flush

        MOV [EBP-02], 0001
  FOO:  INC [EBP-02]
        CMP [EBP-02], 0400
        JLE FOO
  NEXT: ADD EAX, EBX




Example with nested loops:

        MOV [EBP-02], +01
  FOO:  MOV [EBP-04], +00
  BAR:  INC [EBP-04]
        CMP [EBP-04], +03
        JLE BAR
        INC [EBP-02]
        CMP [EBP-02], +03
        JLE FOO
  NEXT: ADD EAX, EBX
        SUB EDX, ECX
        ADD EAX, EBX

• On first run, JLE BAR misses in BTB
  • Mispredicted as not taken
  • New BTB entry as strongly taken
  • 3 stall clocks
• On following runs, JLE BAR predicted as taken
  • Correctly predicted, until end of loop
  • No stall clocks
• At end of inner loop
  • Flushed with 3 stall clocks
  • Marked weakly taken in BTB
• On next FOO loop, JLE BAR predicted as (weakly) taken
  • Correctly predicted, until end of loop
  • At end of inner loop
    • Flushed with 3 stall clocks
    • Marked weakly taken in BTB
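Both loop examples can be replayed through a toy BTB to count mispredictions. The sketch below assumes only the rules stated on these slides (a miss predicts not taken, an entry is created as strongly taken on the first taken outcome, history bits saturate, 3 stall cycles per flush). The `replay` function and the trace encoding are illustrative names, and branches are keyed by their text rather than by address.

```python
FLUSH_PENALTY = 3       # stall cycles per misprediction, as in the examples
STRONGLY_TAKEN = 0b11


def replay(trace):
    """Replay (branch, taken) outcomes; return (mispredictions, stall cycles, BTB)."""
    btb = {}            # branch -> 2-bit history; no entry means a BTB miss
    mispredictions = 0
    for branch, taken in trace:
        history = btb.get(branch)
        predicted_taken = history is not None and history >= 0b10
        if predicted_taken != taken:
            mispredictions += 1
        if history is None:
            if taken:                         # entry is created only for a taken branch
                btb[branch] = STRONGLY_TAKEN
        else:
            btb[branch] = min(history + 1, 3) if taken else max(history - 1, 0)
    return mispredictions, mispredictions * FLUSH_PENALTY, btb


# Simple loop: JLE FOO is taken 1023 times, then falls through once.
simple_loop = [("JLE FOO", True)] * 1023 + [("JLE FOO", False)]

# Nested loops: per outer pass JLE BAR is taken 3 times, then not taken;
# JLE FOO is taken after the first two outer passes and not taken after the third.
nested_loops = []
for outer in range(3):
    nested_loops += [("JLE BAR", True)] * 3 + [("JLE BAR", False)]
    nested_loops.append(("JLE FOO", outer < 2))

print(replay(simple_loop)[:2])   # (2, 6): first and last run mispredicted
print(replay(nested_loops)[:2])  # (6, 18): JLE BAR four times, JLE FOO twice
```

In the nested case the inner branch pays one flush on every exit from the inner loop, because its entry is only weakened to "weakly taken" and still predicts taken when the loop is re-entered.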



Integrated On-Chip Split Cache — 1

• Separate code and data caches integrated on-chip
• Each cache is 8 Kbytes in size
• 32-byte line (block) size
• 2-way set associative
• Each cache has a dedicated TLB
  • TLB = translation look-aside buffer, which caches linear address to physical address translations



Integrated On-Chip Split Cache — 2

• Data cache has two ports (one for each pipe)
• The cache tags are triple ported
  • Allow three simultaneous inquire cycles: U-pipe, V-pipe and I/O unit
• Code cache closely integrated with
  • Branch prediction hardware
  • Prefetch buffers
• Not all main memory must be (can be) cached
• Instructions can be fetched from code cache or directly from main memory



MMX



MMX™ Technology Programming Environment

• MMX = Multimedia Extensions
• Vector extensions to 32-bit Intel Architecture (IA32)
• SIMD execution model
  • Single-instruction
  • Multiple-data
• MMX ALU integrated into Pentium pipeline
• MMX instructions added to ISA
  • No new mode or operating system visible state
  • All existing software runs as before



Single Instruction, Multiple Data (SIMD) Execution Model

Similar to Very Long Instruction Word (VLIW) machine



Add Packed

PADDB: Add Packed Bytes with Wraparound

PADDSB: Add Packed Signed Bytes with Signed Saturation

PADDUSB: Add Packed Unsigned Bytes with Unsigned Saturation

[Figure: SRC and DEST are 64-bit operands divided into eight byte lanes (bits 63..56, 55..48, 47..40, 39..32, 31..24, 23..16, 15..8, 7..0). Each SRC byte is added to the corresponding DEST byte and the eight sums are written to DEST.]
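The three byte-wise add variants differ only in what happens when a lane overflows. The following sketch emulates them on 64-bit integers. It is a plain Python model of the lane arithmetic, not how the hardware is built, and the helper names are made up for illustration.

```python
LANES = 8  # eight byte lanes in a 64-bit MMX operand


def _lanes(value):
    """Split a 64-bit value into eight unsigned bytes, lane 0 = bits 7..0."""
    return [(value >> (8 * i)) & 0xFF for i in range(LANES)]


def _pack(lanes):
    """Reassemble eight bytes into a 64-bit value."""
    return sum((b & 0xFF) << (8 * i) for i, b in enumerate(lanes))


def _signed(byte):
    """Interpret an unsigned byte as a two's complement value."""
    return byte - 256 if byte >= 128 else byte


def paddb(dest, src):
    """Wraparound: each lane is added modulo 256, with no carry between lanes."""
    return _pack([(d + s) & 0xFF for d, s in zip(_lanes(dest), _lanes(src))])


def paddsb(dest, src):
    """Signed saturation: each lane is clamped to -128..127."""
    sums = [_signed(d) + _signed(s) for d, s in zip(_lanes(dest), _lanes(src))]
    return _pack([max(-128, min(127, x)) for x in sums])


def paddusb(dest, src):
    """Unsigned saturation: each lane is clamped to 0..255."""
    return _pack([min(255, d + s) for d, s in zip(_lanes(dest), _lanes(src))])


print(hex(paddb(0xFF, 0x01)))    # lane 0 wraps to 0x00, lane 1 is unaffected
print(hex(paddusb(0xFF, 0x01)))  # lane 0 saturates at 0xff
print(hex(paddsb(0x7F, 0x01)))   # lane 0 saturates at 0x7f (+127)
```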



Packed Multiply and Add (pmadd)

DEST[31..0] ← (DEST[15..0] × SRC[15..0]) + (DEST[31..16] × SRC[31..16])

DEST[63..32] ← (DEST[47..32] × SRC[47..32]) + (DEST[63..48] × SRC[63..48])
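The two formulas above can be checked with a short emulation. The sketch assumes the four 16-bit words are treated as signed, as in the IA-32 instruction PMADDWD; the slide does not name that mnemonic, and the function names here are illustrative.

```python
def _s16(value):
    """Interpret the low 16 bits of value as a signed word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pmadd(dest, src):
    """Multiply four word pairs and add adjacent products into two doublewords."""
    def word(v, i):
        return _s16(v >> (16 * i))

    low = word(dest, 0) * word(src, 0) + word(dest, 1) * word(src, 1)
    high = word(dest, 2) * word(src, 2) + word(dest, 3) * word(src, 3)
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


dest = 0x0004_0003_0002_0001   # words (low to high): 1, 2, 3, 4
src = 0x0008_0007_0006_0005    # words (low to high): 5, 6, 7, 8
result = pmadd(dest, src)
print(result & 0xFFFFFFFF)     # 1*5 + 2*6 = 17
print(result >> 32)            # 3*7 + 4*8 = 53
```

This is the inner step of a dot product, which is why the instruction is useful in multimedia filters.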



P6 Architecture for Pentium II, III, 4



P6 Architecture

• New hardware architecture
• Supports IA32 Instruction Set Architecture (ISA)
  • Instructions, registers, data types, addressing modes, etc.
  • From outside, P6 looks like any other IA32 machine
• Internal operations in a RISC core machine
• First introduced in the Pentium Pro (1995)
• Architectural basis for Pentium II, III, and 4



Main P6 Architecture Features

• Internal RISC core machine
  • IA32 instructions recompiled to RISC ISA
• Greater ILP than in Pentium
  • ILP = Instruction Level Parallelism
• Deeper branch prediction than in Pentium
  • Larger branch cache in BTB
• Out-of-order instruction execution
  • Instructions run through pipeline in most convenient order
  • Not in the program listing order



P6 Architecture Subsystems



P6 Instruction Fetch

[Figure labels: I/O Operations; Memory Access; Fetch; Instruction Cache Updates; Data Cache Updates]



P6 Instruction Pool

Pooling: pool of micro-operations which can be executed in any convenient order



P6 Instruction Execution

• Find an Instruction Ready to Execute
• Return Executed Instruction to Pool
• Data Reads



P6 Retirement

• Retire Finished Instructions in Original Program Order
• Data Writes
• Register Updates



P6 Subsystems

• Memory Subsystem
• Processing Units
  • Fetch/Decode Unit
  • Dispatch/Execute Unit
  • Retire Unit
  • Instruction Pool
• IA 32 Registers



Memory Subsystem

• System Bus
  • External computer memory bus
  • Connection to main RAM
  • 36-bit address bus (physical address space of 64 GBytes)
• L2 cache: unified 256 KB
• Bus Interface Unit (BIU): controls L1 access to L2 and RAM
• L1 cache
  • 8 KB data
  • 8 KB instruction



IA32 Register File

• IA32 instruction set defines familiar registers
• Standard register set since 386
• Using IA32 registers is required for instruction set compatibility
• IA32 registers are used as P6 source and destination operands
• Internal calculations use a larger RISC-type register set (not visible to programmer)



Processing Units

• Fetch/Decode Unit
  • Fetches IA 32 instructions
  • Converts each IA32 instruction to one or more (RISC-type) micro-ops
  • Places independent micro-ops into Instruction Pool
• Dispatch/Execute Unit executes micro-ops
  • Identifies dependencies
  • Performs branch prediction
  • Chooses instructions which are ready to execute
  • Returns results to Instruction Pool
• Retire Unit
  • Converts micro-op results back to IA 32 format
  • Preserves original program order
  • Updates IA 32 registers



Micro-Operations (Micro-Ops)

• Independent RISC-like primitive instructions
• Triadic instructions
  • Two logical sources and one logical destination
  • “logical source” = not visible to programmer
• Each simple IA32 instruction is converted into one micro-op (example: MOV AX,BX)
• Complex instructions are decoded into 2 to 4 micro-ops (example: MOV AX,[BX+SI+78])
• Very complex instructions are decoded into preprogrammed micro-op sequences (example: SQRT)



Register Alias Table (RAT)

• Last stage in decoding process
• Aliases IA 32 register references to GP registers
• Adds status bits to micro-ops to aid scheduling
• Passes micro-ops to the Instruction Pool
• No instruction reordering yet
• Micro-op stream is a RISC-equivalent of the decoded IA32 instruction stream



Dynamic Execution

• Micro-ops not executed in original program order
• Micro-ops are executed when ready
  • All source operands are available

Requires three conceptual ingredients:

• Deep Branch Prediction
• Dynamic Data Flow Analysis
• Speculative Execution



Deep Branch Prediction

• Extends Pentium branch prediction:
  • Predicts branches to several nested levels
  • Requires larger statistical record than Pentium
• Implemented in instruction fetch/decode unit
• Includes branches, calls, and interrupts



Dynamic Data Flow Analysis

• Monitors micro-ops
• Looks for data and register dependencies
• Locates any micro-ops ready for execution (Ready = all source operands are available)
• Enables out-of-order execution
• Keeps the execution units busy



Speculative Execution

• Execute instructions ahead of the program counter
• Execute instructions before “normal fetch” time
• Branch Prediction determines most likely instructions for execution
• Store results in temporary registers
• Some executed instructions will never be used
• Commit the result of each instruction
  • Only if the speculation is a correct prediction
  • In the original program order



Register Dependencies

• IA 32 has 8 “general purpose” registers
• Small register set can cause data hazard stalls

      MOV BX, [SI+1234]
      ADD BX, [BX]

• Decoding to micro-ops aliases IA-32 registers
  • 40 general purpose 32-bit registers in RISC core
  • Decoder assigns a RISC register to an IA-32 register
  • Can assign multiple GP registers to one IA-32 register
  • Can prevent dependencies
  • Handle integers and floating point data
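Aliasing can be pictured as a lookup table that hands out a fresh internal register for every write. The sketch below is a simplified model under stated assumptions: the class and method names are invented, the 40-register pool size comes from the slide, and freeing registers at retirement is left out. It shows that reuse of a register name (WAW and WAR) disappears after renaming, while a true RAW dependency such as `ADD BX, [BX]` reading the result of the preceding `MOV` is preserved.

```python
IA32_REGISTERS = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


class RegisterAliasTable:
    """Toy RAT: maps each IA-32 register name to its current internal register."""

    def __init__(self, num_physical=40):
        # Assumption: the 8 architectural registers start out in internal registers 0..7.
        self.alias = {name: i for i, name in enumerate(IA32_REGISTERS)}
        self.free = list(range(len(IA32_REGISTERS), num_physical))

    def rename(self, dest, sources):
        """Rename one micro-op; return (internal dest, list of internal sources)."""
        internal_sources = [self.alias[s] for s in sources]  # read the current aliases first
        internal_dest = self.free.pop(0)                      # every write gets a fresh register
        self.alias[dest] = internal_dest
        return internal_dest, internal_sources


rat = RegisterAliasTable()
first = rat.rename("EBX", ["ESI"])    # EBX <- load [ESI+1234]
second = rat.rename("EBX", ["EBX"])   # EBX <- EBX + [EBX]: reads the first result (RAW kept)
third = rat.rename("EBX", ["EAX"])    # unrelated reuse of EBX: gets its own register
print(first, second, third)
```

Because the third write lands in a different internal register, it no longer has to wait for the earlier uses of EBX to finish.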






Pentium II, III, 4



Pentium II

Pentium Pro with MMX™ Technology

Pipeline Sections Renamed
• Fetch/Decode Unit → In-Order Issue Front-end
• Dispatch/Execute Unit → Out-of-Order Core
• Retire Unit → In-Order Retirement unit



Pentium III

• Maintains the P6 architecture
• Supports all IA-32 features up to Pentium II
• Pentium II with Streaming SIMD Extensions (SSE)
  • Floating Point version of MMX
  • Single Instruction Multiple Data (vector) FPU

Note:
• All RISC-type processors perform integer instructions very efficiently.
• As multimedia programming became more important in the 1990s, the measure of processor speed shifted to Floating Point efficiency.



Pentium 4

• Maintains the P6 architecture
• Supports all IA-32 features up to Pentium III
• Redesign of the P6 pipeline model
• Netburst Micro-Architecture
  • Superpipelining
  • Deeper Branch Prediction
  • Front End Pipeline Cache Subsystem
  • Quad-Pumped I/O Bus



Superpipelining Technique

Divide each stage into 2 stages:

• Each stage does half the work
• Each stage requires half the time
• Double the clock rate (divide the clock cycle time): τ → τ/2

Normal pipeline (5 stages, cycle time τ):

  Clock  1   2   3   4   5   6   7
  I1     S1  S2  S3  S4  S5
  I2         S1  S2  S3  S4  S5
  I3             S1  S2  S3  S4  S5

Superpipeline (10 stages, cycle time τ/2):

  Clock  1   2   3   4   5   6   7   8   9   10  11  12
  I1     S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
  I2         S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
  I3             S1  S2  S3  S4  S5  S6  S7  S8  S9  S10

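Normal pipeline (5 stages):

$$ CPI^{normal}_{ideal} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 4}{IC} \xrightarrow{\; IC \text{ large} \;} 1 $$

$$ \text{Run-Time}^{normal}_{ideal} = IC \times CPI^{normal}_{ideal} \times \tau \xrightarrow{\; IC \text{ large} \;} IC \times \tau $$

Superpipeline (10 stages):

$$ CPI^{super}_{ideal} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 9}{IC} \xrightarrow{\; IC \text{ large} \;} 1 $$

$$ \text{Run-Time}^{super}_{ideal} = IC \times CPI^{super}_{ideal} \times \frac{\tau}{2} \xrightarrow{\; IC \text{ large} \;} IC \times \frac{\tau}{2} = \frac{1}{2}\, \text{Run-Time}^{normal}_{ideal} $$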
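A quick numeric check of the ideal run-time comparison: a stall-free pipeline with k stages needs IC + k − 1 cycles for IC instructions, so 5 stages give IC + 4 and 10 stages give IC + 9. The function name and the instruction count used below are illustrative choices.

```python
def ideal_run_time(instruction_count, stages, cycle_time):
    """Run time of a stall-free pipeline: (IC + stages - 1) cycles of cycle_time each."""
    cycles = instruction_count + stages - 1
    return cycles * cycle_time


TAU = 1.0            # cycle time of the normal pipeline, arbitrary units
IC = 1_000_000       # a large instruction count

normal = ideal_run_time(IC, stages=5, cycle_time=TAU)
superpipelined = ideal_run_time(IC, stages=10, cycle_time=TAU / 2)

print(ideal_run_time(3, 5, TAU))        # 7 cycles of length tau, as in the 5-stage diagram
print(ideal_run_time(3, 10, TAU / 2))   # 12 half-length cycles = 6 units of tau
print(superpipelined / normal)          # approaches 0.5 as IC grows
```

For short instruction streams the extra fill time of the deeper pipeline is visible; the factor of two only appears in the limit of large IC.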



Superpipelining in Pentium 4

• Rapid Execution Engine
  • Higher clock speed
• Hyper Pipelined Technology
  • 20 stage pipeline (double the Pentium III pipeline length)
  • Each stage does less processing work
• Typical instruction requires same processing time
  • Half the time in each stage
  • Double the number of stages
• Ideally, doubles number of instructions finished per second
  • Finish one instruction per cycle
  • Twice the cycles per second



Deeper Branch Prediction

• Expanded Branch Target Buffer (BTB)
  • 4 K-entries
  • Was 256 in Pentium
• Expanded Instruction Pool
  • 126 instructions in various stages of execution
  • Was 40 in Pentium Pro
• Improved branch prediction algorithm



New Instruction Cache Subsystem

• Called Front End Pipeline
• IA-32 Instruction Cache is extended to 128 byte line-size
  • Was 32 bytes in Pentium II/III
• Caching for Micro-ops
  • Trace = micro-op sequence for one IA-32 instruction
  • A Trace Cache stores decoded micro-op traces
  • Loops re-use cached traces
  • Skips additional decode of same IA-32 instructions



Quad-Pumped I/O Bus

• New organization of I/O bus
• Bus cycles determined by 100 MHz clock
• Can make 4 transfers per bus cycle
  • 4 transfers/cycle × 100 MHz = 400 M-Transfers per second
• Data bus width of 8 bytes (64 bits)
  • 8 bytes/transfer × 400 M-Transfers per second = 3200 MB/second = 3.2 GB/second



Superpipelining

• Rapid Execution Engine
  • Higher clock speed
  • ALU operations take ½ clock cycle
• Hyper Pipelined Technology
  • 20 stage pipeline (double the Pentium III pipeline length)
  • Each stage does less processing work
• Typical instruction requires same processing time
  • Half the time in each stage
  • Double the number of stages
• Ideally, doubles number of instructions finished per second
  • Finish one instruction per cycle
  • Twice the cycles per second



Pentium 4 Performance Issues

• Advertising
  • Pentium 4 processors have very high clock speeds
  • Range from 1.4 GHz to 4 GHz
• Reality
  • Higher clock speeds result from superpipelining
  • Clock speed has a different meaning than in P II/III
• How should we compare clock speeds?
  • Expectation: 1.5 GHz processor is 50% faster than same processor at 1.0 GHz
  • Measurement: 1.5 GHz Pentium 4 is 20% faster than 1.0 GHz Pentium III



Problems With Superpipelining

• Not all operations can be divided into smaller stages
  • PUSH/POP can easily be superpipelined: split single PUSH stage into
    1. SP-- stage
    2. [SP] ← value stage
  • IMUL/DIV/CMP may be harder to split
• Some stalls depend on clock cycles and on real time
  • Pipeline flush: clock runs at twice the speed
    • Superpipeline has twice as many instructions in pipeline
    • Twice as many wasted pipeline cycles were run and cancelled
    • Pipeline flush penalty does not scale with clock frequency
  • Cache penalties depend on reaction time of memory



CPIstall and Effective Clock Rate

• Suppose that, on average, every 2nd instruction will stall for 1 cycle
  • CPIstall (Pentium-4) ≈ 0.5 cycles/instruction (plus Pentium-III stalls)
  • CPItotal (Pentium-4) ≈ 1.5 cycles/instruction (plus Pentium-III stalls)
• Effective Pentium-4 clock rate ≈ (Pentium-4 clock rate)/1.5
• Pentium-4 with 1.5 GHz clock has the effective clock of a Pentium-III with 1.0 GHz clock

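With clock rate $R = 1/\tau$:

$$ \text{Run-Time} = IC \times CPI^{super}_{total} \times \tau = IC \times CPI^{super}_{total} \times \frac{1}{R} = IC \times \left(1 + CPI^{super}_{stall}\right) \times \frac{1}{R} = IC \times 1 \times \frac{1}{R_{effective}} $$

$$ R_{effective} = \frac{R}{1 + CPI^{super}_{stall}} \qquad \text{(Effective Clock Rate)} $$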
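Plugging the slide's numbers into the effective clock rate formula, as a minimal sketch (the function name is illustrative):

```python
def effective_clock_rate(clock_rate_hz, cpi_stall):
    """R_effective = R / (1 + CPI_stall): the rate at which a stall-free machine would match."""
    return clock_rate_hz / (1.0 + cpi_stall)


pentium4_clock = 1.5e9     # 1.5 GHz
cpi_stall = 0.5            # every 2nd instruction stalls for 1 cycle

print(effective_clock_rate(pentium4_clock, cpi_stall) / 1e9, "GHz")
```

With these assumptions the 1.5 GHz part does the same work per second as a stall-free 1.0 GHz part, which is the comparison the next slide builds on.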



Fair Comparison

Accounting for the different meanings of clock speed:

• Compared to the 1.0 GHz Pentium-III,
  • 1.5 GHz Pentium-4 is 20% faster on SPECint2000
  • 1.5 GHz Pentium-4 is 75% faster on SPECFP2000
• Speed-up is result of the architectural enhancements
• A very reasonable performance improvement
• Code compiled with Pentium-4 optimization is faster than older code



Intel Itanium IA-64



Itanium Overview

Intel's 64-bit architectural plan

• Goals of the Itanium architecture
  • Support 64-bit addresses
  • IA-32 backward compatibility
  • Increase instruction level parallelism (ILP)
  • Improve branch handling
  • Reduce hardware burden using compile-time information
  • Improve floating point performance
• New Methodology: Explicitly Parallel Instruction Computing (EPIC)
  • Compiler identifies instruction dependencies
  • Compiler reschedules instructions for optimized execution
  • Compiler groups instructions for parallel issue to Execution Units
  • Hard work done once by compiler (not each time by hardware)



Operating Environments



Data Types

• Pointers: 8 bytes
• Integer: 1, 2, 4, and 8 bytes (byte, word, doubleword, quadword)
• Floating Point: single, double and double-extended formats



New Features in Itanium Instruction Set

• RISC-like syntax
  • Load-Store architecture
  • Uniform instruction length (41 bits)
• Explicit instruction parallelism
  • Compiler chooses instructions to run in parallel
  • Compiler provides hints to the processor
  • Predication replaces branching
• More flexible use of registers
  • 128 integer and floating-point registers
  • Register renaming replaces “spill and fill” on calls
  • Register rotation allows parallelization of loops



Instruction Format

General Syntax:

[(qp)] mnemonic[.comp1][.comp2] dests = srcs

  (qp)             qualifying predicate register
  mnemonic         name identifying the instruction
  [comp1][comp2]   completers indicate optional variations on basic mnemonic
  dests, srcs      source operands are registers or immediates;
                   destination is typically a register



Instruction Format Examples

• Simple Instruction
      add r1 = r2, r3
      r1 ← r2 + r3
• Instruction with Immediate
      add r1 = r2, r3, 1
      r1 ← r2 + r3 + 1
• Instruction with Completer
      cmp.eq p3 = r2, r4
      if (r2 eq r4) then p3 ← 1
• Predicated Instruction
      (p4) add r1 = r2, r3
      if (p4=1) then {r1 ← r2 + r3}
      if (p4=0) then {NOP}



Predication Replaces Conditional Branches

Conditional execution of predicated instructions

• Example: if (p5) r1 = r2 + r3
  • Executes ADD if p5 = 1
  • Executes NOP if p5 = 0
• Predicate registers
  • 64 predicate registers: pr0 through pr63
  • Set/Clear by compare instructions
• Advantages over conditional branch
  • Eliminate misprediction penalties
  • Allow larger parallel instruction blocks (no dependencies)



Predication Example

High level code

      if (a > b) c = c + 1
      else d = d * e + f

Predicated code

      pT, pF = compare(a > b)
      if (pT) c = c + 1
      if (pF) d = d * e + f

• Compare sets pT or pF
• Compiler schedules the two if instructions in parallel
  • No conditional branch
  • No misprediction penalty on either outcome
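The predicated version can be modelled in a few lines. In this sketch both operations are always issued and the predicate only decides whether the result is committed, so the instruction stream has no conditional branch to predict. Python's conditional expressions stand in for the hardware's nullification of the predicated-off instruction; the function name is illustrative.

```python
def predicated_update(a, b, c, d, e, f):
    """Model of the predicated code: one compare sets complementary predicates."""
    p_true = a > b           # pT
    p_false = not p_true     # pF

    # Both instructions are issued; each commits only if its predicate is 1.
    c = c + 1 if p_true else c          # (pT) c = c + 1
    d = d * e + f if p_false else d     # (pF) d = d * e + f
    return c, d


print(predicated_update(a=2, b=1, c=10, d=3, e=4, f=5))   # pT = 1: only c changes
print(predicated_update(a=1, b=2, c=10, d=3, e=4, f=5))   # pF = 1: only d changes
```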



Explicitly Parallel Instruction Computing (EPIC)

• Very Long Instruction Word (VLIW) format
  • Instruction Bundle: 3 instructions in a VLIW
  • Instruction Group: 1 or more instruction bundles
• Instruction Group
  • No data dependencies among instructions
  • May be executed in parallel (according to program logic)
• At compile time, compiler
  • Identifies data dependencies
  • Forms Instruction Bundles
  • Marks Instruction Groups
  • Determines ordering of instruction execution



Instruction Bundles

• Instruction Bundle
  • Three Instructions and a Template Field
  • 16 byte length: 3 × 41 bits + 5 bits = 128 bits
  • Aligned at 16-byte boundaries in memory
  • Contain no RAW or WAW dependencies
• Template Field
  • Maps each instruction to Execution Unit type
  • Integer (I), Float (F), Memory (M), Branch (B)
  • Processor executes the three instructions in parallel
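The bundle arithmetic (3 × 41 + 5 = 128) can be made concrete by packing and unpacking a bundle as an integer. The slides give only the field sizes; placing the template in the low 5 bits with the three slots above it is an assumption of this sketch, and the function names are illustrative.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into 128 bits."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need exactly three 41-bit slots")
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, 3])
print(unpack_bundle(bundle))
print(pack_bundle(TEMPLATE_MASK, [SLOT_MASK] * 3).bit_length())  # 128 bits = 16 bytes
```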



Instruction Groups

• Instruction Group
  • Sequence of Instruction bundles
  • Instructions without RAW or WAW dependencies
  • At least one instruction
  • Number of instructions is not limited
• Instruction groups end at
  • Branch instructions
  • Cycle Breaks (;;): inserted by compiler to indicate data hazards (dependencies)
• Processor seeks to execute all bundles in a group in parallel



Instruction Groups ⎯ Example

      r1 = r2 + r3
      r4 = r5 + r6
      r7 = r8 + r9 ;;
      r10 = r4 + r11
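The grouping in this example can be reproduced with a simple greedy pass that starts a new group whenever an instruction has a RAW or WAW dependency on the current group. This is only a sketch of the idea: a real compiler also reorders instructions, and the `(dest, sources)` tuple encoding is an assumption made here.

```python
def split_into_groups(instructions):
    """Greedy grouping: close the current group before any RAW or WAW dependency.

    Each instruction is (dest, sources). Returns a list of groups; the boundary
    between two groups is where the compiler would place a cycle break (;;).
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        raw = any(s in written for s in sources)   # reads a value written in this group
        waw = dest in written                      # rewrites a register written in this group
        if raw or waw:
            groups.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r8", "r9")),
    ("r10", ("r4", "r11")),   # reads r4, written earlier: needs a new group
]
for group in split_into_groups(program):
    print([dest for dest, _ in group])
```

The first three adds are independent and share a group; the fourth reads r4, so the cycle break falls exactly where the example places `;;`.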



General Registers

• General Registers
  • 128 registers: GR0 through GR127
  • 64-bit width
  • NaT (Not a Thing) bit: marks deferred speculative exceptions
• Two Subsets
  • GR0 through GR31: Static General Registers
    • GR0 always holds zero as source
    • Write to GR0 causes an Illegal Operation fault
  • GR32 through GR127: Stacked General Registers
    • Available to application program
    • Acquire by allocating a Register Stack Frame
    • Act as Local and Output Registers



“Spilling” and “Filling” Problem

• In an IA-32 procedure call
  • Calling procedure uses IA-32 registers
  • Called procedure needs the same registers
• Procedure call causes many memory accesses
  • Calling procedure saves register values in memory
  • Calling task passes parameters by pushing to stack
  • Called task returns by register or by stack
  • Calling task restores its previous register state



Stacked General Registers

• Stacked General Registers
  • Procedure calls use temporary registers
  • Avoids “spilling” and “filling”
• Register Frame allocated to a nested procedure
  • Allocate up to 96 registers from (GR32 … GR127)
• Specify number of required registers for:
  • Local: private registers for use by procedure
  • Output: parameter/return passing



Implementation of Stacked General Registers

• On procedure calls and returns
  • Allocate temporary physical registers
  • Current Frame Marker (CFM): Stacked General Registers allocated to called procedure
  • Previous Frame Marker (PFM): Stacked General Registers allocated to calling procedure
  • sof: size of frame (local + output registers)
  • sol: size of local
• Implementation in hardware
  • Invisible to application programs
  • Rename temporary registers to standard register set
• Called procedure always sees:
  • 32 Static General Registers: GR0 through GR31
  • Stacked General Registers: GR32, GR33, ..., GR(32+sof−1)



Stacked General Registers ⎯ Example



Register Rotation

• Modulo loop scheduling
  • Execute loop iterations in parallel
  • Loop iteration starts before previous iteration finishes
  • Traditionally requires loop unrolling: write repeated code instances, instead of writing a loop
• Register Rotation
  • Use multiple physical registers
  • Rename multiple registers to same name
  • Provide every iteration with its own set of registers
  • Each instance of loop sees same register names
  • Avoids unrolling



Virtual Addressing by Region

• IA-64 address space
  • 3-bit Virtual Region Number (VRN)
  • 61-bit address within Virtual Region (VR)
    • 2^61 byte address space in VR
    • Divided into pages by the OS
• 3-bit Virtual Region Number (VRN)
  • VRN is an index into a Region Register Table (RRT)
  • RRT defines 8 Virtual Region Identifier (VRI) entries
  • 24-bit VRI identifies one of 2^24 Virtual Regions
• Total address space = 2^(64 − 3 + 24) = 2^85 bytes



Virtual Addressing



Virtual Addressing

• 64-bit address divided into 3 fields
  • 3-bit VRN points to 1 of 8 Virtual Regions
  • VR has 61-bit address (64 − 3 = 61), divided into pages by the OS
• Supported page sizes (Page Offset bits = log2 of page size; VPN bits = 61 − Page Offset bits)

  Page size          4KB  8KB  16KB  64KB  256KB  1MB  4MB  16MB  64MB  256MB  4GB
  Page Offset bits   12   13   14    16    18     20   22   24    26    28     32
  VPN bits           49   48   47    45    43     41   39   37    35    33     29

• Effective IA-64 address space is 85 bits: 64 − 3 + 24 = 85

  | Virtual Region Number (3 bits) | Virtual Page Number (OS dependent) | Page Offset (OS dependent) |
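The field split and the 85-bit figure can be checked with a short sketch. The region register table is modelled as a plain list of eight 24-bit identifiers; the function names and the sample values are illustrative and not taken from the slides.

```python
REGION_BITS = 3
REGION_ADDRESS_BITS = 64 - REGION_BITS     # 61-bit address inside a virtual region
VRI_BITS = 24


def split_virtual_address(address, page_offset_bits):
    """Split a 64-bit virtual address into (VRN, VPN, page offset)."""
    vrn = address >> REGION_ADDRESS_BITS
    offset = address & ((1 << page_offset_bits) - 1)
    vpn_bits = REGION_ADDRESS_BITS - page_offset_bits
    vpn = (address >> page_offset_bits) & ((1 << vpn_bits) - 1)
    return vrn, vpn, offset


def global_address(address, region_registers):
    """Replace the 3-bit VRN by the 24-bit VRI it selects, giving an 85-bit address."""
    vrn = address >> REGION_ADDRESS_BITS
    within_region = address & ((1 << REGION_ADDRESS_BITS) - 1)
    return (region_registers[vrn] << REGION_ADDRESS_BITS) | within_region


address = (5 << 61) | (0x1234 << 12) | 0xABC          # region 5, page 0x1234, offset 0xABC
print(split_virtual_address(address, page_offset_bits=12))   # 4 KB pages

region_registers = [0] * 8
region_registers[5] = 0xABCDEF                         # hypothetical 24-bit region identifier
print(global_address(address, region_registers).bit_length() <= REGION_ADDRESS_BITS + VRI_BITS)
```

Changing `page_offset_bits` to any value in the Page Offset row moves the boundary between VPN and offset without affecting the 3-bit region number.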



Itanium 2

Itanium with enhancements:
• 6 integer execution units (up from 4)
• 2 Load + 2 Store units (up from 2 Load/Store)
• Move L3 cache onto silicon die (on chip)
• I/O clock is 400 MHz (up from 266 MHz)
• 128-bit data I/O (up from 64-bit)
• I/O rate of 6.4 GB/s (up from 2.1 GB/s)
