A Critical Look At IA-64
Massive Resources, Massive ILP, But Can It Deliver?
Martin Hopkins, IBM Research, 2/7/00
Sampoorani, Sivakumar and Joshua
Design decisions common to modern processors
  Pipelining
  Micro ops
  Large ROB (reorder buffer)
  Single path execution
  Dynamic scheduling
At what cost?
  Accurate branch prediction
  Dependency checking
  Register renaming
  Alias detection hardware
Performance of IA-64
  Execution time = Cycle Time * IC * CPI
  No improvement reported in frequency
  Possible reasons?
    – Reducing CPI at the cost of cycle time
    – Compares and branches in same cycle
    – Predicated execution
      => more FUs => more complexity + longer wires => limit on frequency
      => more power
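The execution-time relation on this slide can be tried out numerically. The sketch below is only an illustration: the function name and every sample number are hypothetical, not measurements of any processor. It shows how a lower CPI can be cancelled out when the path length (IC) grows and the cycle time does not improve.

```python
def execution_time(cycle_time_ns, instruction_count, cpi):
    """Execution time in ns = cycle time * instruction count * CPI."""
    return cycle_time_ns * instruction_count * cpi

# Hypothetical numbers chosen only to illustrate the trade-off.
baseline = execution_time(cycle_time_ns=1.0, instruction_count=1_000_000, cpi=1.0)

# Assume a wider machine: CPI falls by 20%, but the dynamic path length
# grows by 25% and the cycle time stays the same.
wide = execution_time(cycle_time_ns=1.0, instruction_count=1_250_000, cpi=0.8)

print(baseline, wide)
```

With these assumed inputs the two machines finish in essentially the same time, which is the concern the slide raises about trading CPI against the other two terms.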
Dynamic Path Length (IC)
  Longer than other architectures
  Reasons?
    – Speculation
      » Check operations and recovery code
    – Predication
    – No sign-extended loads
    – No integer multiply or divide
Dynamic Path Length (IC)
  Loads and stores: only post-execution update of base register
    ldsz.ldtype.ldhint r1 = [r3]          no base update form
    ldsz.ldtype.ldhint r1 = [r3], r2      register base update
    ldsz.ldtype.ldhint r1 = [r3], imm     immediate base update
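The three load forms above differ only in what happens to the base register after the access. The following is a minimal sketch of those semantics, assuming a dictionary-based register file and memory; the helper `ld` and the register name `tmp` are hypothetical and exist only for illustration.

```python
def ld(regs, mem, dst, base, update=None):
    """Model of the load forms: dst = mem[base], then an optional post-update.

    update is None (no base update), a register name (register base update),
    or an int (immediate base update). The base changes only AFTER the access.
    """
    regs[dst] = mem[regs[base]]
    if isinstance(update, str):
        regs[base] += regs[update]
    elif isinstance(update, int):
        regs[base] += update

# Walk an array with stride 8 using the immediate post-increment form.
mem = {100: 11, 108: 22, 116: 33}
regs = {"r1": 0, "r2": 8, "r3": 100}
total = 0
for _ in range(3):
    ld(regs, mem, "r1", "r3", 8)   # r1 = [r3], then r3 += 8
    total += regs["r1"]

# With no base+displacement form, reading mem[r3 + 8] without changing r3
# needs a separate add into a temporary register first.
regs["r3"] = 100
regs["tmp"] = regs["r3"] + 8       # the extra instruction
ld(regs, mem, "tmp_val", "tmp")
```

The loop benefits from the post-increment form, while the last three lines show where the extra add in the dynamic path comes from.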
CPI
  Cache effects
    – Larger code footprint
      » 128-bit bundle holds 3 instructions
      » Restrictions on placing instructions
      » Branch target must be at the beginning of a bundle
    – Recovery code
      » Pollutes I-cache and/or triggers page faults
    – Speculative loads pollute D-cache
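To make the code-footprint point concrete, here is a small sketch that packs an instruction stream into 3-slot bundles and pads with nops whenever a branch target has to start a new bundle. It is a simplification under stated assumptions: the function name and the sample instruction stream are hypothetical, template restrictions on slot types are ignored, and the comparison baseline is an assumed 32-bit fixed-width RISC encoding.

```python
def pack_into_bundles(instrs, slots_per_bundle=3):
    """Pack (name, is_branch_target) pairs into bundles, padding with nops.

    Only the 'branch target starts a bundle' rule is modelled here.
    """
    bundles, current = [], []
    for name, is_target in instrs:
        if is_target and current:
            # Close the partially filled bundle so the target starts a new one.
            current += ["nop"] * (slots_per_bundle - len(current))
            bundles.append(current)
            current = []
        current.append(name)
        if len(current) == slots_per_bundle:
            bundles.append(current)
            current = []
    if current:
        current += ["nop"] * (slots_per_bundle - len(current))
        bundles.append(current)
    return bundles

code = [("ld", False), ("add", False), ("cmp", False), ("br", False),
        ("L1:sub", True), ("st", False)]
bundles = pack_into_bundles(code)
bytes_ia64 = len(bundles) * 16   # 128 bits per bundle
bytes_risc = len(code) * 4       # assumed 32-bit fixed-width encoding
```

For this toy stream, six instructions occupy three bundles because the branch target forces two nops of padding after the branch.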
Stalls possible
  Example
    load ra =
    load rb = ;;       // end of bundle
    add  rx = ra
    load ry = [rb] ;;
  If load ra causes a cache miss, stall.
  Superscalar out-of-order processors can execute non-dependent instructions in parallel with the cache miss.
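The example above can be turned into a toy timing model. This is a sketch, not a model of any real pipeline: both functions, the instruction names, and the latencies (100 cycles for the missing load, 2 for a hit, 1 for the add) are assumptions chosen for illustration. Each inner list plays the role of what the slide calls a bundle, that is, the instructions between two `;;` markers.

```python
def in_order_finish_times(groups, latency):
    """Issue groups strictly in order; a group waits until every source
    register used by any instruction in it is ready."""
    ready, finish, cycle = {}, {}, 0
    for group in groups:
        srcs = [s for _, _, sources in group for s in sources]
        issue = max([cycle] + [ready[s] for s in srcs])
        for name, dest, _ in group:
            finish[name] = issue + latency[name]
            ready[dest] = finish[name]
        cycle = issue + 1
    return finish

def out_of_order_finish_times(groups, latency):
    """Idealised out-of-order core: each instruction starts as soon as its own
    sources are ready (unlimited issue width and window assumed)."""
    ready, finish = {}, {}
    for group in groups:
        for name, dest, sources in group:
            start = max([0] + [ready[s] for s in sources])
            finish[name] = start + latency[name]
            ready[dest] = finish[name]
    return finish

# (name, destination register, source registers)
groups = [
    [("load_ra", "ra", ()), ("load_rb", "rb", ())],
    [("add_rx", "rx", ("ra",)), ("load_ry", "ry", ("rb",))],
]
latency = {"load_ra": 100, "load_rb": 2, "add_rx": 1, "load_ry": 2}

in_order = in_order_finish_times(groups, latency)
ooo = out_of_order_finish_times(groups, latency)
```

In this model the add finishes at the same time on both machines, but `load_ry`, which does not depend on the miss, is held back with its group in the in-order case and completes early in the out-of-order case.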
Comparing Complexities
  Support for speculative execution
    – Superscalar processors
      » reorder buffer
      » register renaming hardware
    – EPIC
      » need to expose parallelism, speculation
      » hardware just does what the compiler says
IA-64: Exposing Speculative Execution
  Control speculation (moving loads above branches)
  Data speculation (moving loads above stores)
Control Speculation
  Hardware for deferring exceptions exposed to software
    – NaT (Not a Thing, or poison bits)
      » set NaT bit associated with a register on exception
      » perform an explicit check before using the register
    – Increase in machine state
      » 2 NaT registers
      » instructions to modify, test, and retrieve NaT values
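The deferred-exception mechanism described above can be sketched as a register file in which every register carries a poison bit. This is an illustrative model only: the class, its method names, and the convention that an address missing from the `memory` dictionary stands for a faulting access are all assumptions, not IA-64 definitions.

```python
class RegisterFile:
    """Toy register file where each register carries a NaT (poison) bit."""

    def __init__(self):
        self.value, self.nat = {}, {}

    def speculative_load(self, dst, memory, addr):
        # Speculative load: a faulting access sets NaT instead of raising.
        if addr in memory:
            self.value[dst], self.nat[dst] = memory[addr], False
        else:
            self.value[dst], self.nat[dst] = 0, True

    def add(self, dst, a, b):
        # NaT propagates through dependent computation.
        self.nat[dst] = self.nat[a] or self.nat[b]
        self.value[dst] = self.value[a] + self.value[b]

    def check(self, reg, recovery):
        # Explicit check: run recovery code only if the value is poisoned.
        if self.nat[reg]:
            recovery()

memory = {0x10: 5}
rf = RegisterFile()
rf.value["rd"], rf.nat["rd"] = 7, False
recovered = []

rf.speculative_load("ra", memory, 0x10)           # valid address
rf.add("rc", "ra", "rd")
rf.check("rc", lambda: recovered.append("good"))  # recovery not needed
first_rc = rf.value["rc"]

rf.speculative_load("ra", memory, 0x99)           # unmapped: exception deferred
rf.add("rc", "ra", "rd")
rf.check("rc", lambda: recovered.append("bad"))   # recovery runs
```

The second sequence shows the cost side of the slide: the poison bit, its propagation, and the check are all extra state and extra instructions that exist only to support the hoisted load.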
Data Speculation
  Explicit memory-alias-detection table
    – ALAT (Advanced Load Address Table)
      » advanced loads place their entries in the ALAT
      » stores remove the entry if addresses match
    – Hardware cost:
      » ALAT is 32-entry, 2-way set associative
      » recovery code requires that operands be maintained (until the store is seen)
      » increased register requirements (128 Int + 128 FP)
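The ALAT protocol on this slide (loads insert, matching stores remove, a later check looks the entry up) can be sketched in a few lines. This is a simplified model under stated assumptions: the class and method names are hypothetical, entries are matched on the full address, and the 32-entry, 2-way organisation is not modelled.

```python
class ALAT:
    """Toy Advanced Load Address Table: maps a target register to the
    address its advanced load read from. Capacity and associativity ignored."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, memory):
        self.entries[reg] = addr   # the load places its entry in the table
        return memory[addr]

    def store(self, addr, value, memory):
        memory[addr] = value
        # A store removes every entry whose address matches.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        # True if the speculatively loaded value is still valid.
        return reg in self.entries

memory = {0x20: 1, 0x28: 2}
alat = ALAT()
ra = alat.advanced_load("ra", 0x20, memory)  # load hoisted above the stores
alat.store(0x28, 9, memory)                  # different address: no conflict
no_alias_ok = alat.check("ra")
alat.store(0x20, 7, memory)                  # same address: entry removed
alias_ok = alat.check("ra")
if not alias_ok:
    ra = memory[0x20]                        # recovery: reload the value
```

The reload in the last line stands in for the recovery code, which is why the operands of the original load have to stay live until the check.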
Data Speculation Hardware Costs
  Increased register pressure implies
    – more state to be saved across functions
    – to avoid this:
      » Register stacking (similar to SPARC register windows)
        (0-31) global registers, others dynamically mapped
      » CFM (Current Frame Marker)
      » Register Stack Engine
  Should also handle stack overflows
  Additional complexity due to rotating registers
Hardware Costs
  Superscalar
    – Reorder buffer
    – Register rename mechanism
  EPIC
    – NaT bits, associated instructions
    – ALAT
    – Increased number of registers
    – Register Stack Engine
      » Additional complexities due to rotating registers, page faults, …
Runtime Information
  Information about behavior of programs
    – Can't be predicted at compile time
    – Profiling helps
      » But costly…
  Superscalar machines
    – Dynamic selection of instructions to execute
    – Rely upon information known at run time
EPIC
  Depends mostly on compiler
    – Run-time information is not used as much
  Consider the following code sequence
    cmp p1, p2 = ..                        /* set predicate registers */
    (p1) br.cond low_probability_path ;;   /* if (p1) goto ... */
    l    ra = [rb] ;;
    add  rc = ra, rd ;;
    use of (rc)
  4 bundles, load not hoisted over a branch (which is not usually taken)
As Scheduled by IA-64 Compiler
  Optimize for the most probable path
    l.s  ra = [rb] ;;
    add  rc = ra, rd
    cmp p1, p2 = ...
    (p1) br.cond low_probability_path ;;
    check.s rc, recovery_code
    use of (rc)
  3 bundles
When Low Probability Path Is Taken
  Superscalar processor
    – Execute the load as early as possible
    – Cancel if found to be mis-speculated
    – Change assumptions dynamically
  EPIC
    – Load has to complete since dependent add is in next bundle
    – May take 100s of cycles if the pointer is random
    – Heavy penalty if the compiler gets the probabilities wrong
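The penalty for a wrong branch probability can be illustrated with a simple expected-value calculation. This is a back-of-the-envelope sketch, not a result from the talk: the function, the cost model (one cycle saved on the common path, a fixed miss penalty paid on the taken path), and all numbers are hypothetical.

```python
def expected_gain(p_taken, cycles_saved, p_miss, miss_penalty):
    """Expected cycles gained per execution by hoisting a load above a branch.

    On the fall-through path the hoist saves `cycles_saved`. On the taken path
    the useless load still has to complete, costing `miss_penalty` cycles when
    it misses the cache (probability `p_miss`).
    """
    return (1 - p_taken) * cycles_saved - p_taken * p_miss * miss_penalty

# Assumed profile: the branch is taken 1% of the time.
predicted = expected_gain(p_taken=0.01, cycles_saved=1, p_miss=0.1, miss_penalty=200)

# Assumed real workload: the branch is actually taken 20% of the time.
actual = expected_gain(p_taken=0.20, cycles_saved=1, p_miss=0.1, miss_penalty=200)

print(predicted, actual)
```

Under these assumptions the hoist looks profitable with the profiled probability and becomes a net loss with the real one, and a statically scheduled machine cannot change the decision at run time.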
Dependence on Profiling
  RISC and CISC find profiling useful, but not essential
  IA-64 is much more dependent on profiling
  Difficulties involved with profiling
    – Additional responsibility for programmer
    – Creating a representative test suite
    – Using in demanding, diverse development environments
Code Bloat
  Source of growth                         Increase (%)
  RISC instructions                        50
  3 instructions per 128 bits              33
  Avg of 2 instructions per bundle         33
  Branch target at beginning of bundle     10
  Check ops
  Recovery code                            20
  No base+disp addressing                  15
  No sign-extended loads
  Predication
  Optimizations                            30
  IA-64 code should be 4.8 times x86 code
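One way to see how individually modest increases add up is to compound them. The sketch below assumes the listed percentages multiply, which the slide does not state, and it pairs each number with the label that precedes it on the slide, which is also an assumption about the original layout. The slide's own 4.8 figure is quoted as given and is not re-derived here.

```python
from math import prod

# Percent increases in slide order (label pairing is an assumption).
increases_pct = {
    "RISC instructions": 50,
    "3 instructions per 128 bits": 33,
    "avg of 2 instructions per bundle": 33,
    "branch target at beginning of bundle": 10,
    "check ops / recovery code": 20,
    "no base+disp addressing": 15,
    "optimizations": 30,
}

def compound(percentages):
    """Multiply (1 + p/100) factors: each increase applies on top of the others."""
    return prod(1 + p / 100 for p in percentages)

all_factors = compound(increases_pct.values())
without_opt = compound(p for k, p in increases_pct.items() if k != "optimizations")
print(round(all_factors, 2), round(without_opt, 2))
```

Under the multiplicative assumption the product is about 5.24 with every listed number and about 4.03 without the optimization item, the same order of magnitude as the figures quoted on the slides.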
Some things that may reduce code size
  Post-increment loads can eliminate an add in a loop
    – e.g. accessing an array in strides
  Combining a compare and a logical op
  r1 + r2 + 1
  Rotating register files for s/w pipelining
  All the above amount to <5% difference.
  So net code bloat is about 4 times (excluding optimization overhead).
  Code bloat => more memory b/w requirement.
Performance comparison
  800MHz Itanium
    – SPECint
      » <68% Alpha 21264 (1GHz) (20% less power)
      » <60% P4 (2GHz)
    – SPECfp
      » >20% Alpha 21264
      » >8% P4
  Power: a major hurdle
Conclusion
  The IA-64 gamble: power is not going to be a critical limitation in the future.
  This allows use of massive resources