Post on 29-Aug-2020
Lecture 5: EIT090 Computer Architecture
Anders Ardö
EIT – Electrical and Information Technology, Lund University
Sept. 30, 2009
A. Ardö, EIT Lecture 5: EIT090 Computer Architecture Sept. 30, 2009 1 / 62
Outline
1 Reiteration
2 Dynamic scheduling - Tomasulo
3 Superscalar, VLIW
4 Speculation
5 ILP limitations
6 What we have done so far
Instruction Level Parallelism - ILP
ILP: Overlap execution of unrelated instructions: Pipelining
Two main approaches:
  DYNAMIC ⇒ hardware detects parallelism
  STATIC ⇒ software detects parallelism
Often a mix between both.
Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
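As a worked illustration of the CPI formula, the sketch below adds the average stall cycles per instruction to the ideal CPI. The function name and all numbers are made-up example values, not measurements from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI: ideal CPI plus the average stall cycles per
    instruction contributed by each hazard class."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical pipeline: ideal CPI of 1 and assumed average stalls per instruction.
cpi = pipeline_cpi(1.0, 0.05, 0.20, 0.15)

# How much faster the same pipeline would be if every stall were removed.
speedup_if_stall_free = cpi / 1.0
```

Every technique in this lecture attacks one of the three stall terms: scheduling reduces data hazard stalls, branch prediction reduces control stalls.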
Why loop unrolling works
Longer sequences of straight-line code without branches (longer basic blocks) allow for easier static rescheduling by the compiler
Longer basic blocks also facilitate dynamic rescheduling such as Scoreboard and Tomasulo’s algorithm
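The transformation itself can be sketched as follows. The example is written in Python only to show the shape of the code: the function names are hypothetical, and the performance benefit applies to compiled machine code, not to the Python interpreter. The unrolled body contains four independent multiply-adds and one loop branch, so the basic block is four times longer.

```python
def saxpy_rolled(a, x, y):
    """One multiply-add per iteration; every iteration ends in a loop branch."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_unrolled4(a, x, y):
    """Same computation unrolled by a factor of 4 (factor chosen arbitrarily)."""
    n = len(x)
    out = [0.0] * n
    i = 0
    while i + 4 <= n:
        # Four independent operations in one basic block: free to be rescheduled.
        out[i] = a * x[i] + y[i]
        out[i + 1] = a * x[i + 1] + y[i + 1]
        out[i + 2] = a * x[i + 2] + y[i + 2]
        out[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    while i < n:
        # Clean-up loop for the remaining 0 to 3 elements.
        out[i] = a * x[i] + y[i]
        i += 1
    return out
```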
Dynamic Branch Prediction
Branches limit performance because of:
  Branch penalties
  Limits to available Instruction Level Parallelism
Solution: Dynamic branch prediction to predict the outcome of conditional branches.
Benefits:
  Reduce the time until the branch condition is known
  Reduce the time to calculate the branch target address
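One common way to predict branch outcomes dynamically is a table of 2-bit saturating counters indexed by low-order bits of the branch address. The slide does not name a specific scheme, so the sketch below is an assumption chosen for illustration; the class name, table size and indexing are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters.
    Counter values 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries  # start in "weakly not taken"

    def _index(self, pc):
        # Assume 4-byte instructions: drop the two low bits, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def count_mispredictions(predictor, pc, outcomes):
    """Predict, compare with the real outcome, then train the counter."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

# A loop branch: taken 9 times, then falls through; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x400, outcomes)
```

With two bits of state, a single loop exit does not flip the prediction, so the second run of the loop mispredicts only at its exit.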
Dependencies
Two instructions must be independent in order to execute in parallel
There are three general types of dependencies that limit parallelism:
  Data dependencies
  Name dependencies
  Control dependencies
Dependencies are properties of the program
Whether a dependency leads to a hazard or not is a property of the pipeline implementation
Scoreboard pipeline
The goal of scoreboarding is to maintain an execution rate of one instruction per clock cycle by executing an instruction as early as possible.
Instructions execute out-of-order when there are sufficient resources and no data dependencies.
A scoreboard is a hardware unit that keeps track of
  the instructions that are in the process of being executed,
  the functional units that are doing the executing,
  and the registers that will hold the results of those units.
A scoreboard centrally performs all hazard detection and resolution and thus controls the instruction progression from one step to the next.
Summary
ILP:
  Rescheduling and loop unrolling are important to take advantage of potential Instruction Level Parallelism
Dynamic instruction scheduling
  An alternative to compile-time scheduling
  Does not need recompilation to increase performance
  Used in most new processor implementations
Dynamic Branch Prediction
  Reduces branch penalties by early prediction of conditional branch outcomes
Lecture 5 agenda
Chapters 2.4-2.8, 3.1-3.4 in "Computer Architecture"
1 Reiteration
2 Dynamic scheduling - Tomasulo
3 Superscalar, VLIW
4 Speculation
5 ILP limitations
6 What we have done so far
Scoreboard pipeline
Issue: Decode and check for structural hazards
Read operands: wait until no data hazards, then read operands
All data hazards are handled by the scoreboard
Limitations with Scoreboard
The number of scoreboard entries (window size)
The number and types of functional units
Number of datapaths to registers
The presence of name dependencies
Tomasulo’s algorithm addresses the last two limitations.
Tomasulo’s Algorithm
Another dynamic instruction scheduling algorithm
For IBM 360/91, a few years after the CDC 6600 (Scoreboard)
Goal: High performance without compiler support
Differences between Tomasulo & Scoreboard:
  Control & Buffers distributed with FUs (called reservation stations) vs. centralized in Scoreboard
  Register names in instructions replaced by pointers to reservation station buffer (HW register renaming)
  Common Data Bus broadcasts results to all FUs
  Loads and Stores treated as FUs as well
Tomasulo Organization
Three Stages of Tomasulo Alg.
1. Issue – get instruction from FP Op Queue
  If reservation station free (no structural hazard), the instruction is issued together with its operands (renames registers)
2. Execution – operate on operands (EX)
  When both operands are ready, then execute; if not ready, watch Common Data Bus (CDB) for operands (snooping)
3. Write result – finish execution (WB)
  Write on CDB to all awaiting functional units; mark reservation station available
Normal bus: data + destination
Common Data Bus: data + source (snooping)
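The snooping step can be sketched with a small data structure. This is a minimal model, not the IBM 360/91 design: the field names Vj, Vk (operand values) and Qj, Qk (tags of the producing stations) follow the usual textbook convention, while the class, the helper function and the operand values are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # tag of the station that will produce the operand
    qk: Optional[str] = None

    def ready(self):
        """Execution may start when no operand is still waiting for a producer."""
        return self.qj is None and self.qk is None

    def snoop(self, tag, value):
        """Capture a CDB broadcast if its source tag matches a pending operand."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

def broadcast(stations, tag, value):
    """Common Data Bus: data + source tag, seen by every station at once."""
    for rs in stations:
        rs.snoop(tag, value)

# Mult1 waits for the load; Add1 waits for both the load and the multiply.
mult = ReservationStation("Mult1", "MULTD", vj=2.0, qk="Load1")
add = ReservationStation("Add1", "ADDD", qj="Mult1", qk="Load1")
stations = [mult, add]

broadcast(stations, "Load1", 3.0)        # the load writes its result
mult_ready_after_load = mult.ready()
add_ready_after_load = add.ready()

broadcast(stations, "Mult1", mult.vj * mult.vk)  # the multiply finishes
add_ready_after_mult = add.ready()
```

Because the bus carries the source tag rather than a destination register, one broadcast satisfies every waiting consumer at once.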
Tomasulo example, cycle 0
Tomasulo example, cycle 1
Tomasulo example, cycle 2
Tomasulo example, cycle 3
Tomasulo example, cycle 4
Tomasulo example, cycle 5
Tomasulo example, cycle 6
Tomasulo example, cycle 7
Tomasulo example, cycle 8
Tomasulo example, cycle 10
Elimination of WAR hazards
Example:
  LD   F6, 34(R2)
  ...  ...
  DIVD F10,F0,F6
  ADDD F6,F8,F2
ADDD can safely finish before DIVD has read register F6 because:
  DIVD has renamed its source register F6 to point at the reservation station (load buffer) that produces the value
  LD broadcasts its result on the Common Data Bus
Register renaming can thus be done:
  statically by the compiler
  dynamically by the hardware
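The effect of renaming on this example can be sketched as follows. The sketch gives every destination a fresh name (P0, P1, ... are hypothetical physical register names) and makes each source read the latest name of its register. Tomasulo uses reservation station tags instead of a separate physical register file, so this shows only the general idea; the elided instructions between LD and DIVD are left out.

```python
def rename(instructions):
    """instructions: list of (op, dest, sources) in program order.
    Returns the same program with every destination given a fresh name."""
    mapping = {}   # architectural register -> most recent fresh name
    renamed = []
    for counter, (op, dest, sources) in enumerate(instructions):
        # Sources read the newest producer; unmapped registers keep their name.
        new_sources = tuple(mapping.get(s, s) for s in sources)
        new_dest = f"P{counter}"
        mapping[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed

program = [
    ("LD", "F6", ("R2",)),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]
renamed = rename(program)
```

After renaming, DIVD reads P0 while ADDD writes P2, so the WAR hazard on F6 is gone and only the true dependency from LD to DIVD remains.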
Tomasulo example, cycle 11
Tomasulo example, cycle 15
Tomasulo example, cycle 16
Tomasulo example, cycle 56
Tomasulo example, cycle 57
Benefits Tomasulo
distributed hazard detection logic
distributed reservation stations
Common Data Bus (CDB) with snooping
elimination of WAR, WAW hazards (renaming registers)
Dynamic scheduling - summary
tolerates unpredictable delays
compile for one pipeline - run effectively on another
significant increase in HW complexity
out-of-order execution, completion
register renaming
Getting CPI < 1!
Issuing multiple instructions per clock cycle
Superscalar: varying number of instructions/cycle (1-8) scheduled by compiler or HW
  IBM Power5, Pentium 4, Sun SuperSparc, DEC Alpha
  Simple hardware, complicated compiler or...
  Very complex hardware but simple for compiler
Very Long Instruction Word (VLIW): fixed number of instructions (3-5) scheduled by the compiler
  HP/Intel IA-64, Itanium
  Simple hardware, difficult for compiler
  High performance through extensive compiler optimization
Approaches for multiple issue
             Issue    Hazard detection  Scheduling     Characteristics / examples
Superscalar  dynamic  HW                static         in-order execution; ARM
Superscalar  dynamic  HW                dynamic        out-of-order execution
Superscalar  dynamic  HW                dynamic        speculation; Pentium 4, IBM Power5
VLIW         static   compiler          static         TI C6x
EPIC         static   compiler          mostly static  Itanium
Very Long Instruction Word (VLIW)
A number of functional units that independently execute instructions in parallel.
The compiler decides which instructions can execute in parallel
No hazard detection needed
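What "the compiler decides" means can be sketched with a deliberately simple packer. It walks the program in order and starts a new long instruction word whenever the next instruction conflicts with one already in the word or the word is full. Real VLIW compilers reorder instructions (list scheduling, software pipelining) and respect functional unit types; the function name, the width of 3 and the conservative conflict test are assumptions for this example.

```python
def pack_bundles(instructions, width=3):
    """Greedy in-order packing of (op, dest, sources) tuples into bundles.
    An instruction joins the current bundle only if it has no RAW, WAW or
    WAR conflict with any instruction already placed in that bundle."""
    bundles = []
    current = []
    for instr in instructions:
        _, dest, sources = instr
        conflict = any(
            d == dest or d in sources or dest in s
            for _, d, s in current
        )
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append(instr)
    if current:
        bundles.append(current)
    return bundles

program = [
    ("LD", "F0", ("R1",)),
    ("LD", "F2", ("R2",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("MULTD", "F6", ("F8", "F10")),
    ("SUBD", "F8", ("F4", "F6")),
]
bundles = pack_bundles(program)

# Slots the compiler has to fill with no-ops.
empty_slots = sum(3 - len(b) for b in bundles)
```

The unused slots are the compiler-scheduled "bubbles": they occupy space in the instruction word even though they do no work.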
Itanium instruction format
Itanium architecture
Limits of VLIW
Limited Instruction Level Parallelism
  With n functional units and k pipeline stages we need n x k independent instructions to utilize the hardware
Memory and register bandwidth
  With increasing number of functional units, the number of ports needed at the memory or register file must increase to prevent structural hazards
Code size
  Compiler scheduled pipeline “bubbles” take up space in the instruction
  Need more aggressive loop unrolling to work well which also increases code size
No binary code compatibility
HW supported speculation
A combination of three main ideas:
  Dynamic instruction scheduling; take advantage of ILP
  Dynamic branch prediction; allows instruction scheduling across branches
  Speculative execution; execute instructions before all control dependencies are resolved
Hardware based speculation uses a data-flow execution: instructions execute when their operands are available
HW vs. SW speculation
Advantages of HW speculation:
  Dynamic runtime disambiguation of memory addresses
  Dynamic branch prediction is often better than static prediction, which limits the performance of SW speculation
  HW speculation can maintain a precise exception model
  Can achieve higher performance on older code (without recompilation)
Main disadvantage:
  Extremely complex implementation and extensive need for hardware resources
Tomasulo extended to handle speculation
Re-order buffer - ROB
Data structure
  entry  instruction type  destination  value  ready
  1
  2
  ...
  n
supports speculative execution
instructions commit in order
precise exceptions
Four steps of Speculative Tomasulo
1. Issue – get instruction from FP Op Queue
  If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer nr. for destination
2. Execution – operate on operands (EX)
  If both operands ready: execute; if not, watch CDB for result; when both operands are in reservation station: execute
3. Write result – complete execution
  Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available
4. Commit – update register with reorder result
  When instr. is at head of reorder buffer & result is present: update register with result (or store to memory) and remove instr. from reorder buffer (handle misspeculations and precise exceptions)
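The interplay of out-of-order completion and in-order commit can be sketched with a small reorder buffer model. Each entry has the fields instruction type, destination, value and ready; everything else (class and method names, committing as many entries as possible per call, the example values) is an assumption made for illustration. Reservation stations, stores and exceptions are left out.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest instruction at the left (head)
        self.registers = {}      # architectural state, updated only at commit

    def issue(self, kind, dest):
        """Allocate an entry in program order; None means the buffer is full."""
        if len(self.entries) == self.size:
            return None
        entry = {"type": kind, "dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Results may arrive in any order; they are only buffered here."""
        entry["value"] = value
        entry["ready"] = True

    def commit(self):
        """Retire ready instructions from the head, strictly in order."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed

rob = ReorderBuffer(size=4)
div = rob.issue("DIVD", "F0")    # long latency, issued first
add = rob.issue("ADDD", "F2")    # short latency, issued second

rob.write_result(add, 5.0)       # ADDD finishes first
first_commit = rob.commit()      # nothing retires: DIVD at the head is not ready

rob.write_result(div, 2.5)
second_commit = rob.commit()     # both retire, in program order
```

Since registers change only at commit, the architectural state always corresponds to a point in program order, which is what makes exceptions precise.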
Misspeculation!
Commit – branch prediction wrong
  When branch instr. is at head of reorder buffer & incorrect prediction:
    remove all instr. from reorder buffer (flush);
    restart execution at correct instruction
Expensive ⇒ try to recover as early as possible
Performance sensitive to branch prediction/speculation mechanism
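How sensitive performance is to the predictor can be estimated with a first-order model: every mispredicted branch adds a fixed flush-and-restart penalty. The model and all numbers below are assumptions for illustration, not figures from the lecture.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order estimate: base CPI plus the average misprediction
    cycles per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical machine: 20% branches, 15-cycle recovery penalty.
cpi_90_percent_accuracy = effective_cpi(1.0, 0.20, 0.10, 15)
cpi_95_percent_accuracy = effective_cpi(1.0, 0.20, 0.05, 15)
```

Under these assumed numbers, halving the misprediction rate removes half of the stall cycles, and a shorter recovery penalty would scale the remaining term in the same way.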
Multiple issue and speculation
Possible to extend Tomasulo with both multiple issue and speculation.
Major issues – instruction issue and monitoring CDB
Must be able to handle multiple commits
Alternative to Tomasulo is to use extra physical registers for both architecturally visible registers and temporary values with register renaming
Tomasulo speculation - increased complexity
Dynamic scheduling, speculation - summary
tolerates unpredictable delays
compile for one pipeline - run effectively on another
allows speculation
  multiple branches
  in-order commit
  precise exceptions
  time, energy; recovery
significant increase in HW complexity
out-of-order execution, completion
register renaming
ILP
How much performance can we get by utilizing ILP?
A model of an ideal processor
Provides a base for ILP measurements
No structural hazards
Register renaming – infinite virtual registers and all WAW & WAR hazards avoided
Machine with perfect speculation
  Branch prediction – perfect; no mispredictions
  Jump prediction – all jumps perfectly predicted
Memory-address alias analysis – addresses are known & a store can be moved before a load provided the addresses are not equal
Perfect caches
There are only true data dependencies left!
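On such an ideal machine the available ILP is the number of instructions divided by the length of the longest chain of true dependencies. The sketch below computes this data-flow limit for a short register-only instruction sequence. Unit latency for every instruction and the example program are assumptions; memory dependencies are not modelled.

```python
def dataflow_ilp(instructions):
    """instructions: list of (dest, sources) in program order.
    Assumes unit latency, unlimited functional units and perfect renaming,
    so only read-after-write dependencies delay an instruction."""
    ready_at = {}   # register -> cycle after which its newest value is available
    depth = 0       # length of the longest dependency chain seen so far
    for dest, sources in instructions:
        start = max((ready_at.get(s, 0) for s in sources), default=0)
        finish = start + 1
        ready_at[dest] = finish   # later readers depend on this newest writer
        depth = max(depth, finish)
    return len(instructions) / depth if depth else 0.0

program = [
    ("F0", ("R1",)),
    ("F2", ("R2",)),
    ("F4", ("F0", "F2")),
    ("F6", ("F4",)),
    ("F0", ("R3",)),   # reuses F0: a name dependency only, removed by renaming
    ("F8", ("F0",)),
]
ilp = dataflow_ilp(program)
```

Six instructions with a longest true-dependency chain of three give an ILP of 2 for this example; the reuse of F0 costs nothing because renaming removes it.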
Upper Limit to ILP
Impact window size
More realistic HW: Branch impact
More realistic HW: Register impact
Summary
Software (compiler) tricks:
  Loop unrolling
  Static instruction scheduling (with register renaming)
  ... and more
Hardware tricks:
  Dynamic instruction scheduling
  Dynamic branch prediction
  Multiple issue – Superscalar, VLIW
  Speculative execution
  ... and more
AMD Phenom CPU
Intel Core2
Intel Core i7 chip (Nehalem)