William Stallings
Computer Organization and Architecture
Chapter 13
Instruction Level Parallelism and Superscalar Processors
What is Superscalar?
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Equally applicable to RISC & CISC
- In practice usually RISC
Why Superscalar?
- Most operations are on scalar quantities (see RISC notes)
- Improve these operations to get an overall improvement
General Superscalar Organization
Superpipelined
- Many pipeline stages need less than half a clock cycle
- Double internal clock speed gets two tasks per external clock cycle
- Superscalar allows parallel fetch and execute
Superscalar v Superpipeline
Limitations
- Instruction level parallelism
- Compiler based optimisation
- Hardware techniques
- Limited by
  - True data dependency
  - Procedural dependency
  - Resource conflicts
  - Output dependency
  - Antidependency
True Data Dependency
- ADD r1, r2 (r1 := r1+r2;)
- MOVE r3, r1 (r3 := r1;)
- Can fetch and decode second instruction in parallel with first
- Can NOT execute second instruction until first is finished
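The ADD/MOVE pair above can be checked mechanically. The sketch below is a minimal illustration, not part of the original slides: the `Instr` record and the `has_true_dependency` helper are hypothetical names, and each instruction is assumed to have one destination register.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, a tuple of sources."""
    text: str
    dest: str
    srcs: tuple

def has_true_dependency(first: Instr, second: Instr) -> bool:
    """True if `second` reads the register that `first` writes (read-after-write)."""
    return first.dest in second.srcs

add = Instr("ADD r1, r2", dest="r1", srcs=("r1", "r2"))
move = Instr("MOVE r3, r1", dest="r3", srcs=("r1",))

print(has_true_dependency(add, move))   # True: MOVE must wait for ADD's result
print(has_true_dependency(move, add))   # False: ADD reads nothing MOVE writes
```

Both instructions can still be fetched and decoded together; the check only says that MOVE cannot execute until ADD has produced r1.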
Procedural Dependency
- Can not execute instructions after a branch in parallel with instructions before a branch
- Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
- This prevents simultaneous fetches
Resource Conflict
- Two or more instructions requiring access to the same resource at the same time
  - e.g. two arithmetic instructions
- Can duplicate resources
  - e.g. have two arithmetic units
Dependencies
Design Issues
- Instruction level parallelism
  - Instructions in a sequence are independent
  - Execution can be overlapped
  - Governed by data and procedural dependency
- Machine Parallelism
  - Ability to take advantage of instruction level parallelism
  - Governed by number of parallel pipelines
Instruction Issue Policy
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions change registers and memory
In-Order Issue In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- May fetch >1 instruction
- Instructions must stall if necessary
In-Order Issue In-Order Completion (Diagram)
In-Order Issue Out-of-Order Completion
- Output dependency
  - R3 := R3 + R5; (I1)
  - R4 := R3 + 1; (I2)
  - R3 := R5 + 1; (I3)
  - I2 depends on result of I1: data dependency
  - If I3 completes before I1, R3 ends up holding I1's result instead of I3's: output (write-write) dependency
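The effect of the output dependency can be shown by replaying the register writes of I1 to I3 in two different completion orders. This is a minimal sketch under stated assumptions, not a model of any real pipeline: the `run` helper and the starting register values are made up for illustration, and operands are assumed to be read in program order so that only the order of the writes changes.

```python
def run(completion_order, regs):
    """Apply each instruction's register write in the given completion order.

    Assumption for illustration: operands were read at issue, in program
    order, so only the order of the writes differs between runs.
    """
    regs = dict(regs)
    # Values each instruction computes when operands are read in program order.
    r3_after_i1 = regs["R3"] + regs["R5"]      # I1: R3 := R3 + R5
    results = {
        "I1": ("R3", r3_after_i1),
        "I2": ("R4", r3_after_i1 + 1),         # I2: R4 := R3 + 1
        "I3": ("R3", regs["R5"] + 1),          # I3: R3 := R5 + 1
    }
    for name in completion_order:
        dest, value = results[name]
        regs[dest] = value
    return regs

start = {"R3": 10, "R4": 0, "R5": 2}           # arbitrary example values
in_order = run(["I1", "I2", "I3"], start)
out_of_order = run(["I3", "I1", "I2"], start)
print(in_order["R3"])      # 3: I3's value, as the program intends
print(out_of_order["R3"])  # 12: I1's late write has overwritten I3's value
```

R4 is the same in both runs; only the final contents of R3 differ, which is why the hardware must keep the two writes to R3 in program order.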
In-Order Issue Out-of-Order Completion (Diagram)
Out-of-Order Issue Out-of-Order Completion
- Decouple decode pipeline from execution pipeline
- Can continue to fetch and decode until this pipeline is full
- When a functional unit becomes available an instruction can be executed
- Since instructions have been decoded, processor can look ahead
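The look-ahead idea can be sketched as a greedy scheduler over a window of decoded instructions. This is an illustration under simplifying assumptions, not a description of any particular processor: `schedule` is a hypothetical helper, every instruction takes one cycle, the window holds the whole program, and only true dependencies are checked (as if registers had already been renamed).

```python
def schedule(program, units=2):
    """Greedy out-of-order issue: each cycle, issue up to `units` waiting
    instructions whose source values are ready.

    `program` is a list of (dest, sources) pairs in program order.
    Returns a dict mapping instruction index -> issue cycle (starting at 1).
    """
    # producers[i] = earlier instructions whose results instruction i reads
    last_writer, producers = {}, []
    for i, (dest, srcs) in enumerate(program):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    issue_cycle = {}
    cycle = 0
    while len(issue_cycle) < len(program):
        cycle += 1
        # Ready = not yet issued, and every producer issued in an earlier cycle.
        ready = [i for i in range(len(program))
                 if i not in issue_cycle
                 and all(issue_cycle.get(p, cycle) < cycle for p in producers[i])]
        for i in ready[:units]:
            issue_cycle[i] = cycle
    return issue_cycle

program = [
    ("R3", ("R3", "R5")),   # index 0: R3 := R3 + R5
    ("R4", ("R3",)),        # index 1: R4 := R3 + 1
    ("R3", ("R5",)),        # index 2: R3 := R5 + 1
    ("R7", ("R3", "R4")),   # index 3: R7 := R3 + R4
]
print(schedule(program))            # index 2 issues in cycle 1, ahead of index 1
print(schedule(program, units=1))   # one unit: one instruction per cycle
```

With two units the third instruction is issued ahead of the second, because the second is still waiting for the first. With a single unit the same program needs four cycles instead of three.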
Out-of-Order Issue Out-of-Order Completion (Diagram)
Antidependency
- Read-write dependency
  - R3 := R3 + R5; (I1)
  - R4 := R3 + 1; (I2)
  - R3 := R5 + 1; (I3)
  - R7 := R3 + R4; (I4)
  - I3 can not complete before I2 starts as I2 needs a value in R3 and I3 changes R3
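The three kinds of register dependency in the I1 to I4 fragment can be listed with a pairwise check. This is a minimal sketch: the `hazards` helper and the tuple layout are assumptions made for illustration. The check is deliberately conservative, since it compares every earlier/later pair and ignores intervening writes, so it also reports I1 to I4 on R3 even though I3 redefines R3 in between.

```python
from itertools import combinations

# (name, destination, sources) for the fragment above
program = [
    ("I1", "R3", ("R3", "R5")),
    ("I2", "R4", ("R3",)),
    ("I3", "R3", ("R5",)),
    ("I4", "R7", ("R3", "R4")),
]

def hazards(earlier, later):
    """Name the register dependencies of `later` on `earlier` (program order)."""
    _, e_dest, e_srcs = earlier
    _, l_dest, l_srcs = later
    found = []
    if e_dest in l_srcs:
        found.append("true (read-after-write)")
    if l_dest in e_srcs:
        found.append("anti (write-after-read)")
    if l_dest == e_dest:
        found.append("output (write-after-write)")
    return found

# Conservative: every pair is compared, with no tracking of intervening writes.
report = {(a[0], b[0]): hazards(a, b) for a, b in combinations(program, 2)}
for pair, kinds in report.items():
    if kinds:
        print(pair, kinds)
```

The entry for (I2, I3) is the antidependency described above: I3 overwrites R3, which I2 still has to read.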
Register Renaming
- Output and antidependencies occur because register contents may not reflect the correct ordering from the program
- May result in a pipeline stall
- Registers allocated dynamically
  - i.e. registers are not specifically named
Register Renaming example
- R3b := R3a + R5a (I1)
- R4b := R3b + 1 (I2)
- R3c := R5a + 1 (I3)
- R7b := R3c + R4b (I4)
- Without subscript refers to logical register in instruction
- With subscript is hardware register allocated
- Note R3a, R3b, R3c
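The renaming in the example can be reproduced with a small table that tracks the current version of each logical register. This is a sketch under stated assumptions rather than how any particular processor does it: the `rename` function and the letter-suffix naming are illustrative, and a real processor allocates from a finite pool of hardware registers.

```python
from collections import defaultdict
from string import ascii_lowercase

def rename(program):
    """Give every register write a fresh name: R3 -> R3a, R3b, R3c, ...

    `program` is a list of (dest, sources) pairs; sources that are not
    registers (e.g. the literal "1") are passed through unchanged.
    Assumption: at most 26 versions per register (suffixes a..z).
    """
    version = defaultdict(int)      # logical register -> current version index

    def current(reg):
        return reg + ascii_lowercase[version[reg]]

    renamed = []
    for dest, srcs in program:
        # Read sources first, so they see the mapping from before this write.
        new_srcs = [current(s) if s.startswith("R") else s for s in srcs]
        version[dest] += 1          # allocate a fresh register for the result
        renamed.append((current(dest), new_srcs))
    return renamed

program = [
    ("R3", ["R3", "R5"]),   # I1
    ("R4", ["R3", "1"]),    # I2
    ("R3", ["R5", "1"]),    # I3
    ("R7", ["R3", "R4"]),   # I4
]
for dest, srcs in rename(program):
    print(f"{dest} := {' + '.join(srcs)}")
```

After renaming, I3 writes R3c while I1 writes R3b and I2 reads R3b, so the output dependency and the antidependency on R3 are gone; only the true dependencies remain.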
Machine Parallelism
- Duplication of Resources
- Out of order issue
- Renaming
- Not worth duplicating functions without register renaming
- Need instruction window large enough (more than 8)
Branch Prediction
- 80486 fetches both next sequential instruction after branch and branch target instruction
- Gives two cycle delay if branch taken
RISC - Delayed Branch
- Calculate result of branch before unusable instructions pre-fetched
- Always execute single instruction immediately following branch
- Keeps pipeline full while fetching new instruction stream
- Not as good for superscalar
  - Multiple instructions need to execute in delay slot
  - Instruction dependence problems
- Revert to branch prediction
Superscalar Execution
Superscalar Implementation
- Simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values
- Mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order
Required Reading
- Stallings chapter 13
- Manufacturers' web sites
- IMPACT web site
  - research on predicated execution