Chapter 12 Reduced Instruction Set...

Yonsei University

Chapter 12

Reduced Instruction Set Computers


Contents
• Instruction Execution Characteristics
• The Use of a Large Register File
• Compiler-based Register Optimization
• Reduced Instruction Set Architecture
• RISC Pipelining
• MIPS R4000
• SPARC
• RISC versus CISC Controversy


Introduction

Major Advances in Computers
• The family concept
  – IBM System/360 (1964) and DEC PDP-8
  – Separates architecture from implementation
• Microprogrammed control unit
  – Produced by IBM S/360 (1964)
• Cache memory
  – IBM S/360 Model 85 (1969)
• Solid-state RAM
• Microprocessors
• Pipelining
  – Introduces parallelism into the fetch-execute cycle
• Multiple processors


The Next Step - RISC
• Reduced Instruction Set Computer
• Key features
  – Large number of general-purpose registers, or use of compiler technology to optimize register use
  – Limited and simple instruction set
  – Emphasis on optimizing the instruction pipeline


Comparison of Processors

Characteristic                       IBM 370/168  VAX 11/780   Intel 80486  SPARC        MIPS R4000   PowerPC      UltraSPARC   MIPS R10000
Class                                CISC         CISC         CISC         RISC         RISC         Superscalar  Superscalar  Superscalar
Year developed                       1973         1978         1989         1987         1991         1993         1996         1996
Number of instructions               208          303          235          69           94           225
Instruction size (bytes)             2-6          2-57         1-11         4            4            4            4            4
Addressing modes                     4            22           11           1            1            2            1            1
Number of general-purpose registers  16           16           8            40-520       32           32           40-520       32
Control memory size (kbits)          420          480          246          -            -            -            -            -
Cache size (kbytes)                  64           64           8            32           128          16-32        32           64

CISC = Complex Instruction Set Computer; RISC = Reduced Instruction Set Computer.


Instruction execution characteristics

Driving Force for CISC
• Software costs far exceed hardware costs
• Increasingly complex high-level languages
• Semantic gap
• Leads to:
  – Large instruction sets
  – More addressing modes
  – Hardware implementations of HLL statements
    • e.g. CASE (switch) on VAX


Intention of CISC
• Ease compiler writing
• Improve execution efficiency
  – Complex operations in microcode
• Support more complex HLLs



Execution Characteristics
• Operations performed
• Operands used
• Execution sequencing
• Studies have been done based on programs written in HLLs
• Dynamic studies are measured during the execution of the program



Operations
• Assignments
  – Movement of data
• Conditional statements (IF, LOOP)
  – Sequence control
• Procedure call-return is very time consuming
• Some HLL instructions lead to many machine code operations



Relative Dynamic Frequency (values in percent)

          Dynamic Occurrence    Machine-Instruction Weighted    Memory-Reference Weighted
          Pascal    C           Pascal    C                     Pascal    C
Assign    45        38          13        13                    14        15
Loop      5         3           42        32                    33        26
Call      15        12          31        33                    44        45
If        29        43          11        21                    7         13
GoTo      -         3           -         -                     -         -
Other     6         1           3         1                     2         1


Operands
• Mainly local scalar variables
• Optimization should concentrate on accessing local variables

                    Pascal    C     Average
Integer constant    16        23    20
Scalar variable     58        53    55
Array/structure     26        24    25


Procedure Calls
• Very time consuming
• Depends on number of parameters passed
• Depends on level of nesting
• Most programs do not do a lot of calls followed by lots of returns
• Most variables are local
• (cf. locality of reference)



Procedure Calls
• Procedure arguments and local scalar variables
  – The number of words required per procedure activation is not large



Implications
• Best support is given by optimizing the most used and most time-consuming features
• Large number of registers
  – Operand referencing
• Careful design of pipelines
  – Branch prediction etc.
• Simplified (reduced) instruction set



The use of a large register file

Large Register File
• Needed: a strategy that keeps the most frequently accessed operands in registers and minimizes register-memory operations
  – Software solution: rely on the compiler to maximize register usage
    • Requires the compiler to allocate registers
    • Allocate based on the most used variables in a given time
    • Requires sophisticated program analysis
  – Hardware solution
    • Have more registers
    • Thus more variables will be in registers


Registers for Local Variables
• Local variables must be saved from the registers into memory
• Store local scalar variables in registers
• Reduce memory access
• Every procedure (function) call changes locality
• Parameters must be passed
• Results must be returned
• Variables from calling programs must be restored



Register Windows
• Only few parameters
• Limited range of depth of call
• Use multiple small sets of registers
• Calls switch to a different set of registers
• Returns switch back to a previously used set of registers



Register Windows
• Three areas within a register set
  – Parameter registers
    • Hold parameters passed down from the procedure that called the current procedure
    • Hold the results to be passed back up
  – Local registers
    • Used for local variables
  – Temporary registers
    • Used to exchange parameters and results with the next lower level
    • The temporary registers at one level are physically the same as the parameter registers at the next level
    • This overlap permits parameters to be passed without the actual movement of data
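
The overlap between temporary and parameter registers can be sketched in a few lines of Python. This is only an illustration: the area sizes (2 parameter, 3 local, 2 temporary registers) and the helper name `physical_index` are assumptions made for the example, not values from any particular processor.

```python
# Hypothetical area sizes, chosen only for illustration.
N_PARAM = 2   # parameter registers per window
N_LOCAL = 3   # local registers per window
N_TEMP = 2    # temporary registers per window (equal to N_PARAM so they overlap)
STRIDE = N_PARAM + N_LOCAL  # distance between the starts of adjacent windows


def physical_index(level, area, i):
    """Map (call level, area, index within area) to a physical register number."""
    base = level * STRIDE
    if area == "param":
        return base + i
    if area == "local":
        return base + N_PARAM + i
    if area == "temp":
        return base + N_PARAM + N_LOCAL + i
    raise ValueError(area)


# One flat physical register file, large enough for three nesting levels.
registers = [0] * (3 * STRIDE + N_TEMP)

# Level 0 (the caller) writes an argument into its first temporary register...
registers[physical_index(0, "temp", 0)] = 42
# ...and level 1 (the callee) reads it from its first parameter register.
received = registers[physical_index(1, "param", 0)]
```

Because both names resolve to the same physical register, the argument reaches the callee without being copied.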



Overlapping Register Windows


Circular Buffer Diagram


Operation of Circular Buffer
• When a call is made, a current window pointer is moved to show the currently active register window
• If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory
• A saved window pointer indicates where the next saved window should be restored to
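
The call and return behaviour described above can be modelled with a toy Python class. This is a simplified sketch under stated assumptions: the class name `WindowBuffer` and its fields are hypothetical, windows are labelled by call depth instead of holding register contents, and all windows are treated as usable (real designs typically keep one in reserve because of the overlap).

```python
class WindowBuffer:
    """Toy circular buffer of register windows (hypothetical model)."""

    def __init__(self, n_windows):
        self.n = n_windows
        self.cwp = 0       # current window pointer
        self.in_use = 1    # windows currently holding live activations
        self.saved = []    # call depths of windows spilled to memory, oldest first
        self.depth = 0     # current call nesting depth

    def call(self):
        if self.in_use == self.n:
            # All windows busy: the "interrupt" saves the oldest window to memory.
            self.saved.append(self.depth - (self.n - 1))
            self.in_use -= 1
        self.cwp = (self.cwp + 1) % self.n
        self.in_use += 1
        self.depth += 1

    def ret(self):
        self.cwp = (self.cwp - 1) % self.n
        self.in_use -= 1
        self.depth -= 1
        if self.in_use == 0 and self.saved:
            # Returning into a window that was spilled: restore it from memory.
            self.saved.pop()
            self.in_use = 1


buf = WindowBuffer(4)
for _ in range(5):
    buf.call()
spilled_after_calls = list(buf.saved)   # depths whose windows had to be saved
for _ in range(5):
    buf.ret()
```

With four windows, the fourth and fifth nested calls each force the oldest activation out to memory, and the matching returns bring them back in reverse order.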



Global Variables
• Allocated by the compiler to memory
  – Inefficient for frequently accessed variables
• Have a set of registers for global variables



Registers vs Cache
• The register file acts as a small, fast buffer for a subset of all variables
• Register file acts much like a cache memory

Large Register File                        Cache
All local scalars                          Recently used local scalars
Individual variables                       Blocks of memory
Compiler-assigned global variables         Recently used global variables
Save/restore based on procedure nesting    Save/restore based on caching algorithm
Register addressing                        Memory addressing



Registers vs Cache
• The choice between a large register file and a cache is not clear-cut
• The register approach is superior
  – A cache-based system will be noticeably slower
    • The amount of addressing overhead is larger



Referencing a Scalar - Register File


Referencing a Scalar - Cache


Compiler-Based Register Optimization
• Assume small number of registers (16-32)
• Optimizing use is up to the compiler
• HLL programs have no explicit references to registers
  – usually (but think about C: register int)
• Assign a symbolic or virtual register to each candidate variable
• Map (unlimited) symbolic registers to real registers
• Symbolic registers that do not overlap can share real registers
• If you run out of real registers, some variables use memory


Compiler-Based Register Optimization
• The essential task
  – To decide which quantities are to be assigned to registers at any given point in the program
• The technique most commonly used
  – Graph coloring



Graph Coloring
• Given a graph of nodes and edges
• Assign a color to each node
• Adjacent nodes have different colors
• Use minimum number of colors
• Nodes are symbolic registers
• Two registers that are live in the same program fragment are joined by an edge
• Try to color the graph with n colors, where n is the number of real registers
• Nodes that cannot be colored are placed in memory
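
The coloring procedure above can be sketched in Python as follows. This is a minimal greedy heuristic, not an optimal (minimum-color) algorithm and not the method of any specific compiler; the function name `color_registers`, the ordering heuristic, and the example interference graph are all assumptions made for illustration.

```python
def color_registers(interference, n_registers):
    """Greedy coloring; returns (assignment, spilled).

    interference maps each symbolic register to the set of symbolic
    registers that are live at the same time (its graph neighbours).
    """
    assignment = {}
    spilled = []
    # Visit the most constrained nodes first (a common heuristic, assumed here).
    for node in sorted(interference, key=lambda v: (-len(interference[v]), v)):
        taken = {assignment[nb] for nb in interference[node] if nb in assignment}
        free = [r for r in range(n_registers) if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)   # no color left: this value lives in memory
    return assignment, spilled


# Hypothetical interference graph for six symbolic registers A..F.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
assignment, spilled = color_registers(graph, 3)
assignment2, spilled2 = color_registers(graph, 2)
```

With three real registers every symbolic register in this example receives a color; with only two, some nodes cannot be colored and are placed in memory.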



Graph Coloring Approach


Register Optimizations
• Trade-off between the use of a large set of registers and compiler-based register optimization
• With even simple register optimization
  – There is little benefit in using more than 64 registers
• With reasonably sophisticated register optimization
  – The use of more than 32 registers improves performance only marginally
• With a shared register optimization
  – The use of a small number of registers



Reduced instruction set architecture

CISC and RISC
• Motivation
  – To simplify compilers
  – To improve performance


Why CISC?
• Compiler simplification?
  – Disputed…
  – Complex machine instructions are harder to exploit
  – Optimization is more difficult
• Smaller programs?
  – Program takes up less memory but…
  – Memory is now cheap
  – May not occupy fewer bits, just look shorter in symbolic form
    • More instructions require longer op-codes
    • Register references require fewer bits
  – The number of bits of memory occupied may not be noticeably smaller



Code Size Relative to RISC


Why CISC?
• Faster programs?
  – Bias towards use of simpler instructions
  – More complex control unit
  – Microprogram control store larger
  – Thus simple instructions take longer to execute
  – The speedup in execution is due not so much to the power of the complex machine instructions as to their residence in high-speed control store
• It is far from clear that CISC is the appropriate solution



RISC Characteristics
• One instruction per cycle
• Register-to-register operations
• Few, simple addressing modes
• Few, simple instruction formats
• Hardwired design (no microcode)
• Fixed instruction format
• More compile time/effort



RISC Characteristics
• One instruction per cycle
  – One machine instruction per machine cycle
  – A machine cycle: the time it takes to fetch two operands from registers, perform an ALU operation and store the result in a register
• Register-to-register operations
  – Most operations should be register-to-register, with only simple LOAD and STORE operations accessing memory
  – This design feature simplifies the instruction set and the control unit



RISC Characteristics
• Few, simple addressing modes
  – Almost all instructions use register addressing
  – Several additional modes
    • Displacement
    • PC-relative
• Few, simple instruction formats
  – Only one or a few formats are used
  – Instruction length is fixed and aligned on word boundaries



R-to-R vs M-to-M


Benefits of RISC
• Related to performance
• Related to VLSI implementation



Benefits of RISC
• Related to performance
  – More effective optimizing compilers can be developed
  – Most instructions generated by a compiler are relatively simple
  – The use of instruction pipelining
  – RISC programs should be more responsive to interrupts because interrupts are checked between rather elementary operations



Benefits of RISC
• Related to VLSI implementation
  – Possible to put an entire processor on a single chip
  – On-chip delays are of much shorter duration than interchip delays
  – Design-and-implementation time



Design and Layout Effort


Characteristics of Processors


RISC Pipelining
• Most instructions are register to register
• Two phases of execution
  – I: Instruction fetch
  – E: Execute
    • ALU operation with register input and output
• For load and store
  – I: Instruction fetch
  – E: Execute
    • Calculate memory address
  – D: Memory
    • Register to memory or memory to register operation
• If an instruction needs an operand that is altered by the preceding instruction, a delay is required
  – This delay can be accomplished by a NOOP
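
The NOOP rule in the last bullet can be sketched as a small Python pass over an instruction list. This is an illustrative model only: the function name `insert_noops`, the tuple layout, and the register names are assumptions, and the rule is applied exactly as stated on the slide (any read of the register written by the immediately preceding instruction triggers a NOOP).

```python
def insert_noops(program):
    """Insert a NOOP wherever an instruction reads a register that the
    immediately preceding instruction writes.

    Each instruction is (text, dest_register_or_None, set_of_source_registers).
    """
    result = []
    prev_dest = None
    for text, dest, sources in program:
        if prev_dest is not None and prev_dest in sources:
            result.append("NOOP")
        result.append(text)
        prev_dest = dest
    return result


program = [
    ("LOAD rA, M", "rA", set()),
    ("ADD rC, rA, rB", "rC", {"rA", "rB"}),   # needs rA from the LOAD
    ("SUB rD, rE, rF", "rD", {"rE", "rF"}),   # independent of the ADD
]
scheduled = insert_noops(program)
```

Only the ADD, which depends on the LOAD directly before it, receives a NOOP; the independent SUB follows without delay.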


Sequential Execution


Two-way Pipelined Timing


Three-way Pipelined Timing


Four-way Pipelined Timing


Optimization of Pipelining
• Data and branch dependencies reduce the overall execution rate
• Delayed branch
  – Does not take effect until after execution of the following instruction
  – This following instruction is the delay slot



Normal and Delayed Branch

Address   Normal Branch   Delayed Branch   Optimized Delayed Branch
100       LOAD X, A       LOAD X, A        LOAD X, A
101       ADD 1, A        ADD 1, A         JUMP 105
102       JUMP 105        JUMP 106         ADD 1, A
103       ADD A, B        NOP              ADD A, B
104       SUB C, B        ADD A, B         SUB C, B
105       STORE A, Z      SUB C, B         STORE A, Z
106                       STORE A, Z
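
To check that the three versions of the program compute the same result, they can be run on a tiny interpreter. The sketch below is a hypothetical model, not a description of real hardware: the function `run`, the dictionary encoding, and the reading of operands as "source, destination" are assumptions inferred from the table's notation.

```python
def run(program, delayed, memory):
    """Execute a program given as {address: (op, arg1, arg2)}.

    With delayed=True, the instruction after a JUMP (the delay slot)
    executes before the jump takes effect.
    """
    regs = {"A": 0, "B": 0, "C": 0}
    mem = dict(memory)
    pc = min(program)
    pending = None   # jump target waiting for the delay slot to finish
    while pc in program:
        op, x, y = program[pc]
        next_pc = pc + 1
        if pending is not None:
            next_pc, pending = pending, None
        if op == "LOAD":
            regs[y] = mem[x]
        elif op == "STORE":
            mem[y] = regs[x]
        elif op == "ADD":
            regs[y] += x if isinstance(x, int) else regs[x]
        elif op == "SUB":
            regs[y] -= regs[x]
        elif op == "JUMP":
            if delayed:
                pending = x
            else:
                next_pc = x
        # Any other op (e.g. NOP) does nothing.
        pc = next_pc
    return mem


LOAD, INC = ("LOAD", "X", "A"), ("ADD", 1, "A")
ADD, SUB, STORE = ("ADD", "A", "B"), ("SUB", "C", "B"), ("STORE", "A", "Z")
NOP = ("NOP", None, None)

normal = dict(zip(range(100, 106), [LOAD, INC, ("JUMP", 105, None), ADD, SUB, STORE]))
delayed = dict(zip(range(100, 107), [LOAD, INC, ("JUMP", 106, None), NOP, ADD, SUB, STORE]))
optimized = dict(zip(range(100, 106), [LOAD, ("JUMP", 105, None), INC, ADD, SUB, STORE]))

z_normal = run(normal, False, {"X": 7})["Z"]
z_delayed = run(delayed, True, {"X": 7})["Z"]
z_optimized = run(optimized, True, {"X": 7})["Z"]
```

In the optimized version the ADD 1, A instruction fills the delay slot, so the program needs no NOP and is one instruction shorter than the plain delayed-branch version while storing the same value.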


Use of Delayed Branch


Delayed Branch
• The interchange of instructions will work successfully for unconditional branches, calls and returns
  – Cannot be blindly applied for conditional branches
  – If the condition that is tested for the branch can be altered by the immediately preceding instruction, the compiler must refrain from doing the interchange and instead insert a NOOP



Delayed Branch
• Delayed load can be used on LOAD instructions
  – On the LOAD instruction, the register that is to be the target of the load is locked by the processor
  – The processor continues execution of the instruction stream until it reaches an instruction requiring that register
    • At that point, it idles until the load is complete
• The scheduling of instructions for the pipeline and the dynamic allocation of registers should be considered together to achieve the greatest efficiency
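
The register-locking behaviour of a delayed load can be approximated with a short Python model that counts idle cycles. This is a rough sketch under stated assumptions: one instruction issues per cycle, the function name `stall_cycles` and the `load_latency` parameter are hypothetical, and a latency of 2 cycles is chosen only for the example.

```python
def stall_cycles(program, load_latency):
    """Count idle cycles under a simple delayed-load model.

    program is a list of (op, dest, sources). A LOAD issued at cycle t locks
    its destination register until cycle t + load_latency; an instruction
    that needs a locked register idles until the lock is released.
    """
    ready_at = {}   # register -> first cycle at which its value is usable
    cycle = 0
    stalls = 0
    for op, dest, sources in program:
        needed = max((ready_at.get(r, 0) for r in sources), default=0)
        if needed > cycle:
            stalls += needed - cycle
            cycle = needed
        if dest is not None:
            ready_at[dest] = cycle + (load_latency if op == "LOAD" else 1)
        cycle += 1
    return stalls


back_to_back = [
    ("LOAD", "rA", []),
    ("ADD", "rC", ["rA", "rB"]),    # needs rA immediately after the LOAD
]
rescheduled = [
    ("LOAD", "rA", []),
    ("SUB", "rD", ["rE", "rF"]),    # independent work fills the gap
    ("ADD", "rC", ["rA", "rB"]),
]
stalls_before = stall_cycles(back_to_back, load_latency=2)
stalls_after = stall_cycles(rescheduled, load_latency=2)
```

Placing an independent instruction between the LOAD and its first use removes the idle cycle, which is why instruction scheduling and register allocation are best considered together.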



MIPS R4000
• 64 rather than 32 bits for all internal and external data paths and for addresses, registers and the ALU
• The processor chip
  – Partitioned into two sections
    • One containing the CPU
    • The other containing a coprocessor for memory management
  – Supports thirty-two 64-bit registers
  – Provides for up to 128 kbytes of high-speed cache, half each for instructions and data


MIPS R-series Instruction Set


Additional R4000 Instructions


MIPS Instruction Formats


Other Addressing ModesOther Addressing Modes

• Synthesizing other addressing modes with the MIPS addressing mode



Instruction Pipeline
• Two classes of processors have evolved to offer execution of multiple instructions per clock cycle
  – Superscalar architecture
  – Superpipelined architecture



Instruction Pipeline
• Superscalar architecture
  – Replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously
• Superpipelined architecture
  – Makes use of more, and more fine-grained, pipeline stages
  – With more stages, more instructions can be in the pipeline at the same time, increasing parallelism



Instruction Pipeline
• Both approaches have limitations
  – With superscalar architecture
    • Dependencies between instructions in different pipelines can slow down the system
    • Overhead logic is required to coordinate these dependencies
  – With superpipelining
    • Overhead associated with transferring instructions from one stage to the next



R3000
• Five pipeline stages
  – Instruction fetch
  – Source operand fetch from register file
  – ALU operation or data operand address generation
  – Data memory reference
  – Write back into register file



R3000 Pipeline


R3000 Pipeline Stages


R4000
• Eight pipeline stages
  – Instruction fetch first half
  – Instruction fetch second half
  – Register file
  – Instruction execute
  – Data cache first
  – Data cache second
  – Tag check
  – Write back



R3000 and Actual R4000


SPARC

SPARC Register Window Layout


Eight Register Windows


SPARC Register Overlap
• The calling procedure places any parameters to be passed in its outs registers
• The called procedure treats these same physical registers as its ins registers
• The processor maintains a current window pointer (CWP)
  – CWP is located in the processor status register (PSR)
    • The CWP points to the window of the currently executing procedure
• The window invalid mask (WIM) indicates which windows are invalid


Instruction Set
• Available ALU operations
  – Integer addition (with or without carry)
  – Integer subtraction (with or without carry)
  – Bitwise Boolean AND, OR, XOR and their negations
  – Shift left logical, right logical or right arithmetic


SPARC Instruction Set


Synthesizing
• Synthesizing other addressing modes with SPARC addressing modes


SPARC Instruction Format


Controversy
• Quantitative
  – Compare program sizes and execution speeds
• Qualitative
  – Examine issues of high-level language support and use of VLSI real estate
• Problems
  – No pair of RISC and CISC machines that are directly comparable
  – No definitive set of test programs
  – Difficult to separate hardware effects from compiler effects
  – Most comparisons done on “toy” rather than production machines
  – Most commercial devices are a mixture
