Yonsei University
Chapter 12
Reduced Instruction Set Computers
Contents
• Instruction Execution Characteristics
• The Use of a Large Register File
• Compiler-based Register Optimization
• Reduced Instruction Set Architecture
• RISC Pipelining
• MIPS R4000
• SPARC
• RISC versus CISC Controversy
Introduction
Major Advances in Computers
• The family concept
  – IBM System/360 (1964) and DEC PDP-8
  – Separates architecture from implementation
• Microprogrammed control unit
  – Produced by IBM in the System/360 (1964)
• Cache memory
  – IBM System/360 Model 85 (1969)
• Solid-state RAM
• Microprocessors
• Pipelining
  – Introduces parallelism into the fetch-execute cycle
• Multiple processors
The Next Step: RISC
• Reduced Instruction Set Computer
• Key features
  – A large number of general-purpose registers, or the use of compiler technology to optimize register use
  – A limited and simple instruction set
  – An emphasis on optimizing the instruction pipeline
Comparison of Processors
Characteristic                        IBM 370/168  VAX 11/780   Intel 80486  SPARC        MIPS R4000   PowerPC      UltraSPARC   MIPS R10000
Class                                 CISC         CISC         CISC         RISC         RISC         Superscalar  Superscalar  Superscalar
Year developed                        1973         1978         1989         1987         1991         1993         1996         1996
Number of instructions                208          303          235          69           94           225          -            -
Instruction size (bytes)              2-6          2-57         1-11         4            4            4            4            4
Addressing modes                      4            22           11           1            1            2            1            1
Number of general-purpose registers   16           16           8            40-520       32           32           40-520       32
Control memory size (kbits)           420          480          246          -            -            -            -            -
Cache size (kbytes)                   64           64           8            32           128          16-32        32           64
CISC = complex instruction set computer; RISC = reduced instruction set computer
Instruction Execution Characteristics
Driving Force for CISC
• Software costs far exceed hardware costs
• Increasingly complex high-level languages (HLLs)
• Semantic gap between HLL operations and machine operations
• Leads to:
  – Large instruction sets
  – More addressing modes
  – Hardware implementations of HLL statements
    • e.g. CASE (switch) on VAX
Intention of CISC
• Ease compiler writing
• Improve execution efficiency
  – Complex operations implemented in microcode
• Support more complex HLLs
Execution Characteristics
• Operations performed
• Operands used
• Execution sequencing
• Studies have been done based on programs written in HLLs
• Dynamic studies measure these characteristics during the execution of the program
Operations
• Assignments
  – Movement of data
• Conditional statements (IF, LOOP)
  – Sequence control
• Procedure call-return is very time consuming
• Some HLL statements lead to many machine-code operations
Relative Dynamic Frequency (%)
          Dynamic Occurrence   Machine-Instruction Weighted   Memory-Reference Weighted
          Pascal    C          Pascal    C                    Pascal    C
Assign    45        38         13        13                   14        15
Loop      5         3          42        32                   33        26
Call      15        12         31        33                   44        45
If        29        43         11        21                   7         13
GoTo      -         3          -         -                    -         -
Other     6         1          3         1                    2         1
Operands
• Mainly local scalar variables
• Optimization should concentrate on accessing local variables
Operand type (%)     Pascal    C     Average
Integer constant     16        23    20
Scalar variable      58        53    55
Array/structure      26        24    25
Procedure Calls
• Very time consuming
• Cost depends on the number of parameters passed
• Cost depends on the level of nesting
• Most programs do not do a lot of calls followed by lots of returns
• Most variables are local
• (cf. locality of reference)
Procedure Calls
• Procedure arguments and local scalar variables
  – The number of words required per procedure activation is not large
Implications
• Best support is given by optimizing the most used and most time-consuming features
• Large number of registers
  – Operand referencing
• Careful design of pipelines
  – Branch prediction etc.
• Simplified (reduced) instruction set
The Use of a Large Register File
Large Register File
• Needed: a strategy that keeps the most frequently accessed operands in registers and minimizes register-memory operations
  – Software solution: rely on the compiler to maximize register usage
    • Requires the compiler to allocate registers
    • Allocates based on the most used variables in a given time
    • Requires sophisticated program analysis
  – Hardware solution
    • Have more registers
    • Thus more variables will be in registers
Registers for Local Variables
• Store local scalar variables in registers
• Reduces memory access
• Every procedure (function) call changes locality
  – On a call, the caller's local variables must be saved from the registers into memory so that the called procedure can reuse the registers
  – Parameters must be passed
  – Results must be returned
  – On return, the variables of the calling program must be restored
Register Windows
• Procedures typically use only a few parameters
• The depth of call nesting stays within a limited range
• Use multiple small sets of registers
• Calls switch to a different set of registers
• Returns switch back to a previously used set of registers
Register Windows
• Three areas within a register set
  – Parameter registers
    • Hold parameters passed down from the procedure that called the current procedure
    • Hold the results to be passed back up
  – Local registers
    • Used for local variables
  – Temporary registers
    • Used to exchange parameters and results with the next lower level
    • The temporary registers at one level are physically the same as the parameter registers at the next level
    • This overlap permits parameters to be passed without the actual movement of data
Overlapping Register Windows (figure not reproduced)
Circular Buffer Diagram (figure not reproduced)
Operation of Circular Buffer
• When a call is made, a current window pointer is moved to show the currently active register window
• If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory
• A saved window pointer indicates where the next saved window should be restored to
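The circular-buffer behaviour described above can be sketched in a few lines of Python. This is a toy model only, not any real processor's mechanism: the class name `RegisterWindows`, the choice to keep one window free (so N windows hold N-1 activations), and the use of a plain list as a stand-in for memory are all assumptions made for illustration.

```python
class RegisterWindows:
    """Toy model of a circular buffer of register windows (illustrative only)."""

    def __init__(self, n_windows):
        if n_windows < 2:
            raise ValueError("need at least two windows")
        self.n = n_windows
        self.capacity = n_windows - 1  # assumption: one window is kept free
        self.cwp = 0         # current window pointer
        self.depth = 0       # call nesting depth (0 = outermost procedure)
        self.resident = 1    # activations whose windows are in the register file
        self.saved = []      # stand-in for memory: depths of spilled windows
        self.overflows = 0
        self.underflows = 0

    def call(self):
        if self.resident == self.capacity:
            # All usable windows are occupied: spill the oldest one to memory.
            oldest = self.depth - self.resident + 1
            self.saved.append(oldest)
            self.resident -= 1
            self.overflows += 1
        self.cwp = (self.cwp + 1) % self.n
        self.depth += 1
        self.resident += 1

    def ret(self):
        if self.depth == 0:
            raise RuntimeError("return from the outermost procedure")
        self.cwp = (self.cwp - 1) % self.n
        self.depth -= 1
        self.resident -= 1
        if self.resident == 0:
            # The caller's window was spilled earlier: restore it from memory.
            self.saved.pop()
            self.resident = 1
            self.underflows += 1


rw = RegisterWindows(4)
for _ in range(5):   # nest five calls deep
    rw.call()
for _ in range(5):   # and unwind again
    rw.ret()
```

With four windows and five nested calls, the model spills only when the nesting depth exceeds what the buffer can hold, which is the point of the design: shallow call chains never touch memory.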
Global Variables
• Allocated by the compiler to memory
  – Inefficient for frequently accessed variables
• Have a set of registers for global variables
Registers vs Cache
• The register file acts as a small, fast buffer for a subset of all variables
• The register file acts much like a cache memory
Large Register File                        Cache
All local scalars                          Recently used local scalars
Individual variables                       Blocks of memory
Compiler-assigned global variables         Recently used global variables
Save/restore based on procedure nesting    Save/restore based on caching algorithm
Register addressing                        Memory addressing
Registers vs Cache
• The choice between a large register file and a cache is not clear-cut
• In one respect the register approach is clearly superior: addressing overhead
  – A register is selected by a short register number, whereas a cache reference needs a full memory address
  – A cache-based system will therefore be noticeably slower
Referencing a Scalar: Register File (figure not reproduced)
Referencing a Scalar: Cache (figure not reproduced)
Compiler-Based Register Optimization
• Assume a small number of registers (16-32)
• Optimizing their use is up to the compiler
• HLL programs usually have no explicit references to registers
  – An exception is the register storage class in C, e.g. register int
• Assign a symbolic or virtual register to each candidate variable
• Map the (unlimited) symbolic registers to real registers
• Symbolic registers whose use does not overlap can share a real register
• If you run out of real registers, some variables are kept in memory
Compiler-Based Register Optimization
• The essential task
  – To decide which quantities are to be assigned to registers at any given point in the program
• The technique most commonly used
  – Graph coloring
Graph Coloring
• Given a graph of nodes and edges
• Assign a color to each node
• Adjacent nodes have different colors
• Use the minimum number of colors
• Nodes are symbolic registers
• Two registers that are live in the same program fragment are joined by an edge
• Try to color the graph with n colors, where n is the number of real registers
• Nodes that cannot be colored are placed in memory
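As a rough illustration of the bullets above, the following Python sketch colors an interference graph greedily and spills whatever it cannot color. It is a simple heuristic under stated assumptions, not the allocator of any particular compiler: the function name `color_registers`, the highest-degree-first ordering, and the six-node example graph are all hypothetical, and a greedy pass does not guarantee the minimum number of colors.

```python
def color_registers(interference, n_registers):
    """Greedy graph coloring for register allocation (heuristic sketch).

    interference maps each symbolic register to the set of symbolic
    registers that are live at the same time. Returns (assignment, spilled),
    where assignment maps symbolic registers to real register numbers and
    spilled lists the symbolic registers left in memory.
    """
    assignment = {}
    spilled = []
    # Heuristic: handle the most constrained nodes (highest degree) first.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for node in order:
        used = {assignment[n] for n in interference[node] if n in assignment}
        free = [r for r in range(n_registers) if r not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # no color left: keep this value in memory
    return assignment, spilled


# Hypothetical interference graph: A, B, C and D are all live together.
graph = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "F"},
    "F": {"E"},
}
assignment, spilled = color_registers(graph, 3)
```

With three real registers, one of the four mutually interfering symbolic registers has to stay in memory; with four real registers nothing is spilled.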
Graph Coloring Approach (figure not reproduced)
Register Optimizations
• There is a trade-off between the use of a large set of registers and compiler-based register optimization
• With even simple register optimization
  – There is little benefit in using more than 64 registers
• With reasonably sophisticated register optimization
  – Using more than 32 registers improves performance only marginally
• With a small number of registers
  – A shared register organization executes faster than a split one
Reduced Instruction Set Architecture
CISC and RISC
• Motivation for CISC
  – To simplify compilers
  – To improve performance
Why CISC?
• Compiler simplification?
  – Disputed
  – Complex machine instructions are harder to exploit
  – Optimization is more difficult
• Smaller programs?
  – The program takes up less memory, but memory is now cheap
  – May not occupy fewer bits, just look shorter in symbolic form
    • More instructions require longer opcodes
    • Register references require fewer bits than memory references
  – The number of bits of memory occupied may not be noticeably smaller
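The two sub-points about field widths come down to simple arithmetic, sketched below in Python. The helper name `min_field_bits` is hypothetical, and the calculation assumes a plain fixed-width binary encoding, which real instruction sets often do not use; the instruction counts are the ones listed on the Comparison of Processors slide.

```python
def min_field_bits(n_values):
    """Smallest number of bits that can distinguish n_values items,
    assuming a plain fixed-width binary encoding."""
    if n_values < 1:
        raise ValueError("need at least one value")
    return (n_values - 1).bit_length()


# Opcode width needed for the instruction counts on the comparison slide.
opcode_bits = {
    "VAX 11/780 (303 instructions)": min_field_bits(303),
    "IBM 370/168 (208 instructions)": min_field_bits(208),
    "SPARC (69 instructions)": min_field_bits(69),
}

# Operand width: one of 32 registers versus a 32-bit memory address.
register_operand_bits = min_field_bits(32)
memory_operand_bits = 32
```

Under these assumptions a 303-instruction set needs a 9-bit opcode where a 69-instruction set needs 7, and a register operand costs 5 bits where a full memory address costs 32.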
Code Size Relative to RISC (figure or table not reproduced)
Why CISC?
• Faster programs?
  – Compilers show a bias towards the use of simpler instructions
  – The control unit is more complex
  – The microprogram control store is larger
  – Thus simple instructions take longer to execute
  – Any speedup in execution is due not so much to the power of the complex machine instructions as to their residence in high-speed control store
• It is far from clear that CISC is the appropriate solution
RISC Characteristics
• One instruction per cycle
• Register-to-register operations
• Few, simple addressing modes
• Few, simple instruction formats
• Hardwired design (no microcode)
• Fixed instruction format
• More compile time/effort
RISC Characteristics
• One instruction per cycle
  – One machine instruction per machine cycle
  – A machine cycle: the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register
• Register-to-register operations
  – Most operations should be register-to-register, with only simple LOAD and STORE operations accessing memory
  – This design feature simplifies the instruction set and the control unit
RISC Characteristics
• Few, simple addressing modes
  – Almost all instructions use register addressing
  – Several additional modes
    • Displacement
    • PC-relative
• Few, simple instruction formats
  – Only one or a few formats are used
  – Instruction length is fixed and aligned on word boundaries
R-to-R vs M-to-M (figure not reproduced)
Benefits of RISC
• Related to performance
• Related to VLSI implementation
Benefits of RISC
• Related to performance
  – More effective optimizing compilers can be developed
  – Most instructions generated by a compiler are relatively simple
  – Instruction pipelining can be applied more effectively
  – RISC programs should be more responsive to interrupts because interrupts are checked between rather elementary operations
Benefits of RISC
• Related to VLSI implementation
  – Possible to put an entire processor on a single chip
  – On-chip delays are of much shorter duration than interchip delays
  – Shorter design-and-implementation time
Design and Layout Effort (figure or table not reproduced)
Characteristics of Processors (figure or table not reproduced)
RISC Pipelining
• Most instructions are register to register
• Two phases of execution
  – I: Instruction fetch
  – E: Execute
    • ALU operation with register input and output
• For load and store, three phases
  – I: Instruction fetch
  – E: Execute
    • Calculate memory address
  – D: Memory
    • Register-to-memory or memory-to-register operation
• If an instruction needs an operand that is altered by the preceding instruction, a delay is required
  – This delay can be accomplished by a NOOP
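The NOOP rule in the last bullet can be sketched as a small Python pass over an instruction list. This is a simplified illustration that applies the rule exactly as stated (one NOOP whenever an instruction reads a register written by the instruction immediately before it); the `Instr` tuple, the `insert_noops` name, and the four-instruction example program are assumptions, and a real pipeline may need the delay only in some of these cases.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


NOOP = Instr("NOOP", None, ())


def insert_noops(program):
    """Insert a NOOP before any instruction that reads a register
    written by the immediately preceding instruction."""
    out = []
    for instr in program:
        prev = out[-1] if out else None
        if prev is not None and prev.dest is not None and prev.dest in instr.srcs:
            out.append(NOOP)
        out.append(instr)
    return out


# Hypothetical fragment: rC = M[a] + M[b], then store rC.
program = [
    Instr("LOAD", "rA", ()),
    Instr("LOAD", "rB", ()),
    Instr("ADD", "rC", ("rA", "rB")),
    Instr("STORE", None, ("rC",)),
]
scheduled = insert_noops(program)
```

A compiler would normally try to move an independent instruction into such a slot instead of wasting it on a NOOP.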
Sequential Execution (figure not reproduced)
Two-way Pipelined Timing (figure not reproduced)
Three-way Pipelined Timing (figure not reproduced)
Four-way Pipelined Timing (figure not reproduced)
Optimization of Pipelining
• Data and branch dependencies reduce the overall execution rate
• Delayed branch
  – The branch does not take effect until after execution of the following instruction
  – This following instruction is the delay slot
Normal and Delayed Branch
Address   Normal Branch   Delayed Branch   Optimized Delayed Branch
100       LOAD X, A       LOAD X, A        LOAD X, A
101       ADD 1, A        ADD 1, A         JUMP 105
102       JUMP 105        JUMP 106         ADD 1, A
103       ADD A, B        NOOP             ADD A, B
104       SUB C, B        ADD A, B         SUB C, B
105       STORE A, Z      SUB C, B         STORE A, Z
106                       STORE A, Z
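To check that the three columns of the table do the same useful work, the sketch below traces which addresses execute under normal and delayed branch semantics. It is a minimal Python model, assuming a single delay slot and treating instructions as opaque strings; the `trace` function and its `max_steps` guard are hypothetical helpers, not part of any real simulator.

```python
def trace(program, start, delayed, max_steps=100):
    """Return the addresses executed, in order.

    program maps addresses to instruction strings. With delayed=True a
    JUMP takes effect only after the following instruction (the delay
    slot) has executed.
    """
    executed = []
    pc = start
    pending = None  # target of a delayed jump waiting for its delay slot
    while pc in program and len(executed) < max_steps:
        instr = program[pc]
        executed.append(pc)
        next_pc = pc + 1
        if pending is not None:   # this instruction was the delay slot
            next_pc = pending
            pending = None
        if instr.startswith("JUMP"):
            target = int(instr.split()[1])
            if delayed:
                pending = target
            else:
                next_pc = target
        pc = next_pc
    return executed


normal = {100: "LOAD X, A", 101: "ADD 1, A", 102: "JUMP 105",
          103: "ADD A, B", 104: "SUB C, B", 105: "STORE A, Z"}
delayed = {100: "LOAD X, A", 101: "ADD 1, A", 102: "JUMP 106", 103: "NOOP",
           104: "ADD A, B", 105: "SUB C, B", 106: "STORE A, Z"}
optimized = {100: "LOAD X, A", 101: "JUMP 105", 102: "ADD 1, A",
             103: "ADD A, B", 104: "SUB C, B", 105: "STORE A, Z"}


def useful(program, addresses):
    """Executed instructions other than the jump itself and NOOPs."""
    return [program[a] for a in addresses
            if not program[a].startswith(("JUMP", "NOOP"))]
```

In this model all three versions execute LOAD, ADD and STORE in that order; the plain delayed version spends one extra slot on the NOOP, and the optimized version recovers it by moving ADD 1, A into the delay slot.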
Use of Delayed Branch (figure not reproduced)
Delayed Branch
• The interchange of instructions will work successfully for unconditional branches, calls, and returns
  – It cannot be blindly applied to conditional branches
  – If the condition that is tested for the branch can be altered by the immediately preceding instruction, the compiler must refrain from doing the interchange and instead insert a NOOP
Delayed Load
• Delayed load can be used on LOAD instructions
  – On a LOAD instruction, the register that is to be the target of the load is locked by the processor
  – The processor continues execution of the instruction stream until it reaches an instruction requiring that register
    • At that point, it idles until the load is complete
• The scheduling of instructions for the pipeline and the dynamic allocation of registers should be considered together to achieve the greatest efficiency
MIPS R4000
• Uses 64 rather than 32 bits for all internal and external data paths and for addresses, registers, and the ALU
• The processor chip
  – Partitioned into two sections
    • One containing the CPU
    • The other containing a coprocessor for memory management
  – Supports 32 64-bit registers
  – Provides for up to 128 kbytes of high-speed cache, half each for instructions and data
MIPS R-series Instruction Set (table not reproduced)
Additional R4000 Instructions (table not reproduced)
MIPS Instruction Formats (figure not reproduced)
Other Addressing Modes
• Synthesizing other addressing modes with the MIPS addressing mode
Instruction Pipeline
• Two classes of processors have evolved to offer execution of multiple instructions per clock cycle
  – Superscalar architecture
  – Superpipelined architecture
Instruction Pipeline
• Superscalar architecture
  – Replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously
• Superpipelined architecture
  – Makes use of more, finer-grained pipeline stages
  – With more stages, more instructions can be in the pipeline at the same time, increasing parallelism
Instruction Pipeline
• Both approaches have limitations
  – With superscalar architecture
    • Dependencies between instructions in different pipelines can slow down the system
    • Overhead logic is required to coordinate these dependencies
  – With superpipelining
    • There is overhead associated with transferring instructions from one stage to the next
R3000
• Five pipeline stages
  – Instruction fetch
  – Source operand fetch from register file
  – ALU operation or data operand address generation
  – Data memory reference
  – Write back into register file
R3000 Pipeline (figure not reproduced)
R3000 Pipeline Stages (figure or table not reproduced)
R4000
• Eight pipeline stages
  – Instruction fetch first half
  – Instruction fetch second half
  – Register file
  – Instruction execute
  – Data cache first
  – Data cache second
  – Tag check
  – Write back
R3000 and Actual R4000 (figure not reproduced)
SPARC
SPARC Register Window Layout (figure not reproduced)
Eight Register Windows (figure not reproduced)
SPARC Register Overlap
• The calling procedure places any parameters to be passed in its outs registers
• The called procedure treats these same physical registers as its ins registers
• The processor maintains a current window pointer (CWP)
  – CWP is located in the processor status register (PSR)
  – CWP points to the window of the currently executing procedure
• The window invalid mask (WIM) indicates which windows are invalid
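The overlap between a caller's outs and a callee's ins can be made concrete with a small address calculation. The Python sketch below assumes the usual SPARC register naming (r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins) and that a call moves to window number CWP - 1; the physical numbering itself and the function names `physical_register` and `total_registers` are illustrative assumptions, not the layout of a particular implementation.

```python
def total_registers(n_windows):
    """8 globals plus 16 unique registers per window (ins overlap outs)."""
    return 8 + 16 * n_windows


def physical_register(window, reg, n_windows):
    """Map logical register reg (0..31) of a window to a physical index.

    Assumed layout: globals occupy indices 0..7; the windowed registers
    form a ring of 16 * n_windows registers in which the ins of window w
    are the same physical registers as the outs of window w + 1.
    """
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in 0..31")
    if reg < 8:
        return reg                    # globals are shared by all windows
    ring = 16 * n_windows
    return 8 + (16 * window + (reg - 8)) % ring


N = 8                                 # eight register windows
caller, callee = 3, 2                 # assumed: a call decrements the window number
o0_of_caller = physical_register(caller, 8, N)    # caller's first out register
i0_of_callee = physical_register(callee, 24, N)   # callee's first in register
```

Under these assumptions the caller's first out register and the callee's first in register resolve to the same physical index, so no data moves on a call, and 2 to 32 windows give 40 to 520 registers, matching the range on the Comparison of Processors slide.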
Instruction Set
• Available ALU operations
  – Integer addition (with or without carry)
  – Integer subtraction (with or without carry)
  – Bitwise Boolean AND, OR, XOR and their negations
  – Shift left logical, right logical, or right arithmetic
SPARC Instruction Set (table not reproduced)
Synthesizing
• Synthesizing other addressing modes with SPARC addressing modes
SPARC Instruction Format (figure not reproduced)
Controversy
• Quantitative
  – Compare program sizes and execution speeds
• Qualitative
  – Examine issues of high-level language support and use of VLSI real estate
• Problems
  – No pair of RISC and CISC machines is directly comparable
  – No definitive set of test programs
  – Difficult to separate hardware effects from compiler effects
  – Most comparisons were done on "toy" rather than production machines
  – Most commercial devices are a mixture