15-447 Computer Architecture Fall 2007 ©
October 3rd, 2007
Majd F. Sakr
msakr@qatar.cmu.edu
www.qatar.cmu.edu/~msakr/15447-f07/
CS-447 – Computer Architecture
M, W 10:00-11:20am
Lecture 11: Single Cycle Datapath
Lecture Objectives
° Learn what a datapath is and how it provides the required functions.
° Appreciate why different implementation strategies affect the clock rate and CPI of a machine.
° Understand how the ISA determines many aspects of the hardware implementation.
Implementation vs. Performance
Performance of a processor is determined by
• Instruction count of a program
• CPI
• Clock cycle time (clock rate)
The compiler & the ISA determine the instruction count.
The implementation of the processor determines the CPI and the clock cycle time.
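The three factors above multiply to give execution time. A minimal sketch of that relationship follows; the function name and the sample numbers are illustrative assumptions, not taken from the lecture.

```python
def cpu_time_ns(instruction_count, cpi, clock_cycle_ns):
    """Execution time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_ns


# Hypothetical program: 100 instructions on a machine with CPI = 1
# and an 8 ns clock cycle (numbers chosen only for illustration).
print(cpu_time_ns(instruction_count=100, cpi=1, clock_cycle_ns=8))
```

Lowering any one factor helps only if the other two do not grow to compensate, which is why the implementation trade-off between CPI and clock cycle time matters.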
Possible Execution Steps of Any Instruction
° Instruction Fetch
° Instruction Decode and Register Fetch
° Execution of the Memory Reference Instruction
° Execution of Arithmetic-Logical operations
° Branch Instruction
° Jump Instruction
Instruction Processing
° Five steps:
• Instruction fetch (IF)
• Instruction decode and operand fetch (ID)
• ALU/execute (EX)
• Memory access (MEM), not required by every instruction
• Write-back (WB)
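Not every instruction needs all five steps. The sketch below records which steps four representative instructions use, following the per-instruction timing breakdown given later in this lecture; the table and helper names are hypothetical.

```python
STEPS = ("IF", "ID", "EX", "MEM", "WB")

# Steps each instruction actually uses (illustrative subset of the ISA).
USES = {
    "add": ("IF", "ID", "EX", "WB"),         # no data memory access
    "lw":  ("IF", "ID", "EX", "MEM", "WB"),  # uses every step
    "sw":  ("IF", "ID", "EX", "MEM"),        # nothing written back to a register
    "beq": ("IF", "ID", "EX"),               # compare only, then update the PC
}


def skipped_steps(instr):
    """Return the steps that the given instruction does not need."""
    return [step for step in STEPS if step not in USES[instr]]


for name in USES:
    print(name, "skips", skipped_steps(name))
```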
[Figure: abstract datapath (PC, instruction memory, registers, ALU, data memory) with the five steps IF, ID, EX, MEM and WB marked]
Datapath & Control
[Figure: datapath with the control unit]
Datapath Elements
The datapath contains two types of logic elements:
• Combinational (e.g., ALU): elements that operate on data values. Their outputs depend only on their current inputs.
• State (e.g., registers and memory): elements with internal storage. Their state is defined by the values they contain.
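The difference between the two kinds of element can be sketched in software. This is a simplified model, assuming 32-bit values and a write-enable input on the register; the names `alu_add` and `Register` are made up for illustration.

```python
def alu_add(a, b):
    """Combinational element: the output depends only on the current inputs."""
    return (a + b) & 0xFFFFFFFF  # wrap to 32 bits


class Register:
    """State element: keeps its value until it is clocked with write enabled."""

    def __init__(self):
        self.value = 0

    def clock(self, data_in, write_enable):
        # Called once per clock edge; the stored value changes only when enabled.
        if write_enable:
            self.value = data_in


acc = Register()
acc.clock(alu_add(acc.value, 5), write_enable=True)   # value becomes 5
acc.clock(alu_add(acc.value, 7), write_enable=False)  # value is held
print(acc.value)
```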
State Elements
Pentium Processor Die
° State
• Registers
• Memory
° Control ROM
° Combinational logic (Compute)
Abstract View of the Datapath
[Figure: abstract view of the datapath: PC, instruction memory, registers, ALU, data memory]
Single Cycle Implementation
° This simple processor can execute an ALU instruction, access memory, or compute the next instruction's address in a single cycle.
Program Counter
If each instruction occupies 4 memory locations, then Next PC <= PC + 4
PC Datapath – Branch Offset
PC <= PC + Branch Offset
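The two PC updates shown so far (sequential and branch) can be sketched as one function. This follows the formulas exactly as written on these slides; the function name, the 32-bit wrap, and the `take_branch` flag are assumptions for illustration.

```python
def next_pc(pc, take_branch=False, branch_offset=0):
    """Return the next PC: PC + 4 normally, PC + branch offset when branching."""
    if take_branch:
        return (pc + branch_offset) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x1000)))                                      # sequential
print(hex(next_pc(0x1000, take_branch=True, branch_offset=-8)))  # backward branch
```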
Abstract View After PC Basic Implementation
The Register File
° Arithmetic & logical instructions (R-type) read the contents of 2 registers, perform an ALU operation, and write the result back to a register.
° Registers are stored in the register file. The register file has inputs to specify the registers, outputs for the data read, an input for the data to be written, and 1 control signal (RegWrite) that decides whether the data is written. In addition, we need an ALU to perform the operations.
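A minimal software model of such a register file is sketched below. It assumes 32 registers of 32 bits each (the lecture does not state the sizes here), and the class and method names are hypothetical.

```python
class RegisterFile:
    """Two read ports, one write port gated by the RegWrite control signal."""

    def __init__(self, num_registers=32):  # register count is an assumption
        self.regs = [0] * num_registers

    def read(self, read_reg1, read_reg2):
        """Return (Read data 1, Read data 2) for the two register numbers."""
        return self.regs[read_reg1], self.regs[read_reg2]

    def write(self, write_reg, write_data, reg_write):
        """Store write_data only when RegWrite is asserted."""
        if reg_write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 30, reg_write=True)
rf.write(10, 12, reg_write=True)
a, b = rf.read(9, 10)
rf.write(8, a + b, reg_write=True)   # R-type style: read two, add, write back
rf.write(8, 0, reg_write=False)      # ignored because RegWrite is low
print(rf.read(8, 8))
```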
The Register File
[Figure: register file (inputs: Read register 1, Read register 2, Write register, Write data, RegWrite; outputs: Read data 1, Read data 2) feeding the ALU (3-bit ALU operation control; outputs: Zero, ALU result)]
R-Type Instructions
• Assembly (e.g., register-register signed addition)
ADD rd_reg rs_reg rt_reg
• Machine encoding
• Semantics
if MEM[PC] == ADD rd rs rt
GPR[rd] ← GPR[rs] + GPR[rt]
PC ← PC + 4
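The semantics above can be traced with a short sketch. It assumes the standard MIPS R-type field layout (6-bit opcode, 5-bit rs, rt, rd and shamt, 6-bit funct), since the encoding figure did not survive in this transcript; overflow handling is omitted and the function names are illustrative.

```python
def decode_r_type(word):
    """Split a 32-bit word into R-type fields (standard MIPS layout assumed)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def execute_add(word, gpr, pc):
    """GPR[rd] <- GPR[rs] + GPR[rt]; returns PC + 4. Overflow is ignored here."""
    f = decode_r_type(word)
    gpr[f["rd"]] = (gpr[f["rs"]] + gpr[f["rt"]]) & 0xFFFFFFFF
    return pc + 4


gpr = [0] * 32
gpr[17], gpr[18] = 30, 12
pc = execute_add(0x02324020, gpr, pc=0x00400000)  # add with rd=8, rs=17, rt=18
print(gpr[8], hex(pc))
```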
ADD rd rs rt
Datapath for Add
I-Type ALU Instructions
° Assembly (e.g., register-immediate signed additions)
ADDI rt_reg rs_reg immediate_16
° Machine encoding
° Semantics
if MEM[PC] == ADDI rt rs immediate
GPR[rt] ← GPR[rs] + sign-extend (immediate)
PC ← PC + 4
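Sign extension is the only new operation ADDI needs beyond the R-type path. A minimal sketch, assuming 32-bit registers held in a Python list; the function names are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def execute_addi(rt, rs, imm16, gpr, pc):
    """GPR[rt] <- GPR[rs] + sign-extend(immediate); returns PC + 4."""
    gpr[rt] = (gpr[rs] + sign_extend16(imm16)) & 0xFFFFFFFF
    return pc + 4


regs = [0] * 32
regs[9] = 10
new_pc = execute_addi(rt=8, rs=9, imm16=0xFFFD, gpr=regs, pc=0)  # 0xFFFD is -3
print(regs[8], new_pc)
```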
ADDI rt_reg rs_reg immediate_16
Datapath for R and I-Type ALU Instructions
Data Memory
° The element needed to implement load and store instructions is the data memory. In addition, we use the existing ALU to compute the address to access.
° The data memory has two x-bit inputs (the address and the write data) and one x-bit output (the read data). In addition, it has 2 control lines: MemWrite and MemRead.
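The data memory interface can be modeled roughly as follows. The sketch assumes x = 32 and uses a dictionary as sparse storage; the class name and the choice to return `None` when MemRead is low are illustrative assumptions.

```python
class DataMemory:
    """Address and write-data inputs, one read-data output, two control lines."""

    def __init__(self):
        self.words = {}  # sparse storage: address -> 32-bit word

    def access(self, address, write_data, mem_read, mem_write):
        if mem_write:                       # store: MemWrite asserted
            self.words[address] = write_data & 0xFFFFFFFF
        if mem_read:                        # load: MemRead asserted
            return self.words.get(address, 0)
        return None                         # read-data output unused


dmem = DataMemory()
dmem.access(0x100, 42, mem_read=False, mem_write=True)
print(dmem.access(0x100, 0, mem_read=True, mem_write=False))
```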
Data Memory
[Figure: register file, sign-extend unit (16 to 32 bits), ALU, and data memory (Address, Write data, Read data) with control signals RegWrite, ALU operation (3 bits), MemRead and MemWrite]
Load Instruction
° Assembly (e.g., load 4-byte word)
LW rt_reg offset_16 (base_reg)
° Machine encoding
° Semantics
if MEM[PC] == LW rt offset_16 (base)
EA = sign-extend(offset) + GPR[base]
GPR[rt] ← MEM[ translate(EA) ]
PC ← PC + 4
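The load semantics can be followed step by step in a short sketch. It assumes that translate(EA) is the identity (no address translation) and models memory as a dictionary of words; both are simplifications, and the function name is hypothetical.

```python
def execute_lw(rt, offset16, base, gpr, mem, pc):
    """EA = sign-extend(offset) + GPR[base]; GPR[rt] <- MEM[EA]; returns PC + 4."""
    offset16 &= 0xFFFF
    offset = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    ea = (offset + gpr[base]) & 0xFFFFFFFF
    gpr[rt] = mem.get(ea, 0)  # translate(EA) assumed to be the identity
    return pc + 4


gpr = [0] * 32
gpr[9] = 0x1008
mem = {0x1004: 99}
pc = execute_lw(rt=8, offset16=0xFFFC, base=9, gpr=gpr, mem=mem, pc=0)  # offset -4
print(gpr[8], pc)
```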
LW Datapath
Branch Equal
° The beq (branch if equal) instruction has 3 operands: two registers that are compared for equality, and an n-bit offset used to compute the branch target address relative to the PC.
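A rough sketch of beq follows, using the datapath labels from this slide (sign extend, shift left 2, PC + 4, Zero). It assumes a 16-bit offset and 32-bit registers; the function name is hypothetical.

```python
def execute_beq(rs, rt, offset16, gpr, pc):
    """Return the next PC for beq: the branch target if GPR[rs] == GPR[rt]."""
    offset16 &= 0xFFFF
    offset = offset16 - 0x10000 if offset16 & 0x8000 else offset16  # sign extend
    zero = ((gpr[rs] - gpr[rt]) & 0xFFFFFFFF) == 0   # ALU subtracts, sets Zero
    pc_plus_4 = pc + 4
    branch_target = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF  # shift left 2, add
    return branch_target if zero else pc_plus_4


gpr = [0] * 32
gpr[1] = gpr[2] = 5
print(hex(execute_beq(1, 2, 3, gpr, pc=0x100)))       # taken, forward
print(hex(execute_beq(1, 2, 0xFFFE, gpr, pc=0x100)))  # taken, backward (offset -2)
```

The offset counts instructions rather than bytes, which is why it is shifted left by 2 before being added.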
Branch Equal
[Figure: beq datapath: the register file feeds the ALU, whose Zero output goes to the branch control logic; the 16-bit offset is sign-extended to 32 bits, shifted left by 2, and added to PC + 4 (from the instruction datapath) to form the branch target]
Unconditional Jump
° Assembly
J immediate_26
° Machine encoding
° Semantics
if MEM[PC]==J immediate26
target = { PC[31:28], immediate26, 2’b00 }
PC ← target
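The concatenation in the target formula maps directly onto masks and shifts. A minimal sketch, taking the upper four bits from the PC exactly as written above; the function name is hypothetical.

```python
def jump_target(pc, immediate26):
    """target = { PC[31:28], immediate26, 2'b00 } as a 32-bit address."""
    upper = pc & 0xF0000000                    # PC[31:28] kept in place
    middle = (immediate26 & 0x03FFFFFF) << 2   # 26-bit field followed by 00
    return upper | middle


print(hex(jump_target(0x10000008, 0x0000003)))
```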
Unconditional Jump Datapath
Combining ALU and Memory Instructions
° The ALU datapath and the memory datapath are similar. The differences are:
• The second input to the ALU is a register (R-type) or the sign-extended offset (I-type).
• The value stored into the destination register comes from the ALU (R-type) or from memory (load).
° Using 2 multiplexers (Mux), controlled by ALUSrc and MemtoReg, we can combine both datapaths.
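The two multiplexers can be sketched as simple selection functions. The signal names ALUSrc and MemtoReg come from the datapath figure; the polarity (0 selects the register or ALU result, 1 selects the immediate or memory data) is an assumption, as are the function names.

```python
def mux(select, input0, input1):
    """2-to-1 multiplexer: pass input1 when select is 1, otherwise input0."""
    return input1 if select else input0


def alu_second_input(read_data2, sign_extended_imm, alu_src):
    """ALUSrc chooses between a register value and the sign-extended offset."""
    return mux(alu_src, read_data2, sign_extended_imm)


def register_write_data(alu_result, memory_read_data, mem_to_reg):
    """MemtoReg chooses between the ALU result and the data read from memory."""
    return mux(mem_to_reg, alu_result, memory_read_data)


print(alu_second_input(7, 100, alu_src=0))       # R-type: second register
print(register_write_data(5, 9, mem_to_reg=1))   # load: data from memory
```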
Combining ALU and Memory Instructions
[Figure: combined datapath with two muxes: ALUSrc selects the second ALU input (Read data 2 or the sign-extended value), MemtoReg selects the register write data (ALU result or memory Read data); other control signals: RegWrite, ALU operation (3 bits), MemRead, MemWrite]
The Complete Datapath
[Figure: complete datapath: PC, instruction memory, adder for PC + 4, adder for the branch target (sign extend, shift left 2), register file, ALU, data memory, and three muxes; control signals: PCSrc, ALUSrc, MemtoReg, RegWrite, ALU operation (3 bits), MemRead, MemWrite]
Complete Datapath
What’s Wrong with Single Cycle?
° All instructions run at the speed of the slowest instruction.
° Adding a long instruction can hurt performance
• What if you wanted to include multiply?
° You cannot reuse any parts of the processor
• We have 3 different adders: one to calculate PC+4, one for PC+4+offset, and the ALU
° No profit in making the common case fast
• Since every instruction runs at the slowest instruction speed
- This is particularly important for loads, as we will see later
What’s Wrong with Single Cycle?
1 ns – Register read/write time
2 ns – ALU/adder
2 ns – memory access
0 ns – MUX, PC access, sign extend, ROM
Instr  Get instr  Read reg  ALU operation  Mem   Write reg  Total
add    2 ns       1 ns      2 ns           -     1 ns       6 ns
beq    2 ns       1 ns      2 ns           -     -          5 ns
sw     2 ns       1 ns      2 ns           2 ns  -          7 ns
lw     2 ns       1 ns      2 ns           2 ns  1 ns       8 ns
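The per-instruction latencies on this slide can be recomputed from the component delays. The sketch below uses the delays listed above; the dictionary and step names are illustrative.

```python
# Component delays in ns, as listed on this slide (muxes, PC, sign extend = 0).
DELAY_NS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

# Components on each instruction's path through the datapath.
PATH = {
    "add": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
    "sw":  ["fetch", "reg_read", "alu", "mem"],
    "lw":  ["fetch", "reg_read", "alu", "mem", "reg_write"],
}

latency_ns = {name: sum(DELAY_NS[c] for c in path) for name, path in PATH.items()}

# A single-cycle clock must be long enough for the slowest instruction.
clock_cycle_ns = max(latency_ns.values())

print(latency_ns, clock_cycle_ns)
```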
Computing Execution Time
Assume: 100 instructions executed
25% of instructions are loads,
10% of instructions are stores,
45% of instructions are adds, and
20% of instructions are branches.
Single-cycle execution:
100 * 8ns = 800 ns
Optimal execution:
25*8ns + 10*7ns + 45*6ns + 20*5ns = 640 ns
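The same arithmetic can be checked with a few lines of code. The sketch uses the instruction mix and latencies given above; the variable names are illustrative, and the final ratio is simply derived from the two totals.

```python
LATENCY_NS = {"lw": 8, "sw": 7, "add": 6, "beq": 5}   # per-instruction latency
MIX = {"lw": 25, "sw": 10, "add": 45, "beq": 20}      # counts out of 100

total_instructions = sum(MIX.values())

# Single cycle: every instruction takes as long as the slowest one.
single_cycle_ns = total_instructions * max(LATENCY_NS.values())

# Optimal: every instruction takes only as long as it needs.
optimal_ns = sum(MIX[name] * LATENCY_NS[name] for name in MIX)

print(single_cycle_ns, optimal_ns, single_cycle_ns / optimal_ns)
```

Under these assumptions the single-cycle design is 1.25 times slower than the optimal execution.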