CS110 Computer Architecture
Lecture 10: Datapath
Instructor: Sören Schwertfeger
http://shtech.org/courses/ca/
School of Information Science and Technology SIST
ShanghaiTech University
Slides based on UC Berkeley's CS61C
Review
• Timing constraints for Finite State Machines
– Setup time, hold time, clock-to-Q time
• Use muxes to select among inputs
– S control bits select from 2^S inputs
– Each input can be n bits wide, independent of S
– Can implement muxes hierarchically
• ALU can be implemented using a mux
– Coupled with basic block elements
– Adder/Subtractor & AND & OR & shift
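Components of a Computer
[Figure: a Processor (Control plus a Datapath containing the PC, Registers, and the Arithmetic & Logic Unit (ALU)) connected to Memory (Program and Data, organized in Bytes) and to Input and Output. Processor-Memory Interface signals: Enable?, Read/Write, Address, Write Data, Read Data. Input and Output attach through I/O-Memory Interfaces.]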
The CPU
• Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making)
• Datapath: portion of the processor that contains the hardware necessary to perform operations required by the processor
• Control: portion of the processor (also in hardware) that tells the datapath what needs to be done
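One-Instruction-Per-Cycle RISC-V Machine
• One clock tick => one instruction
• Current state outputs => inputs to combinational logic => outputs settle at the next-state values before the next clock edge
• Rising clock edge:
– all state elements are updated with the combinational logic outputs
– execution moves to the next clock cycle
[Figure: the state elements (PC, Instr. Mem, Registers, Data Mem) feed a Combinational Logic block whose outputs return to the state elements; all state elements share one clock.]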
Datapath and Control
• Datapath designed to support data transfers required by instructions
• Controller causes correct transfers to happen
[Figure: datapath with PC, +4 adder, instruction memory, registers (rs, rt, rd ports), imm, ALU, and data memory; a Controller receives opcode and funct from the instruction.]
Stages of the Datapath: Overview
• Problem: a single, “monolithic” block that “executes an instruction” (performs all necessary operations beginning with fetching the instruction) would be too bulky and inefficient
• Solution: break up the process of “executing an instruction” into stages, and then connect the stages to create the whole datapath
– smaller stages are easier to design
– easy to optimize (change) one stage without touching the others (modularity)
Five Stages of Instruction Execution
• Stage 1: Instruction Fetch (IF)
• Stage 2: Instruction Decode (ID)
• Stage 3: Execute (EX): ALU (Arithmetic-Logic Unit)
• Stage 4: Memory Access (MEM)
• Stage 5: Register Write (WB)
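Stages of Execution on Datapath
[Figure: datapath annotated with the five stages, left to right: 1. Instruction Fetch (PC, +4, instruction memory), 2. Decode/Register Read (registers with rs, rt, rd ports; imm), 3. Execute (ALU), 4. Memory (data memory), 5. Register Write.]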
Stages of Execution (1/5)
• There is a wide variety of RISC-V instructions: so what general steps do they have in common?
• Stage 1: Instruction Fetch
– no matter what the instruction, the 32-bit instruction word must first be fetched from memory (the cache-memory hierarchy)
– also, this is where we increment the PC (that is, PC = PC + 4, to point to the next instruction: byte addressing, so +4)
Stages of Execution (2/5)
• Stage 2: Instruction Decode
– upon fetching the instruction, we next gather data from the fields (decode all necessary instruction data)
– first, read the opcode to determine instruction type and field lengths
– second, read in data from all necessary registers
  • for add, read two registers
  • for addi, read one register
– third, generate the immediates
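The field-gathering part of decode can be sketched in software as plain bit slicing. This is only an illustrative Python sketch, not the hardware: the function name `decode_fields` and the dictionary layout are made up here, while the bit positions are the standard RV32I R-format positions that appear later in this lecture.

```python
def decode_fields(inst: int) -> dict:
    """Slice a 32-bit RV32I instruction word into its fixed-position fields."""
    return {
        "opcode": inst & 0x7F,          # inst[6:0]
        "rd":     (inst >> 7) & 0x1F,   # inst[11:7]
        "funct3": (inst >> 12) & 0x7,   # inst[14:12]
        "rs1":    (inst >> 15) & 0x1F,  # inst[19:15]
        "rs2":    (inst >> 20) & 0x1F,  # inst[24:20]
        "funct7": (inst >> 25) & 0x7F,  # inst[31:25]
    }

# add x1, x2, x3 assembles to the word 0x003100B3
fields = decode_fields(0x003100B3)
print(fields)
```

Because every field sits at a fixed position, all slices can be taken in parallel; the opcode then tells the control logic which of them are meaningful for the instruction at hand.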
Stages of Execution (3/5)
• Stage 3: ALU (Arithmetic-Logic Unit)
– the real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |)
– what about loads and stores?
  • lw t0,40(t1)
  • the address we are accessing in memory = the value in t1 PLUS the value 40
  • so we do this addition in this stage
– also does stuff for other instructions…
Stages of Execution (4/5)
• Stage 4: Memory Access
– actually only the load and store instructions do anything during this stage; the others remain idle during this stage or skip it altogether
– since these instructions have a unique step, we need this extra stage to account for them
– as a result of the cache system, this stage is expected to be fast
Stages of Execution (5/5)
• Stage 5: Register Write
– most instructions write the result of some computation into a register
– examples: arithmetic, logical, shifts, loads, jumps
– what about stores, branches?
  • don’t write anything into a register at the end
  • these remain idle during this fifth stage or skip it altogether
Datapath Components: Combinational
• Combinational Elements
• Storage Elements + Clocking Methodology
• Building Blocks
[Figure: three 32-bit combinational blocks. Adder: inputs A, B, and CarryIn; outputs Sum and CarryOut. Multiplexer (MUX): inputs A and B, control Select, output Y. ALU: inputs A and B, control OP, output Result.]
Datapath Elements: State and Sequencing (1/3)
• Register
• Write Enable:
– Negated (or deasserted) (0): Data Out will not change
– Asserted (1): Data Out will become Data In on positive edge of clock
[Figure: N-bit register with inputs Data In (N bits), Write Enable, and clk, and output Data Out (N bits).]
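The Write Enable behaviour can be mimicked with a tiny software model. This is a minimal sketch under stated assumptions: the class name `Register` and the method `rising_edge` are hypothetical, and a method call stands in for the positive clock edge.

```python
class Register:
    """N-bit register: Data Out changes only on a rising edge with Write Enable = 1."""

    def __init__(self, width: int = 32):
        self.mask = (1 << width) - 1
        self.data_out = 0

    def rising_edge(self, data_in: int, write_enable: int) -> None:
        # Deasserted (0): Data Out keeps its old value.
        # Asserted (1): Data Out becomes Data In, truncated to N bits.
        if write_enable:
            self.data_out = data_in & self.mask

r = Register(width=8)
r.rising_edge(0x5A, write_enable=0)
held = r.data_out      # still the initial value
r.rising_edge(0x5A, write_enable=1)
loaded = r.data_out    # now equal to Data In
```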
Datapath Elements: State and Sequencing (2/3)
• Register file (regfile, RF) consists of 32 registers
– Two 32-bit output busses: busA and busB
– One 32-bit input bus: busW
– In one clock cycle can read two registers and write another!
• Register is selected by:
– RA (number) selects the register to put on busA (data)
– RB (number) selects the register to put on busB (data)
– RW (number) selects the register to be written via busW (data) when Write Enable is 1
• Clock input (clk)
– Clk input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
  • RA or RB valid ⇒ busA or busB valid after “access time.”
[Figure: 32 x 32-bit register file with 5-bit select inputs RW, RA, RB; 32-bit input busW; 32-bit outputs busA and busB; Write Enable and Clk inputs.]
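A rough software model of the register file makes the read/write asymmetry concrete: reads are an ordinary function of RA and RB, while writes only happen on the clock edge when Write Enable is 1. The names `RegFile`, `read`, and `rising_edge` are invented for this sketch, and x0 is not special-cased here.

```python
class RegFile:
    """32 x 32-bit register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra: int, rb: int):
        """Combinational read: returns (busA, busB) without involving the clock."""
        return self.regs[ra], self.regs[rb]

    def rising_edge(self, rw: int, bus_w: int, write_enable: int) -> None:
        """Clocked write: register RW takes busW only when Write Enable is 1."""
        if write_enable:
            self.regs[rw] = bus_w & 0xFFFFFFFF

rf = RegFile()
rf.rising_edge(rw=5, bus_w=123, write_enable=1)
rf.rising_edge(rw=6, bus_w=456, write_enable=0)  # ignored: Write Enable is 0
bus_a, bus_b = rf.read(ra=5, rb=6)
```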
Datapath Elements: State and Sequencing (3/3)
• “Magic” Memory
– One input bus: Data In
– One output bus: Data Out
• Memory word is found by:
– For Read: Address selects the word to put on Data Out
– For Write: Set Write Enable = 1: address selects the memory word to be written via the Data In bus
• Clock input (CLK)
– CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block: Address valid ⇒ Data Out valid after “access time”
[Figure: memory block with inputs Address, Data In (32 bits), Write Enable, and Clk, and output Data Out (32 bits).]
State Required by RV32I ISA
Each instruction reads and updates this state during execution:
• Registers (x0..x31)
– Register file (regfile) Reg holds 32 registers x 32 bits/register: Reg[0]..Reg[31]
– First register read specified by rs1 field in instruction
– Second register read specified by rs2 field in instruction
– Write register (destination) specified by rd field in instruction
– x0 is always 0 (writes to Reg[0] are ignored)
• Program Counter (PC)
– Holds address of current instruction
• Memory (MEM)
– Holds both instructions & data, in one 32-bit byte-addressed memory space
– We’ll use separate memories for instructions (IMEM) and data (DMEM)
  • These are placeholders for instruction and data caches
– Instructions are read (fetched) from instruction memory (assume IMEM read-only)
– Load/store instructions access data memory
Review: Complete RV32I ISA
• Need datapath and control to implement these instructions
[Figure: RV32I instruction listing; annotation: Not in CA]
Implementing the add instruction
add rd, rs1, rs2
• Instruction makes two changes to machine’s state:
– Reg[rd] = Reg[rs1] + Reg[rs2]
– PC = PC + 4
Bits    31-25    24-20  19-15  14-12   11-7  6-0
Field   funct7   rs2    rs1    funct3  rd    opcode
Width   7        5      5      3       5     7
add     0000000  rs2    rs1    000     rd    0110011
Meaning add      rs2    rs1    add     rd    Reg-Reg OP
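The two state changes made by add can be written out as a small Python sketch. The function name `step_add` and the list-based register file are assumptions made for illustration; the encoding 0x003100B3 is add x1, x2, x3 assembled by hand from the field layout above.

```python
MASK32 = 0xFFFFFFFF

def step_add(pc: int, reg: list, inst: int) -> int:
    """Execute one add: update Reg[rd] in place and return the new PC."""
    rd = (inst >> 7) & 0x1F
    rs1 = (inst >> 15) & 0x1F
    rs2 = (inst >> 20) & 0x1F
    result = (reg[rs1] + reg[rs2]) & MASK32   # wrap around to 32 bits
    if rd != 0:                               # writes to x0 are ignored
        reg[rd] = result
    return pc + 4

reg = [0] * 32
reg[2], reg[3] = 7, 35
new_pc = step_add(0x1000, reg, 0x003100B3)    # add x1, x2, x3
```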
Datapath for add
[Figure: the PC (clocked) addresses IMEM; a +4 adder computes pc+4, which is written back to the PC. Inst[19:15], Inst[24:20], and Inst[11:7] drive the Reg[] ports AddrA, AddrB, and AddrD. DataA = Reg[rs1] and DataB = Reg[rs2] feed the ALU, whose output alu returns to DataD. Control logic decodes Inst[31:0] and sets RegWriteEnable (RegWEn) = 1.]
Reg[rd] = Reg[rs1] + Reg[rs2]
PC = PC + 4
Timing Diagram for add
[Figure: waveforms over two clock cycles, drawn above the datapath for add.]
Signal       Cycle 1         Cycle 2
PC           1000            1004
PC+4         1004            1008
inst[31:0]   add x1,x2,x3    add x6,x7,x9
Reg[rs1]     Reg[2]          Reg[7]
Reg[rs2]     Reg[3]          Reg[9]
alu          Reg[2]+Reg[3]   Reg[7]+Reg[9]
Reg[1]       ???             Reg[2]+Reg[3]
Implementing the sub instruction
sub rd, rs1, rs2
• Almost the same as add, except now have to subtract operands instead of adding them
• inst[30] selects between add and subtract
Bits   31-25    24-20  19-15  14-12  11-7  6-0
add    0000000  rs2    rs1    000    rd    0110011
sub    0100000  rs2    rs1    000    rd    0110011
Datapath for add/sub
[Figure: same datapath as for add; the control logic decodes Inst[31:0] and drives two signals: RegWEn (1=Write, 0=NoWrite) and ALUSel (add=0/sub=1).]
Implementing other R-Format instructions
• All implemented by decoding funct3 and funct7 fields and selecting appropriate ALU function
funct7   rs2  rs1  funct3  rd  opcode   instruction
0000000  rs2  rs1  000     rd  0110011  add
0100000  rs2  rs1  000     rd  0110011  sub
0000000  rs2  rs1  001     rd  0110011  sll
0000000  rs2  rs1  010     rd  0110011  slt
0000000  rs2  rs1  011     rd  0110011  sltu
0000000  rs2  rs1  100     rd  0110011  xor
0000000  rs2  rs1  101     rd  0110011  srl
0100000  rs2  rs1  101     rd  0110011  sra
0000000  rs2  rs1  110     rd  0110011  or
0000000  rs2  rs1  111     rd  0110011  and
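The table above can be read as a selection function for the ALU. The following Python sketch is one possible software rendering, not the actual control logic: the names `alu_r_type` and `to_signed` are hypothetical, and taking the shift amount from the low five bits of the second operand follows the RV32I specification rather than anything stated on the slide.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu_r_type(funct7: int, funct3: int, a: int, b: int) -> int:
    """Pick the ALU function from funct3, with inst[30] choosing sub and sra."""
    a &= MASK32
    b &= MASK32
    shamt = b & 0x1F            # assumption from the RV32I spec: low 5 bits of rs2
    alt = (funct7 >> 5) & 1     # inst[30]
    if funct3 == 0b000:
        res = a - b if alt else a + b                          # sub / add
    elif funct3 == 0b001:
        res = a << shamt                                       # sll
    elif funct3 == 0b010:
        res = int(to_signed(a) < to_signed(b))                 # slt
    elif funct3 == 0b011:
        res = int(a < b)                                       # sltu
    elif funct3 == 0b100:
        res = a ^ b                                            # xor
    elif funct3 == 0b101:
        res = to_signed(a) >> shamt if alt else a >> shamt     # sra / srl
    elif funct3 == 0b110:
        res = a | b                                            # or
    else:
        res = a & b                                            # and
    return res & MASK32

difference = alu_r_type(0b0100000, 0b000, 2, 3)   # sub: 2 - 3 wraps to 0xFFFFFFFF
```

The same comparison of 0xFFFFFFFF against 1 gives 1 for slt (it is -1 when signed) and 0 for sltu, which is why the two need separate funct3 codes.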
Implementing I-Format - addi instruction
• RISC-V Assembly Instruction: addi x15,x1,-50
Bits      31-20         19-15  14-12   11-7   6-0
Field     imm[11:0]     rs1    funct3  rd     opcode
Width     12            5      3       5      7
Encoding  111111001110  00001  000     01111  0010011
Meaning   imm=-50       rs1=1  add     rd=15  OP-Imm
Datapath for add/sub
[Figure: the add/sub datapath with control signals RegWEn (1=Write, 0=NoWrite) and ALUSel (add=0/sub=1), annotated “Immediate should be here”.]
Adding addi to Datapath
[Figure: the add/sub datapath extended with a 2-input mux in front of the ALU that chooses between Reg[rs2] (input 0) and Imm[31:0] (input 1). New control signal: BSel (rs2=0/Imm=1), alongside RegWEn (1=Write, 0=NoWrite) and ALUSel (add=0/sub=1).]
Adding addi to Datapath
[Figure: the datapath with the BSel mux (rs2=0/Imm=1) plus an Imm. Gen block that takes Inst[31:20] and produces Imm[31:0]. Control settings for addi: RegWEn=1, ALUSel=add, BSel=1, ImmSel=I.]
I-Format immediates
• High 12 bits of instruction (inst[31:20]) copied to low 12 bits of immediate (imm[11:0])
• Immediate is sign-extended by copying value of inst[31] to fill the upper 20 bits of the immediate value (imm[31:12])
[Figure: Imm. Gen block with input inst[31:20], control ImmSel=I, and output imm[31:0]. Output layout: the upper bits are copies of inst[31] (sign extension), followed by inst[30:20].]
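The two bullets translate directly into a shift and a conditional OR. A minimal Python sketch follows, with `imm_gen_i` and `as_signed32` as made-up helper names; the test word 0xFCE08793 is the addi x15,x1,-50 encoding from the earlier slide written in hex.

```python
def imm_gen_i(inst: int) -> int:
    """Build imm[31:0] for an I-format instruction as a 32-bit pattern."""
    imm = (inst >> 20) & 0xFFF        # inst[31:20] -> imm[11:0]
    if inst & 0x80000000:             # inst[31] set: fill imm[31:12] with ones
        imm |= 0xFFFFF000
    return imm

def as_signed32(x: int) -> int:
    """Read a 32-bit pattern as a two's-complement number."""
    return x - (1 << 32) if x & 0x80000000 else x

imm = imm_gen_i(0xFCE08793)           # addi x15, x1, -50
value = as_signed32(imm)
```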
R+I Datapath
[Figure: datapath with PC, +4 adder, IMEM, Reg[], Imm. Gen (input Inst[31:20], output Imm[31:0]), the B mux (0 = Reg[rs2], 1 = Imm), and the ALU. Control logic decodes Inst[31:0] and drives ImmSel, RegWEn, BSel, and ALUSel.]
• Works for all other I-format arithmetic instructions (slti, sltiu, andi, ori, xori, slli, srli, srai) just by changing ALUSel
Peer Instruction
1) Program counter is a register
2) We should use the main ALU to compute PC=PC+4 in order to save some gates
3) The ALU is a synchronous state element
Answers (statements 1, 2, 3):
A: FFF   B: FFT   C: FTF   D: FTT   E: TFF   F: TFT   G: TTF   H: TTT
Add lw
• RISC-V Assembly Instruction (I-type): lw x14, 8(x2)
Bits      31-20         19-15  14-12   11-7   6-0
Field     imm[11:0]     rs1    funct3  rd     opcode
Width     12            5      3       5      7
Role      offset[11:0]  base   width   dest   LOAD
Encoding  000000001000  00010  010     01110  0000011
Meaning   imm=+8        rs1=2  LW      rd=14  LOAD
• The 12-bit signed immediate is added to the base address in register rs1 to form the memory address
• This is very similar to the add-immediate operation, but it is used to create an address, not the final result
• The value loaded from memory is stored in register rd
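The three bullets can be traced end to end in a few lines of Python. This is a sketch under assumptions: `exec_lw` is a hypothetical helper, data memory is modelled as a `bytearray`, and bytes are read in little-endian order as on standard RISC-V systems. The word 0x00812703 is the lw x14, 8(x2) encoding above in hex.

```python
MASK32 = 0xFFFFFFFF

def exec_lw(reg: list, dmem: bytearray, inst: int) -> int:
    """Execute lw: Reg[rd] = Mem[Reg[rs1] + imm]. Returns the address used."""
    rd = (inst >> 7) & 0x1F
    rs1 = (inst >> 15) & 0x1F
    imm = (inst >> 20) & 0xFFF
    if imm & 0x800:                    # sign-extend the 12-bit immediate
        imm -= 0x1000
    addr = (reg[rs1] + imm) & MASK32   # same addition as addi, used as an address
    word = int.from_bytes(dmem[addr:addr + 4], "little")
    if rd != 0:                        # writes to x0 are ignored
        reg[rd] = word
    return addr

reg = [0] * 32
reg[2] = 16                            # base address in x2
dmem = bytearray(64)
dmem[24:28] = (0xDEADBEEF).to_bytes(4, "little")
addr = exec_lw(reg, dmem, 0x00812703)  # lw x14, 8(x2)
```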
Adding lw to Datapath
[Figure: the R+I datapath extended with DMEM. The ALU output (alu) drives the DMEM addr input; the DMEM read data (DataR, labelled mem) and alu feed a write-back mux whose output wb drives DataD. Control settings for lw: ImmSel=I, RegWEn=1, BSel=1, ALUSel=Add, MemRW=Read, WBSel=0.]
All RV32 Load Instructions
• Supporting the narrower loads requires additional logic to extract the correct byte/halfword from the value loaded from memory, and sign- or zero-extend the result to 32 bits before writing back to register file.
– It is just a mux mod
• funct3 field encodes size and ‘signedness’ of load data
imm[11:0]  rs1  000  rd  0000011  lb
imm[11:0]  rs1  001  rd  0000011  lh
imm[11:0]  rs1  010  rd  0000011  lw
imm[11:0]  rs1  100  rd  0000011  lbu
imm[11:0]  rs1  101  rd  0000011  lhu
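The extract-and-extend logic for narrow loads might look like the following in software. It is a sketch built on assumptions not stated on the slide: the memory is taken to return the aligned 32-bit word containing the address, byte order is little-endian, and halfword accesses do not straddle a word boundary. The name `load_extend` is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def load_extend(word: int, addr: int, funct3: int) -> int:
    """Select a byte/halfword/word from `word` and extend it to 32 bits."""
    size_code = funct3 & 0b011        # 0 = byte, 1 = halfword, 2 = word
    unsigned = (funct3 >> 2) & 1      # funct3[2] set for lbu / lhu
    if size_code == 2:
        return word & MASK32          # lw: pass through
    bits = 8 if size_code == 0 else 16
    shift = (addr & 0b11) * 8         # byte lane selected by the low address bits
    value = (word >> shift) & ((1 << bits) - 1)
    if not unsigned and value >> (bits - 1):
        value |= MASK32 & ~((1 << bits) - 1)   # sign-extend with ones
    return value

word = 0x8070F680                     # bytes at addr+0..3: 0x80, 0xF6, 0x70, 0x80
lb_result = load_extend(word, 1, 0b000)
lbu_result = load_extend(word, 1, 0b100)
```

The byte 0xF6 comes back as 0xFFFFFFF6 for lb and 0x000000F6 for lbu; only the fill of the upper bits differs.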
Adding sw Instruction
• sw: Reads two registers, rs1 for base memory address and rs2 for data to be stored, as well as the immediate offset!
sw x14, 8(x2)
Bits      31-25           24-20   19-15  14-12   11-7           6-0
Field     imm[11:5]       rs2     rs1    funct3  imm[4:0]       opcode
Width     7               5       5      3       5              7
Role      offset[11:5]    src     base   width   offset[4:0]    STORE
Encoding  0000000         01110   00010  010     01000          0100011
Meaning   offset[11:5]=0  rs2=14  rs1=2  SW      offset[4:0]=8  STORE
combined 12-bit offset = 0000000 01000 = 8
Datapath with lw
[Figure: datapath supporting R-format, I-format arithmetic, and lw. Imm. Gen takes Inst[31:20]; the ALU output (alu) addresses DMEM; a write-back mux controlled by WBSel selects between the DMEM read data (mem) and alu to form wb, which drives DataD. Control signals: ImmSel, RegWEn, BSel, ALUSel, MemRW, WBSel.]
Adding sw to Datapath
[Figure: the lw datapath with a DataW input added to DMEM, and Imm. Gen now taking Inst[31:7]. Control settings for sw: ImmSel=S, RegWEn=0, BSel=1, ALUSel=Add, MemRW=Write, WBSel=* (*=Don’t care).]
I+S Immediate Generation
• Just need a 5-bit mux to select between two positions where low five bits of immediate can reside in instruction
• Other bits in immediate are wired to fixed positions in instruction
Instruction formats (inst[31:0]):
I: imm[11:0] (31-20), rs1 (19-15), funct3 (14-12), rd (11-7), I-opcode (6-0)
S: imm[11:5] (31-25), rs2 (24-20), rs1 (19-15), funct3 (14-12), imm[4:0] (11-7), S-opcode (6-0)
Output imm[31:0]:
imm bits  31-11                      10-5         4-0
I         inst[31] (sign extension)  inst[30:25]  inst[24:20]
S         inst[31] (sign extension)  inst[30:25]  inst[11:7]
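The shared wiring plus one 5-bit mux can be sketched in Python as follows. The function name `imm_gen` and the use of the strings "I" and "S" for ImmSel are illustrative assumptions; the test words are the addi and sw examples from earlier slides written in hex.

```python
def imm_gen(inst: int, imm_sel: str) -> int:
    """Form imm[31:0] for I- or S-format; only imm[4:0] depends on the format."""
    sign = (inst >> 31) & 1
    upper = 0x1FFFFF if sign else 0      # imm[31:11]: 21 copies of inst[31]
    mid = (inst >> 25) & 0x3F            # imm[10:5] = inst[30:25], fixed wiring
    if imm_sel == "I":
        low = (inst >> 20) & 0x1F        # imm[4:0] = inst[24:20]
    elif imm_sel == "S":
        low = (inst >> 7) & 0x1F         # imm[4:0] = inst[11:7]
    else:
        raise ValueError("imm_sel must be 'I' or 'S' in this sketch")
    return (upper << 11) | (mid << 5) | low

i_imm = imm_gen(0xFCE08793, "I")         # addi x15, x1, -50
s_imm = imm_gen(0x00E12423, "S")         # sw x14, 8(x2)
```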
Implementing Branches
• B-format is mostly same as S-Format, with two register sources (rs1/rs2) and a 12-bit immediate
• But now immediate represents values -4096 to +4094 in 2-byte increments
• The 12 immediate bits encode even 13-bit signed byte offsets (lowest bit of offset is always zero, so no need to store it)
Bits    31       30-25      24-20  19-15  14-12   11-8      7        6-0
Field   imm[12]  imm[10:5]  rs2    rs1    funct3  imm[4:1]  imm[11]  opcode
Width   1        6          5      5      3       4         1        7
Role    offset[12|10:5]     rs2    rs1    funct3  offset[4:1|11]     BRANCH
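Reassembling the scattered B-format immediate is mostly bookkeeping, which a short Python sketch can make explicit. Both `imm_gen_b` and the `encode_b` helper used to build test words are hypothetical names; 0b1100011 is the standard RV32I BRANCH opcode.

```python
def imm_gen_b(inst: int) -> int:
    """Recover the signed 13-bit byte offset from a B-format instruction."""
    imm12 = (inst >> 31) & 1
    imm10_5 = (inst >> 25) & 0x3F
    imm4_1 = (inst >> 8) & 0xF
    imm11 = (inst >> 7) & 1
    offset = (imm12 << 12) | (imm11 << 11) | (imm10_5 << 5) | (imm4_1 << 1)
    if imm12:                       # imm[12] is the sign bit
        offset -= 1 << 13
    return offset                   # bit 0 is always zero

def encode_b(offset: int, rs1: int, rs2: int, funct3: int) -> int:
    """Illustrative inverse: pack an even offset into a B-format word."""
    imm = offset & 0x1FFF
    return (((imm >> 12) & 1) << 31 | ((imm >> 5) & 0x3F) << 25
            | rs2 << 20 | rs1 << 15 | funct3 << 12
            | ((imm >> 1) & 0xF) << 8 | ((imm >> 11) & 1) << 7 | 0b1100011)

word = encode_b(8, rs1=1, rs2=2, funct3=0b000)   # beq x1, x2, +8
offset = imm_gen_b(word)
```

Round-tripping the extreme offsets -4096 and +4094 through these two helpers is a quick check that the stated range matches the bit layout.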
Datapath So Far
[Figure: datapath supporting R-format, I-format arithmetic, lw, and sw. PC, +4 adder, IMEM, Reg[], Imm. Gen (input Inst[31:7]), B mux, ALU, DMEM (addr, DataW, DataR), and a write-back mux selecting between mem and alu to form wb. Control signals: ImmSel, RegWEn, BSel, ALUSel, MemRW, WBSel.]
Branches
• Different change to the state:
– PC = PC + 4, if branch not taken
– PC = PC + immediate, if branch taken
• Six branch instructions: BEQ, BNE, BLT, BGE, BLTU, BGEU
• Need to compute PC + immediate and to compare values of rs1 and rs2
– But have only one ALU – need more hardware
Adding Branches
[Figure: the datapath extended with a Branch Comp block comparing Reg[rs1] and Reg[rs2] (control input BrUn, outputs BrEq and BrLT), an ASel mux in front of the ALU, and a PCSel mux feeding the PC. Control settings for a branch: ImmSel=B, RegWEn=0, BSel=1, ASel=1, ALUSel=Add, MemRW=Read, WBSel=* (*=Don’t care), PCSel=taken/not taken.]
Branch Comparator
• BrEq = 1, if A = B
• BrLT = 1, if A < B
• BrUn = 1 selects unsigned comparison for BrLT, 0 = signed
• BGE branch: A >= B, if !(A < B)
[Figure: Branch Comp block with inputs A and B, control input BrUn, and outputs BrLT and BrEq.]
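The comparator and the resulting taken/not-taken decision can be sketched in Python. This is an illustration only: `branch_comp`, `branch_taken`, and `next_pc` are hypothetical names, and branches are identified by mnemonic strings here rather than by decoding funct3.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def branch_comp(a: int, b: int, br_un: int):
    """Return (BrEq, BrLT); BrUn = 1 makes the less-than comparison unsigned."""
    a &= MASK32
    b &= MASK32
    br_eq = int(a == b)
    br_lt = int(a < b) if br_un else int(to_signed(a) < to_signed(b))
    return br_eq, br_lt

def branch_taken(op: str, a: int, b: int) -> bool:
    """Derive the decision for each of the six branches from BrEq and BrLT."""
    br_un = int(op in ("bltu", "bgeu"))
    br_eq, br_lt = branch_comp(a, b, br_un)
    return {
        "beq": br_eq == 1, "bne": br_eq == 0,
        "blt": br_lt == 1, "bge": br_lt == 0,    # A >= B is !(A < B)
        "bltu": br_lt == 1, "bgeu": br_lt == 0,
    }[op]

def next_pc(pc: int, imm: int, taken: bool) -> int:
    """PCSel: PC + immediate if taken, otherwise PC + 4."""
    return (pc + imm if taken else pc + 4) & MASK32

signed_flags = branch_comp(0xFFFFFFFF, 1, br_un=0)    # -1 < 1 when signed
unsigned_flags = branch_comp(0xFFFFFFFF, 1, br_un=1)  # 4294967295 > 1 when unsigned
```

No separate greater-or-equal output is needed: BGE and BGEU simply use the inverse of BrLT.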
Branch Immediates (In Other ISAs)
• 12-bit immediate encodes PC-relative offset of -4096 to +4094 bytes in multiples of 2 bytes
• Standard approach: Treat immediate as in range -2048..+2047, then shift left by 1 bit to multiply by 2 for branches
Instruction:                    s | imm[10:5] | rs2 | rs1 | funct3 | imm[4:0] | B-opcode
S-Immediate:                    sign-extension | s | imm[10:5] | imm[4:0]
B-Immediate (shift left by 1):  sign-extension | s | imm[10:5] | imm[4:0] | 0
• Each instruction immediate bit can appear in one of two places in output immediate value – so need one 2-way mux per bit