Digital Design and Computer Architecture: RISC-V Edition, Harris & Harris, © 2020 Elsevier
Digital Design and Computer Architecture, RISC-V Edition
Chapter 7
David M. Harris and Sarah L. Harris
Chapter 7 :: Microarchitecture
• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Advanced Microarchitecture
Review: Single-Cycle RISC-V Processor
[Figure: single-cycle RISC-V processor datapath (PC, instruction memory, register file, Extend unit, ALU, data memory) with a control unit that decodes op, funct3, and funct7 bit 5 and generates PCSrc, ResultSrc, MemWrite, ALUControl2:0, ALUSrc, ImmSrc1:0, and RegWrite]
Review: Multicycle RISC-V Processor
[Figure: multicycle RISC-V processor datapath with a combined instruction/data memory, nonarchitectural registers (Instr, OldPC, A, WriteData, Data, ALUOut), and a control unit that decodes op, funct3, and funct7 bit 5 and generates PCWrite, AdrSrc, MemWrite, IRWrite, ResultSrc1:0, ALUControl2:0, ALUSrcA1:0, ALUSrcB1:0, ImmSrc1:0, and RegWrite]
Review: Multicycle Main FSM
Reset enters S0: Fetch.
State           Control outputs
S0: Fetch       AdrSrc = 0, IRWrite, ALUSrcA = 00, ALUSrcB = 10, ALUOp = 00, ResultSrc = 10, PCUpdate
S1: Decode      ALUSrcA = 01, ALUSrcB = 01, ALUOp = 00
S2: MemAdr      ALUSrcA = 10, ALUSrcB = 01, ALUOp = 00
S3: MemRead     ResultSrc = 00, AdrSrc = 1
S4: MemWB       ResultSrc = 01, RegWrite
S5: MemWrite    ResultSrc = 00, AdrSrc = 1, MemWrite
S6: ExecuteR    ALUSrcA = 10, ALUSrcB = 00, ALUOp = 10
S7: ALUWB       ResultSrc = 00, RegWrite
S8: ExecuteI    ALUSrcA = 10, ALUSrcB = 01, ALUOp = 10
S9: JAL         ALUSrcA = 01, ALUSrcB = 10, ALUOp = 00, ResultSrc = 00, PCUpdate
S10: BEQ        ALUSrcA = 10, ALUSrcB = 00, ALUOp = 01, ResultSrc = 00, Branch
Transitions selected by op:
• S1: Decode → S2: MemAdr when op = 0000011 (lw) OR op = 0100011 (sw)
• S1: Decode → S6: ExecuteR when op = 0110011 (R-type)
• S1: Decode → S8: ExecuteI when op = 0010011 (I-type ALU)
• S1: Decode → S9: JAL when op = 1101111 (jal)
• S1: Decode → S10: BEQ when op = 1100011 (beq)
• S2: MemAdr → S3: MemRead when op = 0000011 (lw)
• S2: MemAdr → S5: MemWrite when op = 0100011 (sw)
State       Datapath µOp
Fetch       Instr ← Mem[PC]; PC ← PC+4
Decode      ALUOut ← PCTarget
MemAdr      ALUOut ← rs1 + imm
MemRead     Data ← Mem[ALUOut]
MemWB       rd ← Data
MemWrite    Mem[ALUOut] ← rs2
ExecuteR    ALUOut ← rs1 op rs2
ExecuteI    ALUOut ← rs1 op imm
ALUWB       rd ← ALUOut
BEQ         ALUResult = rs1-rs2; if Zero, PC = ALUOut
JAL         PC = ALUOut; ALUOut = PC+4
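The state sequencing of the main FSM can be sketched as a small next-state function. This is only an illustrative model, not the book's HDL: the state names and opcodes come from the slide, while the unconditional arcs (for example ExecuteR to ALUWB, or JAL to ALUWB) are an assumption about the arrows that did not survive in this text. The helper `cycles_for` is a hypothetical name.

```python
# Opcodes as labeled in the FSM diagram.
LW, SW = 0b0000011, 0b0100011
RTYPE, ITYPE = 0b0110011, 0b0010011
JAL, BEQ = 0b1101111, 0b1100011

# Arcs leaving Decode and MemAdr depend on op.
DECODE_NEXT = {LW: "MemAdr", SW: "MemAdr", RTYPE: "ExecuteR",
               ITYPE: "ExecuteI", JAL: "JAL", BEQ: "BEQ"}
MEMADR_NEXT = {LW: "MemRead", SW: "MemWrite"}

# Assumed unconditional arcs (not spelled out in the slide text).
UNCONDITIONAL = {"Fetch": "Decode", "MemRead": "MemWB",
                 "ExecuteR": "ALUWB", "ExecuteI": "ALUWB", "JAL": "ALUWB",
                 "MemWB": "Fetch", "MemWrite": "Fetch",
                 "ALUWB": "Fetch", "BEQ": "Fetch"}

def next_state(state, op):
    """Return the state that follows `state` for an instruction with opcode `op`."""
    if state == "Decode":
        return DECODE_NEXT[op]
    if state == "MemAdr":
        return MEMADR_NEXT[op]
    return UNCONDITIONAL[state]

def cycles_for(op):
    """Count the cycles from Fetch until the FSM returns to Fetch."""
    state, cycles = "Fetch", 0
    while True:
        state = next_state(state, op)
        cycles += 1
        if state == "Fetch":
            return cycles
```

Under these assumptions, `cycles_for(LW)` walks Fetch, Decode, MemAdr, MemRead, MemWB, so a load takes five cycles, while a branch takes three.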
Advanced Microarchitecture
• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors
Deep Pipelining
• 10-20 stages typical
• Number of stages limited by:
– Pipeline hazards
– Sequencing overhead
– Power
– Cost
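The effect of sequencing overhead on deep pipelines can be illustrated with a rough model. The sketch below assumes the combinational logic divides evenly across stages and that every stage pays a fixed register overhead; it ignores hazards, power, and cost. The function name and the numbers in the usage note are hypothetical.

```python
def cycle_time_ps(logic_delay_ps, stages, overhead_ps):
    """Cycle time when logic is split evenly across `stages` pipeline stages.

    Assumption: each stage adds a fixed sequencing overhead (register
    setup time and clock-to-Q delay), which does not shrink with depth.
    """
    return logic_delay_ps / stages + overhead_ps

def speedup_over_unpipelined(logic_delay_ps, stages, overhead_ps):
    """Ideal clock-rate speedup relative to a single-stage design."""
    return (cycle_time_ps(logic_delay_ps, 1, overhead_ps)
            / cycle_time_ps(logic_delay_ps, stages, overhead_ps))
```

With 1000 ps of logic and 50 ps of overhead, 10 stages give a 150 ps cycle and 20 stages give 100 ps: doubling the depth shortens the cycle by only a third, because the overhead term stays constant.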
Micro-operations
• Decompose more complex instructions into a series of simple instructions called micro-operations (micro-ops or µ-ops)
• At run-time, complex instructions are decoded into one or more micro-ops
• Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)
Complex Op                     Micro-op Sequence
lw s1, 0(s2), postincr 4       lw s1, 0(s2)
                               addi s2, s2, 4
• Without µ-ops, would need 2nd write port on the register file
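The decode step for the post-increment load above can be sketched as a function that returns the micro-op sequence. This is a toy model: the function name and the string representation of micro-ops are hypothetical, and real decoders emit internal control words rather than assembly text.

```python
def expand_postincr_load(rd, base, offset, increment):
    """Decode a hypothetical `lw rd, offset(base), postincr increment`
    into two simple micro-ops, each writing only one register."""
    return [
        f"lw {rd}, {offset}({base})",           # load writes rd
        f"addi {base}, {base}, {increment}",    # base update writes base
    ]
```

Calling `expand_postincr_load("s1", "s2", 0, 4)` yields the two-instruction sequence shown in the table, so each micro-op needs only the single register-file write port.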
Branch Prediction
• Guess whether branch will be taken
– Backward branches are usually taken (loops)
– Consider history to improve guess
• Good prediction reduces fraction of branches requiring a flush
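A back-of-the-envelope model shows why fewer flushes matter. The sketch assumes a base CPI of 1 and that every mispredicted branch costs a fixed number of flushed cycles; the function name, parameters, and sample numbers are hypothetical and not taken from the slides.

```python
def cpi_with_mispredictions(branch_fraction, mispredict_rate,
                            flush_penalty_cycles, base_cpi=1.0):
    """Average CPI when only mispredicted branches add flush cycles.

    Assumption: all other hazards are ignored, so the base CPI is 1.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles
```

For instance, if 20% of instructions are branches and each flush costs 2 cycles, lowering the misprediction rate from 50% to 10% moves the modeled CPI from about 1.2 to about 1.04.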
Branch Prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
– Check direction of branch (forward or backward)
– If backward, predict taken
– Else, predict not taken
• Dynamic branch prediction:
– Keep history of last several hundred (or thousand) branches in branch target buffer, record:
  • Branch destination
  • Whether branch was taken
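The static rule above reduces to a single address comparison. The following is a minimal sketch; the function name and the addresses in the usage note are made up for illustration.

```python
def predict_taken_static(branch_pc, target_pc):
    """Static prediction: backward branches (target below the branch
    address) are predicted taken; forward branches are predicted not taken."""
    return target_pc < branch_pc
```

A loop-closing branch at hypothetical address 0x100 that jumps back to 0x0F0 is predicted taken, while a forward branch from 0x100 to 0x120 is predicted not taken.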
Branch Prediction Example
      addi s1, zero, 0    # s1 = sum
      addi s0, zero, 0    # s0 = i
      addi t0, zero, 10   # t0 = 10
For:                      # for (i=0; i<10; i=i+1)
      bge  s0, t0, Done
      add  s1, s1, s0     # sum = sum + i
      addi s0, s0, 1      # i = i + 1
      j    For
Done:
1-Bit Branch Predictor
• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
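The behavior of a 1-bit predictor on the example loop can be checked with a short simulation. This sketch models only the `bge s0, t0, Done` branch, which is not taken for ten iterations and taken once on exit; the function names and the initial predictor state are assumptions.

```python
def loop_branch_outcomes(iterations=10):
    """Outcomes of `bge s0, t0, Done`: not taken while i < 10, then taken."""
    return [False] * iterations + [True]

def run_one_bit(outcomes, last_taken=False):
    """Predict each branch to repeat its previous outcome.

    Returns the misprediction count and the final predictor bit.
    """
    mispredicts = 0
    for taken in outcomes:
        if last_taken != taken:
            mispredicts += 1
        last_taken = taken  # remember only the most recent outcome
    return mispredicts, last_taken

# Run the loop twice so the second pass starts from a warmed-up predictor.
first_pass, state = run_one_bit(loop_branch_outcomes())
second_pass, state = run_one_bit(loop_branch_outcomes(), state)
```

In the second pass the predictor still remembers the loop exit, so it mispredicts both the first and the last branch of the loop.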
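2-Bit Branch Predictor
• Only mispredicts last branch of loop
• Four states:
– strongly taken: predict taken
– weakly taken: predict taken
– weakly not taken: predict not taken
– strongly not taken: predict not taken
• A taken branch moves the predictor one state toward strongly taken; a not-taken branch moves it one state toward strongly not taken
[Figure: state transition diagram of the 2-bit branch predictor]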
Superscalar
• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once
[Figure: two-way superscalar datapath: PC, instruction memory fetching two instructions per cycle, register file with read ports RD1, RD2, RD4, RD5 and write ports WD3, WD6, two ALUs, and a two-ported data memory]
Superscalar Example
      LDR R8, [R0, #40]
      ADD R9, R1, R2
      SUB R10, R1, R3
      AND R11, R3, R4
      ORR R12, R1, R5
      STR R5, [R0, #80]
Ideal IPC: 2
Actual IPC: 2
[Figure: pipeline diagram over cycles 1-8; the six instructions issue two per cycle as the pairs LDR/ADD, SUB/AND, and ORR/STR]
Superscalar with Dependencies
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB R8, R2, R3
      AND R10, R4, R8
      ORR R11, R5, R6
      STR R7, [R11, #80]
Ideal IPC: 2
Actual IPC: 6/5 = 1.2
[Figure: pipeline diagram over cycles 1-8; dependencies force stalls, so the six instructions take five cycles to issue]
Out of Order Processor
• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as no dependencies)
• Dependencies:
– RAW (read after write): one instruction writes, later instruction reads a register
– WAR (write after read): one instruction reads, later instruction writes a register
– WAW (write after write): one instruction writes, later instruction writes a register
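The three dependency types can be detected by comparing the destination and source registers of two instructions. The sketch below is a pairwise check only (it ignores intervening writes); the tuple representation and function name are assumptions, and the sample instructions are taken from the six-instruction example program used in these slides.

```python
def hazards(earlier, later):
    """Classify dependencies between two instructions in program order.

    Each instruction is (dest, sources); dest is None when the
    instruction writes no register (for example a store).
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest is not None and e_dest in l_srcs:
        found.add("RAW")   # later reads what earlier writes
    if l_dest is not None and l_dest in e_srcs:
        found.add("WAR")   # later writes what earlier reads
    if e_dest is not None and e_dest == l_dest:
        found.add("WAW")   # both write the same register
    return found

LDR = ("R8", {"R0"})           # LDR R8, [R0, #40]
ADD = ("R9", {"R8", "R1"})     # ADD R9, R8, R1
SUB = ("R8", {"R2", "R3"})     # SUB R8, R2, R3
AND = ("R10", {"R4", "R8"})    # AND R10, R4, R8
ORR = ("R11", {"R5", "R6"})    # ORR R11, R5, R6
STR = (None, {"R7", "R11"})    # STR R7, [R11, #80]
```

Here `hazards(LDR, ADD)` reports a RAW dependency on R8, `hazards(ADD, SUB)` a WAR dependency on R8, and `hazards(LDR, SUB)` a WAW dependency on R8.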
Out of Order Processor
• Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
• Scoreboard: table that keeps track of:
– Instructions waiting to issue
– Available functional units
– Dependencies
Out of Order Processor Example
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB R8, R2, R3
      AND R10, R4, R8
      ORR R11, R5, R6
      STR R7, [R11, #80]
Ideal IPC: 2
Actual IPC: 6/4 = 1.5
[Figure: pipeline diagram over cycles 1-8 with instructions issued out of order. Annotated dependencies: RAW from LDR to ADD on R8 (two-cycle latency between load and use of R8), WAR from ADD to SUB on R8, RAW from SUB to AND on R8, and RAW from ORR to STR on R11]
Register Renaming
Original program:
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB R8, R2, R3
      AND R10, R4, R8
      ORR R11, R5, R6
      STR R7, [R11, #80]
After renaming R8 to T0 in SUB and AND:
      LDR R8, [R0, #40]
      ADD R9, R8, R1
      SUB T0, R2, R3
      AND R10, R4, T0
      ORR R11, R5, R6
      STR R7, [R11, #80]
Ideal IPC: 2
Actual IPC: 6/3 = 2
[Figure: pipeline diagram over cycles 1-7 for the renamed program. Annotated dependencies: 2-cycle RAW from LDR to ADD on R8, RAW from SUB to AND on T0, and RAW from ORR to STR on R11]
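The renaming step can be sketched as a single pass over the program that gives a destination a fresh temporary whenever an earlier instruction already read or wrote that register. This is a simplified illustration, not how hardware rename tables are built: the tuple format, the `T0, T1, ...` naming, and the rename-only-on-conflict policy are assumptions chosen to reproduce the slide's result.

```python
def rename(program):
    """Remove WAR and WAW dependencies by renaming conflicting destinations.

    Each instruction is (op, dest, sources); dest is None for stores.
    """
    seen = set()      # architectural registers touched by earlier instructions
    mapping = {}      # architectural register -> current renamed register
    renamed = []
    next_temp = 0
    for op, dest, srcs in program:
        # Sources read the most recent name of each register.
        new_srcs = tuple(mapping.get(r, r) for r in srcs)
        new_dest = dest
        if dest is not None and dest in seen:
            new_dest = f"T{next_temp}"   # fresh name breaks WAR/WAW on dest
            next_temp += 1
            mapping[dest] = new_dest
        seen.update(srcs)
        if dest is not None:
            seen.add(dest)
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("LDR", "R8", ("R0",)),
    ("ADD", "R9", ("R8", "R1")),
    ("SUB", "R8", ("R2", "R3")),
    ("AND", "R10", ("R4", "R8")),
    ("ORR", "R11", ("R5", "R6")),
    ("STR", None, ("R7", "R11")),
]
renamed_program = rename(program)
```

In `renamed_program`, SUB writes T0 and AND reads T0, so SUB no longer has to wait for ADD to read the old value of R8; the true RAW dependencies are left untouched.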
SIMD
• Single Instruction Multiple Data (SIMD)
– Single instruction acts on multiple pieces of data at once
– Common application: graphics
– Perform short arithmetic operations (also called packed arithmetic)
• For example, add eight 8-bit elements
[Figure: 64-bit registers D0 = a7..a0 and D1 = b7..b0 are added element by element into D2 = (a7 + b7)..(a0 + b0); element i occupies bit positions 8i+7 down to 8i, from bit 63 on the left to bit 0 on the right]
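The packed addition of eight 8-bit elements can be emulated on ordinary integers. In this sketch each lane wraps around modulo 256 and carries never cross lane boundaries; the wraparound behavior and the function name are assumptions, since the slide does not say whether the sum wraps or saturates.

```python
def packed_add8(a, b, lanes=8):
    """Add two 64-bit values as eight independent 8-bit lanes."""
    result = 0
    for i in range(lanes):
        shift = 8 * i
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        # Mask to 8 bits so a carry out of one lane is dropped
        # instead of spilling into the next lane.
        result |= (lane_sum & 0xFF) << shift
    return result
```

For example, adding 0x01FF and 0x0101 gives 0x0200: the low lane overflows from 0xFF to 0x00 without disturbing the lane above it.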
Advanced Architecture Techniques
• Multithreading
– Word processor: threads for typing, spell checking, printing
• Multiprocessors
– Multiple processors (cores) on a single chip
Threading: Definitions
• Process: program running on a computer
– Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper
• Thread: part of a program
– Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing
Threads in Conventional Processor
• Only one thread runs at a time
• When one thread stalls (for example, waiting for memory):
– Architectural state of that thread stored
– Architectural state of waiting thread loaded into processor and it runs
– Called context switching
• Appears to user like all threads running simultaneously
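The context switch described above amounts to copying architectural state out of the processor and copying another thread's state in. The sketch below is a toy model that treats architectural state as just a PC and 32 registers; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ArchState:
    """Toy architectural state: program counter plus 32 registers."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)

def context_switch(cpu, stalled, waiting):
    """Store the running thread's state, then load the waiting thread's state."""
    stalled.pc, stalled.regs = cpu.pc, list(cpu.regs)   # save stalled thread
    cpu.pc, cpu.regs = waiting.pc, list(waiting.regs)   # resume waiting thread
```

After the call, the processor continues from the waiting thread's PC, and the stalled thread can later be resumed from the copy saved in `stalled`.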
Multithreading
• Multiple copies of architectural state
• Multiple threads active at once:
– When one thread stalls, another runs immediately
– If one thread can’t keep all execution units busy, another thread can use them
• Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput
• Intel calls this “hyperthreading”
Multiprocessors
• Multiple processors (cores) with a method of communication between them
• Types:
– Homogeneous: multiple cores with shared main memory
– Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
– Clusters: each core has own memory system