David M. Harris and Sarah L. Harris, pages.hmc.edu/harris/class/e85/lect23.pdf, Digital Design and...

Digital Design and Computer Architecture: RISC-V Edition, Harris & Harris, © 2020 Elsevier

Digital Design and Computer Architecture, RISC-V Edition

Chapter 7

David M. Harris and Sarah L. Harris


Chapter 7 :: Microarchitecture

• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Advanced Microarchitecture


Review: Single-Cycle RISC-V Processor

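[Figure: single-cycle RISC-V processor datapath and control. Blocks: PC register, Instruction Memory, Register File (A1, A2, A3, WD3, RD1, RD2, WE3), Extend, ALU, Data Memory, two adders (producing PCPlus4 and PCTarget), and Control Unit. Instruction fields: op = Instr[6:0], funct3 = Instr[14:12], funct7 bit 5 = Instr[30], A1 = Instr[19:15], A2 = Instr[24:20], A3 = Instr[11:7], immediate = Instr[31:7]. Control signals: PCSrc, ResultSrc, MemWrite, ALUControl[2:0], ALUSrc, ImmSrc[1:0], RegWrite. The ALU returns Zero to the Control Unit.]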


Review: Multicycle RISC-V Processor

[Figure: multicycle RISC-V processor datapath and control. Blocks: PC register, shared Instr / Data Memory, Register File, Extend, ALU, and Control Unit, with nonarchitectural registers OldPC, Instr, Data, A, WriteData, and ALUOut. Instruction fields: op = Instr[6:0], funct3 = Instr[14:12], funct7 bit 5 = Instr[30], Rs1 = Instr[19:15], Rs2 = Instr[24:20], Rd = Instr[11:7], immediate = Instr[31:7]. Control signals: PCWrite, AdrSrc, MemWrite, IRWrite, ResultSrc[1:0], ALUControl[2:0], ALUSrcB[1:0], ALUSrcA[1:0], ImmSrc[1:0], RegWrite. The ALU returns Zero to the Control Unit.]


Review: Multicycle Main FSM

Reset: go to S0

S0: Fetch: AdrSrc = 0, IRWrite, ALUSrcA = 00, ALUSrcB = 10, ALUOp = 00, ResultSrc = 10, PCUpdate. Next: S1.
S1: Decode: ALUSrcA = 01, ALUSrcB = 01, ALUOp = 00. Next:
  – op = 0000011 (lw) OR op = 0100011 (sw): S2
  – op = 0110011 (R-type): S6
  – op = 0010011 (I-type ALU): S8
  – op = 1101111 (jal): S9
  – op = 1100011 (beq): S10
S2: MemAdr: ALUSrcA = 10, ALUSrcB = 01, ALUOp = 00. Next:
  – op = 0000011 (lw): S3
  – op = 0100011 (sw): S5
S3: MemRead: ResultSrc = 00, AdrSrc = 1. Next: S4.
S4: MemWB: ResultSrc = 01, RegWrite. Next: S0.
S5: MemWrite: ResultSrc = 00, AdrSrc = 1, MemWrite. Next: S0.
S6: ExecuteR: ALUSrcA = 10, ALUSrcB = 00, ALUOp = 10. Next: S7.
S7: ALUWB: ResultSrc = 00, RegWrite. Next: S0.
S8: ExecuteI: ALUSrcA = 10, ALUSrcB = 01, ALUOp = 10. Next: S7.
S9: JAL: ALUSrcA = 01, ALUSrcB = 10, ALUOp = 00, ResultSrc = 00, PCUpdate. Next: S7.
S10: BEQ: ALUSrcA = 10, ALUSrcB = 00, ALUOp = 01, ResultSrc = 00, Branch. Next: S0.

State       Datapath µOp
Fetch       Instr ← Mem[PC]; PC ← PC+4
Decode      ALUOut ← PCTarget
MemAdr      ALUOut ← rs1 + imm
MemRead     Data ← Mem[ALUOut]
MemWB       rd ← Data
MemWrite    Mem[ALUOut] ← rs2
ExecuteR    ALUOut ← rs1 op rs2
ExecuteI    ALUOut ← rs1 op imm
ALUWB       rd ← ALUOut
BEQ         ALUResult = rs1-rs2; if Zero, PC = ALUOut
JAL         PC = ALUOut; ALUOut = PC+4
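
The state sequence each instruction follows can be sketched as a small next-state function. This is an illustrative Python model, not the book's HDL. The opcode values and state names come from the FSM above. The unconditional transitions (for example MemRead to MemWB, and JAL to ALUWB) are filled in on the assumption that they follow the standard multicycle design, and the helper names `next_state` and `states_for` are hypothetical.

```python
# Opcodes taken from the Decode-state transition labels.
OPCODES = {
    "lw": 0b0000011, "sw": 0b0100011, "R-type": 0b0110011,
    "I-type ALU": 0b0010011, "jal": 0b1101111, "beq": 0b1100011,
}

# After Decode, the opcode selects the first execution state.
DECODE_NEXT = {
    0b0000011: "MemAdr", 0b0100011: "MemAdr", 0b0110011: "ExecuteR",
    0b0010011: "ExecuteI", 0b1101111: "JAL", 0b1100011: "BEQ",
}

# Unconditional transitions (assumed to match the standard multicycle FSM).
FIXED_NEXT = {
    "Fetch": "Decode", "MemRead": "MemWB", "MemWB": "Fetch",
    "MemWrite": "Fetch", "ExecuteR": "ALUWB", "ExecuteI": "ALUWB",
    "ALUWB": "Fetch", "JAL": "ALUWB", "BEQ": "Fetch",
}


def next_state(state, op):
    """Return the state that follows `state` for an instruction with opcode `op`."""
    if state == "Decode":
        return DECODE_NEXT[op]
    if state == "MemAdr":
        return "MemRead" if op == OPCODES["lw"] else "MemWrite"
    return FIXED_NEXT[state]


def states_for(op):
    """List the states one instruction passes through, starting at Fetch."""
    path, state = ["Fetch"], next_state("Fetch", op)
    while state != "Fetch":
        path.append(state)
        state = next_state(state, op)
    return path


for name, op in OPCODES.items():
    path = states_for(op)
    print(f"{name:10s} {len(path)} cycles: {' -> '.join(path)}")
```

Under these assumptions the path length is the cycle count per instruction: lw takes five states, beq three, and the rest four.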


Advanced Microarchitecture

• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors


Deep Pipelining

• 10-20 stages typical
• Number of stages limited by:
  – Pipeline hazards
  – Sequencing overhead
  – Power
  – Cost
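
To see why sequencing overhead limits pipeline depth, here is a rough sketch. It assumes the combinational logic divides evenly across the stages and that each stage pays a fixed register overhead, so cycle time is logic delay / stages + overhead. The delay values are made-up illustrations and the model ignores hazards, power, and cost.

```python
def cycle_time_ps(logic_delay_ps, stages, overhead_ps):
    """Cycle time if the logic splits evenly and each stage adds register overhead."""
    return logic_delay_ps / stages + overhead_ps


LOGIC_PS = 1000     # assumed total combinational delay of the datapath
OVERHEAD_PS = 50    # assumed sequencing overhead (setup + clock-to-Q) per stage

unpipelined = cycle_time_ps(LOGIC_PS, 1, OVERHEAD_PS)
for n in (1, 5, 10, 20, 40):
    tc = cycle_time_ps(LOGIC_PS, n, OVERHEAD_PS)
    print(f"{n:2d} stages: Tc = {tc:6.1f} ps, speedup = {unpipelined / tc:4.1f}x")
```

With these numbers, 20 stages give about a 10.5x faster clock rather than 20x, and the cycle time can never drop below the 50 ps overhead however many stages are added.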


Micro-operations

• Decompose more complex instructions into a series of simple instructions called micro-operations (micro-ops or µ-ops)
• At run-time, complex instructions are decoded into one or more micro-ops
• Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)

Complex Op                     Micro-op Sequence
lw s1, 0(s2), postincr 4       lw   s1, 0(s2)
                               addi s2, s2, 4

Without µ-ops, would need 2nd write port on the register file
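
The decode step can be sketched as a function that expands the post-increment load from the table into its two micro-ops. This is illustrative only: the dictionary encoding of the instruction and the name `decode_to_micro_ops` are hypothetical, and a real decoder works on instruction bits rather than strings.

```python
def decode_to_micro_ops(instr):
    """Split a (hypothetical) post-increment load into simple micro-ops.

    `instr` is a dict such as
    {"op": "lw", "rd": "s1", "base": "s2", "offset": 0, "postincr": 4}.
    An instruction without "postincr" passes through as a single micro-op.
    """
    load = f'{instr["op"]} {instr["rd"]}, {instr["offset"]}({instr["base"]})'
    micro_ops = [load]
    step = instr.get("postincr")
    if step:
        # The base-register update becomes its own micro-op, so each
        # micro-op writes at most one register.
        micro_ops.append(f'addi {instr["base"]}, {instr["base"]}, {step}')
    return micro_ops


complex_op = {"op": "lw", "rd": "s1", "base": "s2", "offset": 0, "postincr": 4}
print(decode_to_micro_ops(complex_op))
```

Because each micro-op writes a single register, the register file needs only one write port.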


Branch Prediction

• Guess whether branch will be taken
  – Backward branches are usually taken (loops)
  – Consider history to improve guess
• Good prediction reduces fraction of branches requiring a flush


Branch Prediction

• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
  – Check direction of branch (forward or backward)
  – If backward, predict taken
  – Else, predict not taken
• Dynamic branch prediction:
  – Keep history of last several hundred (or thousand) branches in branch target buffer, record:
    • Branch destination
    • Whether branch was taken
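
The static rule above fits in one comparison. This is a minimal sketch: the function name and the addresses are hypothetical, chosen only to show one backward and one forward branch.

```python
def predict_taken_static(branch_pc, target_pc):
    """Static prediction: backward branches (target below the branch) are predicted taken."""
    return target_pc < branch_pc


# Hypothetical addresses for illustration.
loop_back = predict_taken_static(branch_pc=0x1020, target_pc=0x1000)     # backward branch
skip_ahead = predict_taken_static(branch_pc=0x1020, target_pc=0x1040)    # forward branch
print(loop_back, skip_ahead)
```

A static predictor needs no history storage, which is why it is the simpler of the two schemes. A dynamic predictor replaces the comparison with a lookup keyed by the branch address.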


Branch Prediction Example

      addi s1, zero, 0      # s1 = sum
      addi s0, zero, 0      # s0 = i
      addi t0, zero, 10     # t0 = 10
For:                        # for (i=0; i<10; i=i+1)
      bge  s0, t0, Done
      add  s1, s1, s0       # sum = sum + i
      addi s0, s0, 1        # i = i + 1
      j    For
Done:


1-Bit Branch Predictor

• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
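
A short simulation shows where the two mispredictions come from. It models the `bge s0, t0, Done` branch of the loop example, which is not taken for i = 0..9 and taken when i = 10. It assumes the loop has already run once, so the predictor still remembers the previous exit. The function name and the initial state are illustrative choices.

```python
def run_one_bit(outcomes, last_taken=False):
    """Return (indices of mispredicted branches, final predictor state)."""
    misses = []
    for i, taken in enumerate(outcomes):
        if taken != last_taken:      # prediction is simply the previous outcome
            misses.append(i)
        last_taken = taken           # remember what just happened
    return misses, last_taken


# Outcomes of bge for one pass through the loop: ten not-taken, then taken.
loop = [False] * 10 + [True]

_, state = run_one_bit(loop)             # first pass warms up the predictor
misses, _ = run_one_bit(loop, state)     # second pass through the same loop
print(misses)  # [0, 10]
```

The first branch mispredicts because the predictor remembers the previous exit, and the last mispredicts because it remembers the previous iteration.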


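2-Bit Branch Predictor

• Only mispredicts last branch of loop

[Figure: state transition diagram with four states in a row: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), strongly not taken (predict not taken). Each taken branch moves one state toward strongly taken and each not-taken branch moves one state toward strongly not taken; the two end states loop back to themselves.]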
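
The four-state predictor behaves like a saturating 2-bit counter, which the following sketch simulates on the same loop branch (`bge`: ten not-taken outcomes, then one taken). The numeric state encoding, the function name, and the weakly-not-taken starting state are assumptions made for illustration.

```python
STRONG_NOT, WEAK_NOT, WEAK_TAKEN, STRONG_TAKEN = range(4)


def run_two_bit(outcomes, state=WEAK_NOT):
    """Return (indices of mispredicted branches, final predictor state)."""
    misses = []
    for i, taken in enumerate(outcomes):
        predict_taken = state >= WEAK_TAKEN
        if predict_taken != taken:
            misses.append(i)
        # Saturating counter: step toward the observed outcome.
        if taken:
            state = min(state + 1, STRONG_TAKEN)
        else:
            state = max(state - 1, STRONG_NOT)
    return misses, state


loop = [False] * 10 + [True]

_, state = run_two_bit(loop)             # first pass through the loop
misses, _ = run_two_bit(loop, state)     # second pass, starting from the saved state
print(misses)  # [10]
```

The single taken outcome at the loop exit only moves the counter to weakly not taken, so the next pass still predicts not taken on its first branch.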


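Superscalar

• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once

[Figure: two-way superscalar datapath. The PC addresses an Instruction Memory that fetches two instructions per cycle; the Register File has six ports (addresses A1 to A6, read data RD1, RD2, RD4, RD5, write data WD3, WD6); there are two ALUs; and the Data Memory has two ports (A1, A2, WD1, WD2, RD1, RD2).]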


Superscalar Example

LDR R8, [R0, #40]
ADD R9, R1, R2
SUB R10, R1, R3
AND R11, R3, R4
ORR R12, R1, R5
STR R5, [R0, #80]

Ideal IPC: 2
Actual IPC: 2

[Figure: pipeline diagram, time in cycles 1-8. The instructions issue in pairs (LDR with ADD, SUB with AND, ORR with STR), and each pair moves through IM, RF, ALU, and DM together.]


Superscalar with Dependencies

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/5 = 1.2

[Figure: pipeline diagram, time in cycles 1-9, showing the six instructions moving through IM, RF, ALU, and DM, with a stall while ADD waits for the R8 value loaded by LDR.]
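
The two IPC figures can be reproduced with a small in-order, two-way issue model. This is a sketch under stated assumptions: a dependent instruction can issue two cycles after a load and one cycle after any other instruction, only RAW hazards stall, and IPC is counted over issue cycles. The tuple encoding and the function name are hypothetical.

```python
LOAD_USE_LATENCY = 2   # assumed: a load's consumer issues two cycles after the load
ALU_LATENCY = 1        # assumed: other results forward to the next cycle


def issue_in_order(program, width=2):
    """Return the issue cycle of each instruction on an in-order machine.

    `program` is a list of (opcode, dest, sources); dest is None for stores.
    """
    ready = {}               # register -> first cycle in which a reader may issue
    issue_cycles = []
    cycle, used = 1, 0       # current issue cycle and slots already used in it
    for op, dest, srcs in program:
        operands_ready = max((ready.get(r, 1) for r in srcs), default=1)
        if used == width:                # current cycle is full
            cycle, used = cycle + 1, 0
        if operands_ready > cycle:       # RAW hazard: stall until operands arrive
            cycle, used = operands_ready, 0
        issue_cycles.append(cycle)
        used += 1
        if dest is not None:
            latency = LOAD_USE_LATENCY if op == "LDR" else ALU_LATENCY
            ready[dest] = cycle + latency
    return issue_cycles


independent = [
    ("LDR", "R8", ["R0"]), ("ADD", "R9", ["R1", "R2"]),
    ("SUB", "R10", ["R1", "R3"]), ("AND", "R11", ["R3", "R4"]),
    ("ORR", "R12", ["R1", "R5"]), ("STR", None, ["R5", "R0"]),
]
dependent = [
    ("LDR", "R8", ["R0"]), ("ADD", "R9", ["R8", "R1"]),
    ("SUB", "R8", ["R2", "R3"]), ("AND", "R10", ["R4", "R8"]),
    ("ORR", "R11", ["R5", "R6"]), ("STR", None, ["R7", "R11"]),
]

for name, program in (("independent", independent), ("dependent", dependent)):
    cycles = issue_in_order(program)
    print(name, cycles, "IPC =", len(program) / cycles[-1])
```

Under these assumptions the independent program issues in 3 cycles (IPC 2) and the dependent one in 5 cycles (IPC 1.2), matching the figures above.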


Out of Order Processor

• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as no dependencies)
• Dependencies:
  – RAW (read after write): one instruction writes, later instruction reads a register
  – WAR (write after read): one instruction reads, later instruction writes a register
  – WAW (write after write): one instruction writes, later instruction writes a register
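
The three definitions translate directly into set membership tests. The sketch below classifies the dependencies between an earlier and a later instruction, using the first three instructions of the dependency example (LDR, ADD, SUB). The (dest, sources) encoding and the function name are illustrative assumptions.

```python
def hazards(earlier, later):
    """Return the hazard types between two instructions in program order.

    Each instruction is (dest, sources); dest is None if nothing is written.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest is not None and e_dest in l_srcs:
        found.add("RAW")     # earlier writes, later reads
    if l_dest is not None and l_dest in e_srcs:
        found.add("WAR")     # earlier reads, later writes
    if e_dest is not None and e_dest == l_dest:
        found.add("WAW")     # both write the same register
    return found


ldr = ("R8", ["R0"])          # LDR R8, [R0, #40]
add = ("R9", ["R8", "R1"])    # ADD R9, R8, R1
sub = ("R8", ["R2", "R3"])    # SUB R8, R2, R3

print(hazards(ldr, add), hazards(add, sub), hazards(ldr, sub))
```

Only RAW reflects a true flow of data. WAR and WAW arise from reusing a register name, which is why renaming can remove them.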


Out of Order Processor

• Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
• Scoreboard: table that keeps track of:
  – Instructions waiting to issue
  – Available functional units
  – Dependencies


Out of Order Processor Example

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/4 = 1.5

[Figure: pipeline diagram, time in cycles 1-8, with the instructions issued out of order: LDR and ORR first, then STR, then ADD and SUB, then AND. Annotated hazards: RAW on R8 between LDR and ADD (two cycle latency between load and use of R8), WAR on R8 between ADD and SUB, RAW on R8 between SUB and AND, and RAW on R11 between ORR and STR.]
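
A greedy scheduler that issues the oldest ready instructions each cycle reproduces the 6/4 figure. This is a simplified sketch, not a scoreboard implementation. It assumes a two-cycle load-use latency and one cycle for everything else, that a WAR-dependent instruction may issue in the same cycle as the instruction it follows but not earlier, and that WAW-dependent instructions issue in separate cycles. The names and tuple encoding are hypothetical.

```python
LATENCY = {"LDR": 2}   # assumed: load-use takes 2 cycles, everything else 1


def issue_out_of_order(program, width=2):
    """Greedy out-of-order issue, oldest ready instruction first.

    `program` is a list of (opcode, dest, sources). Returns each issue cycle.
    """
    issued = [None] * len(program)
    cycle = 0
    while None in issued:
        cycle += 1
        slots = width
        for i, (op, dest, srcs) in enumerate(program):
            if issued[i] is not None or slots == 0:
                continue
            ok = True
            for j in range(i):                      # compare with every older instruction
                j_op, j_dest, j_srcs = program[j]
                raw = j_dest is not None and j_dest in srcs
                war = dest is not None and dest in j_srcs
                waw = dest is not None and dest == j_dest
                if raw and (issued[j] is None
                            or issued[j] + LATENCY.get(j_op, 1) > cycle):
                    ok = False                      # operand not produced yet
                if war and issued[j] is None:
                    ok = False                      # older reader has not issued
                if waw and (issued[j] is None or issued[j] == cycle):
                    ok = False                      # older writer must issue first
            if ok:
                issued[i] = cycle
                slots -= 1
    return issued


program = [
    ("LDR", "R8", ["R0"]), ("ADD", "R9", ["R8", "R1"]),
    ("SUB", "R8", ["R2", "R3"]), ("AND", "R10", ["R4", "R8"]),
    ("ORR", "R11", ["R5", "R6"]), ("STR", None, ["R7", "R11"]),
]
cycles = issue_out_of_order(program)
print(cycles, "IPC =", len(program) / max(cycles))
```

Under these assumptions ORR and STR slip ahead of the stalled ADD, so all six instructions issue within four cycles.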


Register Renaming

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/3 = 2

[Figure: pipeline diagram, time in cycles 1-7, in which SUB and AND use the renamed register T0 (SUB T0, R2, R3 and AND R10, R4, T0). LDR and SUB issue first, then AND and ORR, then ADD and STR. Annotated hazards: 2-cycle RAW on R8 between LDR and ADD, RAW on T0 between SUB and AND, and RAW on R11 between ORR and STR.]
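
The renaming step itself can be sketched as a single pass over the program. This simplified version gives a destination a fresh temporary only when an older instruction already reads or writes that register, which is enough to reproduce the T0 substitution in the example. Real renaming hardware typically maps every destination to a physical register, and the temporary names and function name here are assumptions.

```python
def rename(program, temps=("T0", "T1", "T2", "T3")):
    """Rename a destination when an older instruction already reads or writes it.

    `program` is a list of (opcode, dest, sources). Returns the renamed program.
    """
    free = list(temps)
    mapping = {}         # architectural register -> current renamed register
    touched = set()      # architectural registers read or written so far
    renamed = []
    for op, dest, srcs in program:
        # Sources read the most recent name of each register (true RAW dependencies stay).
        new_srcs = [mapping.get(r, r) for r in srcs]
        new_dest = dest
        if dest is not None and dest in touched:
            new_dest = free.pop(0)       # WAR or WAW with an older instruction
            mapping[dest] = new_dest     # raises IndexError if temps run out
        touched.update(srcs)
        if dest is not None:
            touched.add(dest)
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("LDR", "R8", ["R0"]), ("ADD", "R9", ["R8", "R1"]),
    ("SUB", "R8", ["R2", "R3"]), ("AND", "R10", ["R4", "R8"]),
    ("ORR", "R11", ["R5", "R6"]), ("STR", None, ["R7", "R11"]),
]
for op, dest, srcs in rename(program):
    print(op, dest, srcs)
```

After renaming, SUB no longer conflicts with the LDR and ADD uses of R8, so it can issue in the first cycle alongside LDR.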


SIMD

• Single Instruction Multiple Data (SIMD)
  – Single instruction acts on multiple pieces of data at once
  – Common application: graphics
  – Perform short arithmetic operations (also called packed arithmetic)
• For example, add eight 8-bit elements

Bit position   63:56     55:48     47:40     39:32     31:24     23:16     15:8      7:0
D0             a7        a6        a5        a4        a3        a2        a1        a0
D1             b7        b6        b5        b4        b3        b2        b1        b0
D2             a7 + b7   a6 + b6   a5 + b5   a4 + b4   a3 + b3   a2 + b2   a1 + b1   a0 + b0
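
Packed addition of eight 8-bit elements can be modeled in software by treating a 64-bit word as eight lanes. This sketch assumes each lane wraps around modulo 256 (real SIMD instruction sets often also offer saturating variants), and the helper names are hypothetical.

```python
LANES, LANE_BITS = 8, 8
LANE_MASK = (1 << LANE_BITS) - 1


def pack(elements):
    """Pack eight 8-bit values into a 64-bit word, element 0 in bits 7:0."""
    word = 0
    for i, value in enumerate(elements):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word into its eight 8-bit lanes, lane 0 first."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add8(d0, d1):
    """Add lane by lane; a carry out of one lane never reaches the next."""
    return pack([a + b for a, b in zip(unpack(d0), unpack(d1))])


d0 = pack([1, 2, 3, 4, 5, 6, 7, 255])
d1 = pack([10, 10, 10, 10, 10, 10, 10, 1])
print(unpack(packed_add8(d0, d1)))  # [11, 12, 13, 14, 15, 16, 17, 0]
```

The model processes the lanes one at a time in a loop, whereas SIMD hardware performs all eight additions in parallel within a single instruction.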


Advanced Architecture Techniques

• Multithreading
  – Word processor: thread for typing, spell checking, printing
• Multiprocessors
  – Multiple processors (cores) on a single chip


Threading: Definitions

• Process: program running on a computer
  – Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
• Thread: part of a program
  – Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing


Threads in Conventional Processor

• One thread runs at once
• When one thread stalls (for example, waiting for memory):
  – Architectural state of that thread stored
  – Architectural state of waiting thread loaded into processor and it runs
  – Called context switching
• Appears to user like all threads running simultaneously
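
A context switch amounts to saving one thread's architectural state and loading another's. The sketch below models that state as a program counter plus 32 registers. The class, function names, thread names, and values are all hypothetical, and a real operating system saves more state than this.

```python
class Cpu:
    """Architectural state of a single-threaded processor (simplified)."""

    def __init__(self):
        self.pc = 0
        self.regs = [0] * 32


def save_state(cpu):
    """Copy the architectural state out of the processor."""
    return {"pc": cpu.pc, "regs": list(cpu.regs)}


def load_state(cpu, state):
    """Load a previously saved architectural state into the processor."""
    cpu.pc = state["pc"]
    cpu.regs = list(state["regs"])


def context_switch(cpu, saved, current, nxt):
    """Store the stalled thread's state, load the waiting thread's, return its name."""
    saved[current] = save_state(cpu)
    load_state(cpu, saved[nxt])
    return nxt


cpu = Cpu()
saved = {"typing": save_state(cpu), "spell_check": save_state(cpu)}

current = "typing"
cpu.pc, cpu.regs[10] = 0x100, 42            # the typing thread makes progress
current = context_switch(cpu, saved, current, "spell_check")   # typing stalls
cpu.pc, cpu.regs[10] = 0x200, 7             # the spell-check thread runs
current = context_switch(cpu, saved, current, "typing")        # switch back
print(current, hex(cpu.pc), cpu.regs[10])   # typing 0x100 42
```

The copying in `save_state` and `load_state` is the cost of a software context switch. Hardware multithreading avoids it by keeping a copy of the architectural state for each thread.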


Multithreading

• Multiple copies of architectural state
• Multiple threads active at once:
  – When one thread stalls, another runs immediately
  – If one thread can’t keep all execution units busy, another thread can use them
• Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput

Intel calls this “hyperthreading”


Multiprocessors

• Multiple processors (cores) with a method of communication between them
• Types:
  – Homogeneous: multiple cores with shared main memory
  – Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
  – Clusters: each core has own memory system
