Digital Design and Computer Architecture: RISC-V Edition
Chapter 7
David M. Harris and Sarah L. Harris
Harris & Harris © 2020 Elsevier
Source: pages.hmc.edu/harris/class/e85/lect23.pdf

Chapter 7 :: Microarchitecture

• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Advanced Microarchitecture

Review: Single-Cycle RISC-V Processor

[Figure: single-cycle RISC-V datapath. Blocks: PC, Instruction Memory, Register File, Extend, ALU, Data Memory, Control Unit, plus adders for PCPlus4 and PCTarget. Control Unit inputs: op (Instr 6:0), funct3 (Instr 14:12), funct7 bit 5 (Instr 30), Zero. Control signals: PCSrc, ResultSrc, MemWrite, ALUControl2:0, ALUSrc, ImmSrc1:0, RegWrite.]

Review: Multicycle RISC-V Processor

[Figure: multicycle RISC-V datapath. Blocks: PC, combined Instr / Data Memory, Register File, Extend, ALU, Control Unit. Non-architectural registers: OldPC, Instr, Data, A, WriteData, ALUOut. Control Unit inputs: op (Instr 6:0), funct3 (Instr 14:12), funct7 bit 5 (Instr 30), Zero. Control signals: PCWrite, AdrSrc, MemWrite, IRWrite, ResultSrc1:0, ALUControl2:0, ALUSrcA1:0, ALUSrcB1:0, ImmSrc1:0, RegWrite.]

Review: Multicycle Main FSM

State          Entry condition                             Outputs
S0: Fetch      Reset                                       AdrSrc = 0, IRWrite, ALUSrcA = 00, ALUSrcB = 10, ALUOp = 00, ResultSrc = 10, PCUpdate
S1: Decode     -                                           ALUSrcA = 01, ALUSrcB = 01, ALUOp = 00
S2: MemAdr     op = 0000011 (lw) OR op = 0100011 (sw)      ALUSrcA = 10, ALUSrcB = 01, ALUOp = 00
S3: MemRead    op = 0000011 (lw)                           ResultSrc = 00, AdrSrc = 1
S4: MemWB      -                                           ResultSrc = 01, RegWrite
S5: MemWrite   op = 0100011 (sw)                           ResultSrc = 00, AdrSrc = 1, MemWrite
S6: ExecuteR   op = 0110011 (R-type)                       ALUSrcA = 10, ALUSrcB = 00, ALUOp = 10
S7: ALUWB      -                                           ResultSrc = 00, RegWrite
S8: ExecuteI   op = 0010011 (I-type ALU)                   ALUSrcA = 10, ALUSrcB = 01, ALUOp = 10
S9: JAL        op = 1101111 (jal)                          ALUSrcA = 01, ALUSrcB = 10, ALUOp = 00, ResultSrc = 00, PCUpdate
S10: BEQ       op = 1100011 (beq)                          ALUSrcA = 10, ALUSrcB = 00, ALUOp = 01, ResultSrc = 00, Branch

State      Datapath µOp
Fetch      Instr ← Mem[PC]; PC ← PC+4
Decode     ALUOut ← PCTarget
MemAdr     ALUOut ← rs1 + imm
MemRead    Data ← Mem[ALUOut]
MemWB      rd ← Data
MemWrite   Mem[ALUOut] ← rs2
ExecuteR   ALUOut ← rs1 op rs2
ExecuteI   ALUOut ← rs1 op imm
ALUWB      rd ← ALUOut
BEQ        ALUResult = rs1-rs2; if Zero, PC = ALUOut
JAL        PC = ALUOut; ALUOut = PC+4
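The state sequence per instruction can be traced with a small next-state function. The sketch below is illustrative only: the transition targets (for example, ExecuteR, ExecuteI, and JAL all continuing to ALUWB) are inferred from the state names and opcode labels on the slide, and the names `next_state` and `cycles` are hypothetical.

```python
# Hypothetical sketch of the multicycle main FSM's next-state logic.
# Transition targets are inferred from the slide's state names and opcode labels.
OPCODES = {
    "lw": 0b0000011,
    "sw": 0b0100011,
    "R-type": 0b0110011,
    "I-type ALU": 0b0010011,
    "jal": 0b1101111,
    "beq": 0b1100011,
}


def next_state(state, op):
    """Return the state that follows `state` for the 7-bit opcode `op`."""
    if state == "Fetch":
        return "Decode"
    if state == "Decode":
        return {
            OPCODES["lw"]: "MemAdr",
            OPCODES["sw"]: "MemAdr",
            OPCODES["R-type"]: "ExecuteR",
            OPCODES["I-type ALU"]: "ExecuteI",
            OPCODES["jal"]: "JAL",
            OPCODES["beq"]: "BEQ",
        }[op]
    if state == "MemAdr":
        return "MemRead" if op == OPCODES["lw"] else "MemWrite"
    if state == "MemRead":
        return "MemWB"
    if state in ("ExecuteR", "ExecuteI", "JAL"):
        return "ALUWB"
    # MemWB, MemWrite, ALUWB, and BEQ finish the instruction.
    return "Fetch"


def cycles(op):
    """Count the states visited from Fetch until the FSM returns to Fetch."""
    state, count = "Fetch", 0
    while True:
        count += 1
        state = next_state(state, op)
        if state == "Fetch":
            return count


for name, op in OPCODES.items():
    print(name, cycles(op))
```

Under these assumed transitions, lw visits five states, beq visits three, and the other instruction types visit four.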

Advanced Microarchitecture

• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors

Deep Pipelining

• 10-20 stages typical
• Number of stages limited by:
  – Pipeline hazards
  – Sequencing overhead
  – Power
  – Cost
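The effect of sequencing overhead can be illustrated with a first-order timing model. This model is an assumption for illustration, not something stated on the slide: the logic delay is split evenly across stages, each stage pays a fixed register overhead, and hazards are ignored. The delay values in the usage loop are made up.

```python
def cycle_time_ps(total_logic_ps, stages, overhead_ps):
    """Assumed model: logic delay divides evenly; each stage adds fixed overhead."""
    return total_logic_ps / stages + overhead_ps


def speedup(total_logic_ps, stages, overhead_ps):
    """Clock-rate gain over an unpipelined design under the same model."""
    unpipelined = cycle_time_ps(total_logic_ps, 1, overhead_ps)
    return unpipelined / cycle_time_ps(total_logic_ps, stages, overhead_ps)


# Illustrative numbers only: 1000 ps of logic, 50 ps of overhead per stage.
for n in (5, 10, 20, 40):
    print(n, round(speedup(1000, n, 50), 2))
```

With these made-up numbers, doubling the stage count from 20 to 40 raises the speedup only from 10.5 to 14, because the fixed overhead becomes a larger share of each cycle.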

Micro-operations

• Decompose more complex instructions into a series of simple instructions called micro-operations (micro-ops or µ-ops)
• At run-time, complex instructions are decoded into one or more micro-ops
• Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)

Complex Op                   Micro-op Sequence
lw s1, 0(s2), postincr 4     lw   s1, 0(s2)
                             addi s2, s2, 4

Without µ-ops, would need 2nd write port on the register file
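A decoder that performs this expansion might look like the sketch below. The dictionary encoding of an instruction and the function name `decode_to_micro_ops` are hypothetical; only the post-increment load from the slide is handled.

```python
def decode_to_micro_ops(instr):
    """Expand a (hypothetical) post-increment load into simple micro-ops.

    Each resulting micro-op writes at most one register, so a single
    register-file write port is enough.
    """
    op = instr["op"]
    if op != "lw":
        raise ValueError(f"unsupported op in this sketch: {op}")
    rd, offset, base = instr["rd"], instr["offset"], instr["base"]
    micro_ops = [f"lw {rd}, {offset}({base})"]
    postincr = instr.get("postincr", 0)
    if postincr:
        micro_ops.append(f"addi {base}, {base}, {postincr}")
    return micro_ops


complex_op = {"op": "lw", "rd": "s1", "offset": 0, "base": "s2", "postincr": 4}
print(decode_to_micro_ops(complex_op))
```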

Branch Prediction

• Guess whether branch will be taken
  – Backward branches are usually taken (loops)
  – Consider history to improve guess
• Good prediction reduces fraction of branches requiring a flush

Branch Prediction

• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
  – Check direction of branch (forward or backward)
  – If backward, predict taken
  – Else, predict not taken
• Dynamic branch prediction:
  – Keep history of last several hundred (or thousand) branches in branch target buffer, record:
    • Branch destination
    • Whether branch was taken
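The static rule reduces to a single address comparison. A minimal sketch, assuming the branch's own address and its target address are both known at prediction time (the function name and the example addresses are hypothetical):

```python
def predict_taken_static(branch_pc, target_pc):
    """Static rule from the slide: backward branches are predicted taken."""
    return target_pc < branch_pc


# Hypothetical addresses: a loop-closing branch at 0x20 jumping back to 0x10,
# and a loop-exit branch at 0x0C jumping forward to 0x1C.
print(predict_taken_static(0x20, 0x10))  # backward
print(predict_taken_static(0x0C, 0x1C))  # forward
```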

Branch Prediction Example

      addi s1, zero, 0      # s1 = sum
      addi s0, zero, 0      # s0 = i
      addi t0, zero, 10     # t0 = 10
For:                        # for (i=0; i<10; i=i+1)
      bge  s0, t0, Done
      add  s1, s1, s0       # sum = sum + i
      addi s0, s0, 1        # i = i + 1
      j    For
Done:

1-Bit Branch Predictor

• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
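The two mispredictions can be reproduced by simulating the `bge s0, t0, Done` branch from the example loop. This is a sketch under one assumption: the loop has already run once, so the predictor's stored bit is "taken" (left over from the previous loop exit). The function names are hypothetical.

```python
def loop_branch_outcomes(iterations=10):
    """Outcomes of the loop-exit bge: not taken each iteration, taken at exit."""
    return [False] * iterations + [True]


def count_mispredictions_1bit(outcomes, last_taken):
    """Predict that the branch repeats its previous outcome."""
    mispredictions = 0
    for taken in outcomes:
        if taken != last_taken:
            mispredictions += 1
        last_taken = taken  # remember only the most recent outcome
    return mispredictions


# Assumed initial state: "taken", left by the exit of a previous run of the loop.
print(count_mispredictions_1bit(loop_branch_outcomes(), last_taken=True))
```

The first execution is mispredicted because the stored bit still reflects the previous exit, and the last execution is mispredicted because the bit reflects the ten not-taken iterations.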

2-Bit Branch Predictor

Only mispredicts last branch of loop

[Figure: state transition diagram with four states: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), strongly not taken (predict not taken). Edges are labeled with the branch outcome, taken or not taken.]
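A 2-bit predictor is commonly modeled as a saturating counter, which is the assumption made in this sketch: states 0 to 3 stand for strongly not taken, weakly not taken, weakly taken, and strongly taken, and each outcome moves the counter one step. The branch simulated is the loop-exit `bge` from the example, run twice in a row; all names are hypothetical.

```python
def count_mispredictions_2bit(outcomes, state):
    """Simulate a 2-bit saturating counter.

    state: 0 = strongly not taken, 1 = weakly not taken,
           2 = weakly taken,       3 = strongly taken.
    Returns (mispredictions, final_state).
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions, state


# Loop-exit branch: not taken for 10 iterations, then taken once at exit.
outcomes = [False] * 10 + [True]
first_run = count_mispredictions_2bit(outcomes, state=0)
second_run = count_mispredictions_2bit(outcomes, state=first_run[1])
print(first_run, second_run)
```

A single taken outcome at loop exit only moves the counter to weakly not taken, so the next run of the loop still starts with a correct not-taken prediction.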

Superscalar

• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once

[Figure: two-way superscalar datapath. PC and Instruction Memory fetch two instructions per cycle; a six-port Register File (read ports A1, A2, A4, A5 with outputs RD1, RD2, RD4, RD5; write ports A3/WD3 and A6/WD6) feeds two ALUs; Data Memory has two ports (A1, A2, WD1, WD2, RD1, RD2).]

Superscalar Example

LDR R8, [R0, #40]
ADD R9, R1, R2
SUB R10, R1, R3
AND R11, R3, R4
ORR R12, R1, R5
STR R5, [R0, #80]

Ideal IPC: 2
Actual IPC: 2

[Figure: pipeline diagram over cycles 1-8 showing the six instructions issued in pairs (LDR with ADD, SUB with AND, ORR with STR) through IM, RF, ALU, DM, and RF write-back.]

Superscalar with Dependencies

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/5 = 1.2

[Figure: pipeline diagram over cycles 1-9 showing the six instructions issued in program order, with a stall where an instruction must wait on a dependency.]

Out of Order Processor

• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as no dependencies)
• Dependencies:
  – RAW (read after write): one instruction writes, later instruction reads a register
  – WAR (write after read): one instruction reads, later instruction writes a register
  – WAW (write after write): one instruction writes, later instruction writes a register
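The three dependency types can be checked mechanically from each instruction's destination and source registers. The sketch below applies the definitions pairwise to the six-instruction program from the superscalar example; the tuple encoding `(destination, sources)` and the function name are hypothetical, and the check does not account for intervening instructions that overwrite a register.

```python
def classify_hazards(earlier, later):
    """Return the set of dependency types between two instructions.

    Each instruction is (destination_register_or_None, [source_registers]).
    """
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    hazards = set()
    if e_dst is not None and e_dst in l_srcs:
        hazards.add("RAW")  # earlier writes, later reads
    if l_dst is not None and l_dst in e_srcs:
        hazards.add("WAR")  # earlier reads, later writes
    if e_dst is not None and e_dst == l_dst:
        hazards.add("WAW")  # both write the same register
    return hazards


LDR = ("R8", ["R0"])          # LDR R8, [R0, #40]
ADD = ("R9", ["R8", "R1"])    # ADD R9, R8, R1
SUB = ("R8", ["R2", "R3"])    # SUB R8, R2, R3
AND = ("R10", ["R4", "R8"])   # AND R10, R4, R8
ORR = ("R11", ["R5", "R6"])   # ORR R11, R5, R6
STR = (None, ["R7", "R11"])   # STR R7, [R11, #80] (writes memory, no register)

print(classify_hazards(LDR, ADD), classify_hazards(ADD, SUB), classify_hazards(LDR, SUB))
```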

Out of Order Processor

• Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
• Scoreboard: table that keeps track of:
  – Instructions waiting to issue
  – Available functional units
  – Dependencies

Out of Order Processor Example

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/4 = 1.5

[Figure: pipeline diagram over cycles 1-8 with the instructions drawn in issue order: LDR, ORR, STR, ADD, SUB, AND. Annotations: two-cycle latency between load and use of R8; RAW (LDR to ADD on R8); WAR (ADD to SUB on R8); RAW (SUB to AND on R8); RAW (ORR to STR on R11).]

Register Renaming

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/3 = 2

With R8 renamed to T0 in SUB and AND:

LDR R8, [R0, #40]
ADD R9, R8, R1
SUB T0, R2, R3
AND R10, R4, T0
ORR R11, R5, R6
STR R7, [R11, #80]

[Figure: pipeline diagram over cycles 1-7 with the instructions drawn in issue order: LDR, SUB, AND, ORR, ADD, STR. Annotations: 2-cycle RAW (LDR to ADD on R8); RAW (SUB to AND on T0); RAW (ORR to STR on R11).]
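The renaming shown on this slide can be reproduced with a simple pass over the program. This is a deliberately simplified sketch, not how hardware renaming is organized: it allocates a fresh temporary (T0, T1, ...) whenever an instruction writes a register that an earlier instruction already read or wrote, and it redirects later reads to the newest name. The tuple encoding `(op, destination, sources)` and the function name are hypothetical.

```python
def rename(program):
    """Rename destinations that would cause WAR or WAW dependencies.

    program: list of (op, destination_or_None, [sources]).
    """
    mapping = {}    # architectural register -> current (possibly renamed) name
    seen = set()    # architectural registers read or written so far
    next_temp = 0
    renamed = []
    for op, dst, srcs in program:
        new_srcs = [mapping.get(r, r) for r in srcs]  # read the newest names
        seen.update(srcs)
        new_dst = dst
        if dst is not None:
            if dst in seen:
                # An earlier instruction used this register: give the write a new home.
                new_dst = f"T{next_temp}"
                next_temp += 1
                mapping[dst] = new_dst
            seen.add(dst)
        renamed.append((op, new_dst, new_srcs))
    return renamed


program = [
    ("LDR", "R8", ["R0"]),
    ("ADD", "R9", ["R8", "R1"]),
    ("SUB", "R8", ["R2", "R3"]),
    ("AND", "R10", ["R4", "R8"]),
    ("ORR", "R11", ["R5", "R6"]),
    ("STR", None, ["R7", "R11"]),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, SUB no longer has a WAR dependency on ADD, so it is free to issue before ADD.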

SIMD

• Single Instruction Multiple Data (SIMD)
  – Single instruction acts on multiple pieces of data at once
  – Common application: graphics
  – Perform short arithmetic operations (also called packed arithmetic)
• For example, add eight 8-bit elements

Bit position   63-56   55-48   47-40   39-32   31-24   23-16   15-8    7-0
D0             a7      a6      a5      a4      a3      a2      a1      a0
D1             b7      b6      b5      b4      b3      b2      b1      b0
D2 = D0 + D1   a7+b7   a6+b6   a5+b5   a4+b4   a3+b3   a2+b2   a1+b1   a0+b0
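The packed add can be emulated on ordinary integers to show that the lanes are independent. In this sketch, each 64-bit value holds eight 8-bit elements, and each lane is assumed to wrap around on overflow (the slide does not say whether the add wraps or saturates); the function name is hypothetical.

```python
def packed_add8(d0, d1):
    """Add eight 8-bit lanes of two 64-bit values; carries do not cross lanes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (d0 >> shift) & 0xFF
        b = (d1 >> shift) & 0xFF
        # Assumed behavior: each lane wraps modulo 256.
        result |= ((a + b) & 0xFF) << shift
    return result


print(hex(packed_add8(0x0102030405060708, 0x1010101010101010)))
print(hex(packed_add8(0x00FF, 0x0001)))  # lane 0 overflows; lane 1 is unaffected
```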

Advanced Architecture Techniques

• Multithreading
  – Word processor: thread for typing, spell checking, printing
• Multiprocessors
  – Multiple processors (cores) on a single chip

Threading: Definitions

• Process: program running on a computer
  – Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper
• Thread: part of a program
  – Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing

Threads in Conventional Processor

• One thread runs at once
• When one thread stalls (for example, waiting for memory):
  – Architectural state of that thread stored
  – Architectural state of waiting thread loaded into processor and it runs
  – Called context switching
• Appears to user like all threads running simultaneously

Multithreading

• Multiple copies of architectural state
• Multiple threads active at once:
  – When one thread stalls, another runs immediately
  – If one thread can’t keep all execution units busy, another thread can use them
• Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput

Intel calls this “hyperthreading”

Multiprocessors

• Multiple processors (cores) with a method of communication between them
• Types:
  – Homogeneous: multiple cores with shared main memory
  – Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
  – Clusters: each core has own memory system

