EECS 470: Computer Architecture
Lecture 3
Coverage: Appendix A
rev 2
Role of the Compiler
• The primary user of the instruction set
  – Exceptions: getting less common
    • some device drivers; specialized library routines
    • some small embedded systems (synthesized architectures)
• Compilers must:
  – generate a correct translation into machine code
• Compilers should:
  – compile quickly; generate fast code
• While we are at it:
  – generate reasonable code size; good debug support
Structure of Compilers
• Front-end: translates high-level semantics into a generic intermediate form
  – The intermediate form has no resource constraints, but uses simple instructions.
• Back-end: translates the intermediate form into assembly/machine code for the target architecture
  – Resource allocation; code optimization under resource constraints
• Architects are mostly concerned with the optimization phase.
Typical optimizations: CSE
• Common sub-expression elimination:

  c = array1[d+e] / array2[d+e];

  becomes

  i = d + e;
  c = array1[i] / array2[i];

• Purpose:
  – reduce instruction count / faster code
• Architectural issues:
  – more register pressure
Typical optimization: LICM
• Loop-invariant code motion:

  for (i = 0; i < 100; i++) {
      t = 5;
      array1[i] = t;
  }

  The assignment t = 5 can be hoisted out of the loop.

• Purpose:
  – remove statements or expressions from loops that need only be executed once (idempotent)
• Architectural issues:
  – more register pressure
Other transformations
• Procedure inlining: better instruction scheduling
  – greater code size; more register pressure
• Loop unrolling: better loop scheduling
  – greater code size; more register pressure
• Software pipelining: better loop scheduling
  – greater code size; more register pressure
• In general, “global” optimization: faster code
  – greater code size; more register pressure
Compiled code characteristics
• Optimized code has different characteristics than unoptimized code.
  – Fewer memory references, but it is generally the “easy” ones that are eliminated
    • Example: better register allocation keeps active data in the register file; those references would have been cache hits in unoptimized code.
  – Removing redundant memory and ALU operations leaves a higher ratio of branches in the code
    • Branch prediction becomes more important
• Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure.
What do compiler writers want in an instruction set architecture?
• More resources: better optimization tradeoffs
• Regularity: same behavior in all contexts
  – no special cases (e.g., flags set differently for immediates)
• Orthogonality:
  – data type independent of addressing mode
  – addressing mode independent of operation performed
• Primitives, not solutions:
  – keep instructions simple
  – it is easier to compose than to fit (e.g., MMX operations)
What do architects want in an instruction set architecture?
• Simple instruction decode:
  – tends to increase orthogonality
• Small structures:
  – more resource constraints
• Small data bus fanout:
  – tends to reduce orthogonality and regularity
• Small instructions:
  – make things implicit
  – non-regular, non-orthogonal, non-primitive
To make faster processors
• Make the compiler team unhappy
  – More aggressive optimization over the entire program
  – More resource constraints; caches; HW schedulers
  – Higher expectations: increase IPC
• Make the hardware design team unhappy
  – Tighter design constraints (clock)
  – Execute optimized code with more complex execution characteristics
  – Make all stages bottlenecks (Amdahl’s law)
Review of basic pipelining
• 5-stage “RISC” load-store architecture
  – About as simple as things get
1. Instruction fetch: get the instruction from memory/cache
2. Instruction decode: translate the opcode into control signals and read the registers
3. Execute: perform the ALU operation
4. Memory: access memory if load/store
5. Writeback/retire: update the register file
Pipelined implementation
• Break the execution of the instruction into cycles (5 in this case).
• Design a separate datapath stage for the execution performed during each cycle.
• Build pipeline registers to communicate between the stages.
Stage 1: Fetch
• Design a datapath that can fetch an instruction from memory every cycle.
  – Use the PC to index memory to read the instruction
  – Increment the PC (assume no branches for now)
• Write everything needed to complete execution to the pipeline register (IF/ID)
  – The next stage will read this pipeline register.
  – Note that the pipeline register must be edge triggered.
[Datapath diagram, Stage 1 (Fetch): the PC indexes the instruction memory/cache; a +1 adder and a MUX produce the next PC; the instruction bits and PC+1 are written to the IF/ID pipeline register, which feeds the rest of the pipelined datapath.]
Stage 2: Decode
• Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by the regA and regB instruction bits).
  – Decode can be easy: just pass on the opcode and let later stages figure out their own control signals for the instruction.
• Write everything needed to complete execution to the pipeline register (ID/EX)
  – Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!)
  – Include PC+1 even though decode didn’t use it.
[Datapath diagram, Stage 2 (Decode): the Stage 1 fetch datapath feeds the IF/ID register; regA and regB index the register file; the contents of regA, the contents of regB, PC+1, and control signals are written to the ID/EX pipeline register. The register file’s write port (dest reg, data) is driven from the writeback stage.]
Stage 3: Execute
• Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register.
  – The inputs are the contents of regA and either the contents of regB or the offset field of the instruction.
  – Also calculate PC+1+offset in case this is a branch.
• Write everything needed to complete execution to the pipeline register (EX/Mem)
  – ALU result, contents of regB, and PC+1+offset
  – Instruction bits for the opcode and destReg specifiers
[Datapath diagram, Stage 3 (Execute): a MUX selects the contents of regB or the offset as the second ALU input; the ALU result, PC+1+offset, the contents of regB, and control signals are written to the EX/Mem pipeline register.]
Stage 4: Memory Operation
• Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register.
  – The ALU result contains the address for ld and st instructions.
  – Opcode bits control the memory R/W and enable signals.
• Write everything needed to complete execution to the pipeline register (Mem/WB)
  – ALU result and MemData
  – Instruction bits for the opcode and destReg specifiers
[Datapath diagram, Stage 4 (Memory): the ALU result addresses the data memory (en and R/W driven by control signals); the ALU result, memory read data, and control signals are written to the Mem/WB pipeline register. PC+1+offset and the MUX control for the PC input go back to the MUX before the PC in Stage 1.]
Stage 5: Write back
• Design a datapath that completes the execution of this instruction, writing to the register file if required.
  – Write MemData to destReg for the ld instruction.
  – Write the ALU result to destReg for add or nand instructions.
  – Opcode bits also control the register write enable signal.
[Datapath diagram, Stage 5 (Writeback): a MUX selects memory read data or the ALU result as the register file’s data input; another MUX selects the destination register specifier from instruction bits 0-2 or 16-18; opcode bits drive the register write enable.]
Sample Code (Simple)
• Run the following code on the pipelined datapath:

  add  1 2 3   ; reg3 = reg1 + reg2
  nand 4 5 6   ; reg6 = ~(reg4 & reg5)
  lw   2 4 20  ; reg4 = Mem[reg2 + 20]
  add  2 5 5   ; reg5 = reg2 + reg5
  sw   3 7 10  ; Mem[reg3 + 10] = reg7
[Full pipeline datapath diagram with labeled fields: IF/ID holds the instruction and PC+1; ID/EX holds op, dest, offset, valA, valB, and PC+1; EX/Mem holds op, dest, ALUresult, valB, the branch target, and eq?; Mem/WB holds op, dest, ALUresult, and mdata.]
[Datapath state, Initial State: all pipeline registers hold noops; register file contents: R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22.]
[Datapath state, Time 1. Fetch: add 1 2 3; all later stages still hold noops.]
[Datapath state, Time 2. Fetch: nand 4 5 6; Decode: add 1 2 3 (reads R1=36 and R2=9).]
[Datapath state, Time 3. Fetch: lw 2 4 20; Decode: nand 4 5 6 (reads R4=18 and R5=7); Execute: add computes 36 + 9 = 45.]
[Datapath state, Time 4. Fetch: add 2 5 5; Decode: lw 2 4 20; Execute: nand computes ~(18 & 7) = -3; Memory: add passes its result (45) through.]
[Datapath state, Time 5. Fetch: sw 3 7 10; Decode: add 2 5 5; Execute: lw computes address 9 + 20 = 29; Memory: nand passes -3 through; Writeback: add writes 45 to R3.]
[Datapath state, Time 6. No more instructions to fetch. Decode: sw 3 7 10; Execute: add computes 9 + 7 = 16; Memory: lw reads Mem[29] = 99; Writeback: nand writes -3 to R6.]
[Datapath state, Time 7. Execute: sw computes address 45 + 10 = 55; Memory: add passes 16 through; Writeback: lw writes 99 to R4.]
[Datapath state, Time 8. Memory: sw stores R7 (22) to Mem[55]; Writeback: add writes 16 to R5.]
[Datapath state, Time 9. Writeback: sw completes; no register write.]
Time graphs
Time:   1      2       3        4        5          6          7          8         9
add     fetch  decode  execute  memory   writeback
nand           fetch   decode   execute  memory     writeback
lw                     fetch    decode   execute    memory     writeback
add                             fetch    decode     execute    memory     writeback
sw                                       fetch      decode     execute    memory    writeback
What can go wrong?
• Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written.
• Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that?
• Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?