EECS 470: Computer Architecture
Lecture 3
Coverage: Appendix A
rev 2
Role of the Compiler
• The primary user of the instruction set
  – Exceptions: getting less common
    • some device drivers; specialized library routines
    • some small embedded systems (synthesized architectures)
• Compilers must:
  – generate a correct translation into machine code
• Compilers should:
  – compile quickly; generate fast code
• While we are at it:
  – generate reasonable code size; good debug support
Structure of Compilers
• Front-end: translates high-level semantics into a generic intermediate form
  – The intermediate form has no resource constraints, but uses simple instructions.
• Back-end: translates the intermediate form into assembly/machine code for the target architecture
  – Resource allocation; code optimization under resource constraints
• Architects are mostly concerned with the optimization phase.
Typical optimizations: CSE
• Common sub-expression elimination:

  c = array1[d+e] / array2[d+e];

  becomes

  i = d + e;
  c = array1[i] / array2[i];

• Purpose:
  – reduce instruction count / faster code
• Architectural issues:
  – more register pressure
Typical optimization: LICM
• Loop-invariant code motion:

  for (i = 0; i < 100; i++) {
      t = 5;
      array1[i] = t;
  }

  The assignment t = 5 can be hoisted out of the loop.

• Purpose:
  – remove statements or expressions from loops that need only be executed once (idempotent)
• Architectural issues:
  – more register pressure
Other transformations
• Procedure inlining: better instruction scheduling
  – greater code size; more register pressure
• Loop unrolling: better loop scheduling
  – greater code size; more register pressure
• Software pipelining: better loop scheduling
  – greater code size; more register pressure
• In general, “global” optimization: faster code
  – greater code size; more register pressure
Compiled code characteristics
• Optimized code has different characteristics than unoptimized code.
  – Fewer memory references, but it is generally the “easy” ones that are eliminated
    • Example: better register allocation keeps active data in the register file; those references would have been cache hits in unoptimized code.
  – Removing redundant memory and ALU operations leaves a higher ratio of branches in the code
    • Branch prediction becomes more important
• Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure.
What do compiler writers want in an instruction set architecture?
• More resources: better optimization tradeoffs
• Regularity: same behavior in all contexts
  – no special cases (e.g., flags set differently for immediates)
• Orthogonality:
  – data type independent of addressing mode
  – addressing mode independent of operation performed
• Primitives, not solutions:
  – keep instructions simple
  – it is easier to compose than to fit (e.g., MMX operations)
What do architects want in an instruction set architecture?
• Simple instruction decode:
  – tends to increase orthogonality
• Small structures:
  – more resource constraints
• Small data bus fanout:
  – tends to reduce orthogonality and regularity
• Small instructions:
  – make things implicit
  – non-regular, non-orthogonal, non-primitive
To make faster processors
• Make the compiler team unhappy
  – More aggressive optimization over the entire program
  – More resource constraints; caches; HW schedulers
  – Higher expectations: increase IPC
• Make the hardware design team unhappy
  – Tighter design constraints (clock)
  – Execute optimized code with more complex execution characteristics
  – Make all stages bottlenecks (Amdahl’s law)
Review of basic pipelining
• 5-stage “RISC” load-store architecture
  – About as simple as things get
1. Instruction fetch: get the instruction from memory/cache
2. Instruction decode: translate the opcode into control signals and read the registers
3. Execute: perform the ALU operation
4. Memory: access memory if load/store
5. Writeback/retire: update the register file
Pipelined implementation
• Break the execution of the instruction into cycles (5 in this case).
• Design a separate datapath stage for the execution performed during each cycle.
• Build pipeline registers to communicate between the stages.
Stage 1: Fetch
• Design a datapath that can fetch an instruction from memory every cycle.
  – Use the PC to index memory to read the instruction
  – Increment the PC (assume no branches for now)
• Write everything needed to complete execution to the pipeline register (IF/ID)
  – The next stage will read this pipeline register.
  – Note that the pipeline register must be edge triggered.
[Datapath diagram, Stage 1 (Fetch): the PC indexes the instruction memory/cache; a +1 adder and a MUX produce the next PC; the instruction bits and PC+1 are written to the IF/ID pipeline register, which feeds the rest of the pipelined datapath.]
Stage 2: Decode
• Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by the regA and regB instruction bits).
  – Decode can be easy: just pass on the opcode and let later stages figure out their own control signals for the instruction.
• Write everything needed to complete execution to the pipeline register (ID/EX)
  – Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!)
  – Include PC+1 even though decode didn’t use it.
[Datapath diagram, Stage 2 (Decode): the Stage 1 fetch datapath feeds the IF/ID register; regA and regB index the register file; the contents of regA, the contents of regB, PC+1, and control signals are written to the ID/EX pipeline register. The register file’s write port (dest reg, data) is driven from the writeback stage.]
Stage 3: Execute
• Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register.
  – The inputs are the contents of regA and either the contents of regB or the offset field of the instruction.
  – Also calculate PC+1+offset in case this is a branch.
• Write everything needed to complete execution to the pipeline register (EX/Mem)
  – ALU result, contents of regB, and PC+1+offset
  – Instruction bits for the opcode and destReg specifiers
[Datapath diagram, Stage 3 (Execute): a MUX selects the contents of regB or the offset as the second ALU input; the ALU result, PC+1+offset, the contents of regB, and control signals are written to the EX/Mem pipeline register.]
Stage 4: Memory Operation
• Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register.
  – The ALU result contains the address for ld and st instructions.
  – Opcode bits control the memory R/W and enable signals.
• Write everything needed to complete execution to the pipeline register (Mem/WB)
  – ALU result and MemData
  – Instruction bits for the opcode and destReg specifiers
[Datapath diagram, Stage 4 (Memory): the ALU result addresses the data memory (en and R/W driven by control signals); the ALU result, memory read data, and control signals are written to the Mem/WB pipeline register. PC+1+offset and the MUX control for the PC input go back to the MUX before the PC in Stage 1.]
Stage 5: Write back
• Design a datapath that completes the execution of this instruction, writing to the register file if required.
  – Write MemData to destReg for the ld instruction.
  – Write the ALU result to destReg for add or nand instructions.
  – Opcode bits also control the register write enable signal.
[Datapath diagram, Stage 5 (Writeback): a MUX selects memory read data or the ALU result as the register file’s data input; another MUX selects the destination register specifier from instruction bits 0-2 or 16-18; opcode bits drive the register write enable.]
Sample Code (Simple)
• Run the following code on the pipelined datapath:

  add  1 2 3   ; reg3 = reg1 + reg2
  nand 4 5 6   ; reg6 = ~(reg4 & reg5)
  lw   2 4 20  ; reg4 = Mem[reg2 + 20]
  add  2 5 5   ; reg5 = reg2 + reg5
  sw   3 7 10  ; Mem[reg3 + 10] = reg7
[Full pipeline datapath diagram with labeled fields: IF/ID holds the instruction and PC+1; ID/EX holds op, dest, offset, valA, valB, and PC+1; EX/Mem holds op, dest, ALUresult, valB, the branch target, and eq?; Mem/WB holds op, dest, ALUresult, and mdata.]
[Datapath state, Initial State: all pipeline registers hold noops; register file contents: R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22.]
[Datapath state, Time 1. Fetch: add 1 2 3; all later stages still hold noops.]
[Datapath state, Time 2. Fetch: nand 4 5 6; Decode: add 1 2 3 (reads R1=36 and R2=9).]
[Datapath state, Time 3. Fetch: lw 2 4 20; Decode: nand 4 5 6 (reads R4=18 and R5=7); Execute: add computes 36 + 9 = 45.]
[Datapath state, Time 4. Fetch: add 2 5 5; Decode: lw 2 4 20; Execute: nand computes ~(18 & 7) = -3; Memory: add passes its result (45) through.]
[Datapath state, Time 5. Fetch: sw 3 7 10; Decode: add 2 5 5; Execute: lw computes address 9 + 20 = 29; Memory: nand passes -3 through; Writeback: add writes 45 to R3.]
[Datapath state, Time 6. No more instructions to fetch. Decode: sw 3 7 10; Execute: add computes 9 + 7 = 16; Memory: lw reads Mem[29] = 99; Writeback: nand writes -3 to R6.]
[Datapath state, Time 7. Execute: sw computes address 45 + 10 = 55; Memory: add passes 16 through; Writeback: lw writes 99 to R4.]
[Datapath state, Time 8. Memory: sw stores R7 (22) to Mem[55]; Writeback: add writes 16 to R5.]
[Datapath state, Time 9. Writeback: sw completes; no register write.]
Time graphs
Time:   1      2       3        4        5          6          7          8         9
add     fetch  decode  execute  memory   writeback
nand           fetch   decode   execute  memory     writeback
lw                     fetch    decode   execute    memory     writeback
add                             fetch    decode     execute    memory     writeback
sw                                       fetch      decode     execute    memory    writeback
What can go wrong?
• Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written.
• Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that?
• Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?