331 W07.1 Fall 2003
14:332:331 Computer Architecture and Assembly Language
Fall 2003
Week 7
[Adapted from Dave Patterson’s UCB CS152 slides and
Mary Jane Irwin’s PSU CSE331 slides]
Heads Up
This week’s material
- MIPS logic and multiply instructions
  - Reading assignment – PH 4.4
- MIPS ALU design
  - Reading assignment – PH 4.5
Next week’s material
- Building a MIPS datapath
  - Reading assignment – PH 5.1-5.2
Review: MIPS Arithmetic Instructions
Field boundaries (bit positions): 31 25 20 15 5 0
R-type: op | Rs | Rt | Rd | funct
I-type: op | Rs | Rt | Immed 16

Type   op   funct
ADD    00   100000
ADDU   00   100001
SUB    00   100010
SUBU   00   100011
AND    00   100100
OR     00   100101
XOR    00   100110
NOR    00   100111
       00   101000
       00   101001
SLT    00   101010
SLTU   00   101011
       00   101100

ALU operation codes (m, in hex):
0 add
1 addu
2 sub
3 subu
4 and
5 or
6 xor
7 nor
a slt
b sltu
- expand immediates to 32 bits before the ALU
- 10 operations, so they can be encoded in 4 bits
[Figure: ALU with 32-bit inputs A and B, 4-bit operation select m, 32-bit result, and zero and ovf outputs]
Review: A 32-bit Adder/Subtractor
Built out of 32 full adders (FAs)
[Figure: 32 one-bit FAs in a chain; FA i has inputs Ai, Bi and a carry input and produces Si and a carry output, with c0 = carry_in, c32 = carry_out, and an add/subt control]
1-bit FA (inputs A, B, carry_in; outputs S, carry_out):
S = A xor B xor carry_in
carry_out = A·B v A·carry_in v B·carry_in (majority function)
Small but slow!
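The full-adder equations above can be chained into a behavioral model of the ripple-carry adder/subtractor. The sketch below is only an illustration: the function names are hypothetical, and it assumes the add/subt control inverts each bit of B and also drives c0, which is the usual two's complement arrangement but is not spelled out on the slide.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)  # majority function
    return s, carry_out


def ripple_add_sub(a, b, subtract=0, width=32):
    """Add (subtract=0) or subtract (subtract=1) two width-bit words.

    Assumption: subtraction is A + not(B) + 1, so the control bit is
    XORed into every B bit and also used as c0.
    Returns (result, carry_out).
    """
    carry = subtract              # c0 = add/subt control
    result = 0
    for i in range(width):        # the carry ripples from bit 0 upward
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract
        s_i, carry = full_adder(a_i, b_i, carry)
        result |= s_i << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add_sub(7, 3))              # add
    print(ripple_add_sub(7, 3, subtract=1))  # subtract
```

The loop makes the "small but slow" point visible: bit i cannot be finished until the carry from bit i-1 is known.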
Minimal Implementation of a Full Adder
architecture concurrent_behavior of full_adder is
  signal t1, t2, t3, t4, t5: std_logic;
begin
  t1 <= not A after 1 ns;                            -- inverter
  t2 <= not cin after 1 ns;                          -- inverter
  t4 <= not((A or cin) and B) after 2 ns;            -- or-and-inverter
  t3 <= not((t1 or t2) and (A or cin)) after 2 ns;   -- A xnor cin
  t5 <= t3 nand B after 2 ns;                        -- 2-input nand
  S <= not((B or t3) and t5) after 2 ns;             -- A xor B xor cin
  cout <= not((t1 or t2) and t4) after 2 ns;         -- majority of A, B, cin
end concurrent_behavior;
Can you create the equivalent schematic? Can you determine worst case delay (the worst case timing path through the circuit)?
Gate library: inverters, 2-input nands, or-and-inverters
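One way to check the gate network is to evaluate it for all eight input combinations and compare it with the full-adder equations. The sketch below does that in Python and also computes a simple longest-path settle time from the "after" delays. It assumes the cout assignment is meant to read not((t1 or t2) and t4), that all primary inputs change at t = 0, and it ignores false paths; the function and table names are hypothetical.

```python
from itertools import product


def gate_level_full_adder(a, b, cin):
    """Evaluate the seven signal assignments of the architecture once."""
    def inv(x):
        return x ^ 1

    t1 = inv(a)
    t2 = inv(cin)
    t4 = inv((a | cin) & b)
    t3 = inv((t1 | t2) & (a | cin))
    t5 = inv(t3 & b)                      # nand
    s = inv((b | t3) & t5)
    cout = inv((t1 | t2) & t4)
    return s, cout


def matches_full_adder():
    """True if the network equals S = A xor B xor cin, cout = majority."""
    for a, b, cin in product((0, 1), repeat=3):
        expected = (a ^ b ^ cin, (a & b) | (a & cin) | (b & cin))
        if gate_level_full_adder(a, b, cin) != expected:
            return False
    return True


# (delay in ns, inputs) per signal, read off the "after" clauses.
GATES = {
    "t1": (1, ("A",)),
    "t2": (1, ("cin",)),
    "t4": (2, ("A", "cin", "B")),
    "t3": (2, ("t1", "t2", "A", "cin")),
    "t5": (2, ("t3", "B")),
    "S": (2, ("B", "t3", "t5")),
    "cout": (2, ("t1", "t2", "t4")),
}


def settle_time(signal):
    """Latest arrival time in ns along any path to the signal."""
    if signal not in GATES:               # primary input
        return 0
    delay, inputs = GATES[signal]
    return delay + max(settle_time(x) for x in inputs)


if __name__ == "__main__":
    print(matches_full_adder(), settle_time("S"), settle_time("cout"))
```

Under these assumptions the longest path runs through t1 (or t2), t3, t5 and S.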
Logic Operations
Logic operations operate on individual bits of the operand.
$t2 = 0…0 0000 1101 0000
$t1 = 0…0 0011 1100 0000
and $t0, $t1, $t2    $t0 =
or  $t0, $t1, $t2    $t0 =
xor $t0, $t1, $t2    $t0 =
nor $t0, $t1, $t2    $t0 =
How do we expand our FA design to handle the logic operations - and, or, xor, nor ?
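The four bitwise results for the register values on this slide can be worked out with a short script. This is a minimal sketch: Python integers are unbounded, so a 32-bit mask (an assumption of the model, named MASK32 here) is applied to imitate a MIPS register, which matters only for nor.

```python
MASK32 = 0xFFFFFFFF  # keep results to 32 bits, like a MIPS register


def mips_and(x, y):
    return x & y & MASK32


def mips_or(x, y):
    return (x | y) & MASK32


def mips_xor(x, y):
    return (x ^ y) & MASK32


def mips_nor(x, y):
    return ~(x | y) & MASK32


T1 = 0b0011_1100_0000   # $t1 from the slide (upper bits are 0)
T2 = 0b0000_1101_0000   # $t2 from the slide (upper bits are 0)

if __name__ == "__main__":
    for name, op in (("and", mips_and), ("or", mips_or),
                     ("xor", mips_xor), ("nor", mips_nor)):
        print(f"{name:3} $t0 = {op(T1, T2):032b}")
```

Note that nor sets all of the upper bits, because the upper bits of both operands are 0.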
A Simple ALU Cell
[Figure: ALU cell built around a 1-bit FA; signals shown: A, B, carry_in, carry_out, add/subt, op, result]
An Alternative ALU Cell
[Figure: ALU cell built around a 1-bit FA; signals shown: A, B, carry_in, carry_out, control inputs s2, s1, s0, result]
The Alternative ALU Cell’s Control Codes
s2 s1 s0 c_in   result       function
0  0  0  0      A            transfer A
0  0  0  1      A + 1        increment A
0  0  1  0      A + B        add
0  0  1  1      A + B + 1    add with carry
0  1  0  0      A – B – 1    subt with borrow
0  1  0  1      A – B        subtract
0  1  1  0      A – 1        decrement A
0  1  1  1      A            transfer A
1  0  0  x      A or B       or
1  0  1  x      A xor B      xor
1  1  0  x      A and B      and
1  1  1  x      !A           complement A
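The control-code table can be reproduced with a small word-level model. The sketch below rests on one reading of the table that the slides do not state explicitly: for the arithmetic rows (s2 = 0), s1 s0 select the second adder operand (0, B, not B, or all ones) and the result is A plus that operand plus c_in. The function name and the 4-bit default width are hypothetical.

```python
def alu_word(a, b, s2, s1, s0, c_in=0, width=4):
    """Word-level model of the control-code table (width-bit operands)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    if s2 == 0:
        # Arithmetic rows: pick the second adder operand, then add c_in.
        y = {(0, 0): 0, (0, 1): b, (1, 0): ~b & mask, (1, 1): mask}[(s1, s0)]
        return (a + y + c_in) & mask
    # Logic rows: c_in is a don't-care.
    if (s1, s0) == (0, 0):
        return a | b
    if (s1, s0) == (0, 1):
        return a ^ b
    if (s1, s0) == (1, 0):
        return a & b
    return ~a & mask


if __name__ == "__main__":
    # Walk the arithmetic rows of the table for A = 5, B = 3.
    for s1, s0, c_in in [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
                         (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]:
        print(0, s1, s0, c_in, alu_word(5, 3, 0, s1, s0, c_in))
```

With this reading, A – B – 1 is simply A + not(B), and A – 1 is A plus all ones, so every arithmetic row reuses the same adder.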
Tailoring the ALU to the MIPS ISA
Need to support the set-on-less-than instruction (slt)
- remember: slt is an arithmetic instruction
- produces a 1 if rs < rt and 0 otherwise
- use subtraction: (a - b) < 0 implies a < b
Need to support test for equality (beq)
- use subtraction: (a - b) = 0 implies a = b
Need to add the overflow detection hardware
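The two uses of subtraction can be sketched on 32-bit bit patterns. The slide's rule takes the sign of (a - b); the sketch below additionally XORs that sign with the overflow flag so the comparison stays correct when the subtraction overflows. That correction is an addition here, not something stated on the slide, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def subtract32(a, b):
    """Return (difference, overflow) for 32-bit two's complement a - b."""
    diff = (a - b) & MASK32
    sign_a = (a >> 31) & 1
    sign_b = (b >> 31) & 1
    sign_d = (diff >> 31) & 1
    # Overflow: operand signs differ and the result sign differs from a's.
    overflow = int(sign_a != sign_b and sign_d != sign_a)
    return diff, overflow


def slt(a, b):
    """1 if a < b as signed 32-bit values, else 0."""
    diff, overflow = subtract32(a, b)
    return ((diff >> 31) & 1) ^ overflow


def beq_taken(a, b):
    """1 if a == b, detected as (a - b) == 0."""
    diff, _ = subtract32(a, b)
    return int(diff == 0)


if __name__ == "__main__":
    print(slt(3, 5), slt(5, 3), beq_taken(7, 7))
```

Operands are passed as unsigned 32-bit patterns, so -1 is written 0xFFFFFFFF.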
Modifying the ALU Cell for slt
[Figure: ALU cell built around a 1-bit FA; signals shown: A, B, carry_in, carry_out, add/subt, op, less, result]
Modifying the ALU for slt
[Figure: 32 ALU cells (A0/B0/result0 through A31/B31/result31), each with a less input]
First perform a subtraction
Make the result 1 if the subtraction yields a negative result
Make the result 0 if the subtraction yields a zero or positive result
Modifying the ALU for Zero
[Figure: 32 ALU cells (A0/B0/result0 through A31/B31/result31) with less inputs, constant 0 inputs, a set signal, and add/subt and op controls]
First perform subtraction
Insert additional logic to detect when all result bits are zero
Review: Overflow Detection
Overflow: the result is too large to represent in the number of bits allocated
Overflow occurs when
- adding two positives yields a negative, or
- adding two negatives gives a positive, or
- subtracting a negative from a positive gives a negative, or
- subtracting a positive from a negative gives a positive
On your own: Prove you can detect overflow by: Carry into MSB xor Carry out of MSB
Example 1: 0111 (7) + 0011 (3) = 1010 (–6); carry into MSB = 1, carry out of MSB = 0
Example 2: 1100 (–4) + 1011 (–5) = 0111 (7); carry into MSB = 0, carry out of MSB = 1
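The carry-into-MSB xor carry-out-of-MSB rule can be checked by brute force for a small word size. The sketch below tries every pair of 4-bit operands and compares the rule against the true signed sum. It is an exhaustive check for one width, not the general proof the slide asks for, and the function names are hypothetical.

```python
def add_with_carries(a, b, width=4):
    """Ripple add; returns (result, carry_into_msb, carry_out_of_msb)."""
    carry = 0
    carry_into_msb = 0
    result = 0
    for i in range(width):
        if i == width - 1:
            carry_into_msb = carry
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        result |= (a_i ^ b_i ^ carry) << i
        carry = (a_i & b_i) | (a_i & carry) | (b_i & carry)
    return result, carry_into_msb, carry


def to_signed(x, width=4):
    """Interpret a width-bit pattern as a two's complement value."""
    return x - (1 << width) if (x >> (width - 1)) & 1 else x


def overflow_rule_holds(width=4):
    """True if (c_in xor c_out) at the MSB flags exactly the overflows."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    for a in range(1 << width):
        for b in range(1 << width):
            _, c_in, c_out = add_with_carries(a, b, width)
            true_sum = to_signed(a, width) + to_signed(b, width)
            overflowed = int(not (lo <= true_sum <= hi))
            if (c_in ^ c_out) != overflowed:
                return False
    return True


if __name__ == "__main__":
    print(add_with_carries(0b0111, 0b0011))  # 7 + 3
    print(add_with_carries(0b1100, 0b1011))  # -4 + -5
    print(overflow_rule_holds(4))
```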
Modifying the ALU for Overflow
[Figure: 32 ALU cells (A0/B0/result0 through A31/B31/result31) with less inputs, constant 0 inputs, a set signal, add/subt and op controls, and zero and overflow outputs]
Modify the most significant cell to determine overflow output setting
Disable overflow bit setting for unsigned arithmetic
Example:
[Figure: 4-bit ALU built from four cells (A0/B0/result0 through A3/B3/result3) with op, add/subt, less, set, zero and overflow signals, annotated with delays]
When do the result outputs settle at their final values for the inputs:
add/subt = 0, op = 000, A = 1111, B = 0001
Example: cont’d
[Figure: 4-bit ALU built from four cells (A0/B0/result0 through A3/B3/result3) with op, add/subt, less, set, zero and overflow signals, annotated with delays]
When do the result outputs settle at their final values for the inputs:
add/subt = 0, op = 100, A = 1111, B = 0001
Example: cont’d
[Figure: 4-bit ALU built from four cells (A0/B0/result0 through A3/B3/result3) with op, add/subt, less, set, zero and overflow signals, annotated with delays]
When do the result outputs settle at their final values for the inputs:
add/subt = 1, op = 101, A = 1111, B = 0001
What is the zero output for these inputs?
Example: cont’d
[Figure: 4-bit ALU built from four cells (A0/B0/result0 through A3/B3/result3) with op, add/subt, less, set, zero and overflow signals, annotated with delays]
With the ALU design described in class, we assumed that a subtraction operation had to be performed as part of the beq instruction. When do the outputs settle?
Is there a faster alternative?
But What about Performance?
Critical path of n-bit ripple-carry adder is n*CP
Design trick – throw hardware at it (Carry Lookahead)
[Figure: four 1-bit ALUs in a chain; cell i has inputs Ai, Bi, CarryIni and outputs Resulti, CarryOuti (i = 0 to 3)]
Fast carry using “infinite” hardware (Parallel)
cout = b•cin + a•cin + a•b
c1 = (b0+a0)•c0 + a0•b0 = a0•b0 + a0•c0 + b0•c0
c2 = (b1+a1)•c1 + a1•b1
   = (b1+a1)•((b0+a0)•c0 + a0•b0) + a1•b1
   = a1•a0•b0 + a1•a0•c0 + b1•a0•c0 + b1•a0•b0 + a1•b0•c0 + b1•b0•c0 + b1•a1
c3 = a2•a1•a0•b0 + a2•a1•a0•c0 + a2•b1•a0•c0 + a2•b1•a0•b0 + a2•a1•b0•c0 + a2•b1•b0•c0 + a2•b1•a1 + …
…
Outputs settle much faster
- D_c3 = 2*D_and + D_or (best case)
- …
- D_c31 = 5*D_and + D_or (best case)
Problem: Prohibitively expensive
Hierarchical Solution I
- Group 32 bits into 8 4-bit groups
- Within each group, use carry look ahead
- Use the 4-bit group as a building block, and connect the groups in ripple-carry fashion
First Level: Propagate and generate
c(i+1) = (ai•bi) + (ai+bi)•ci
gi = ai•bi
pi = (ai+bi)
c(i+1) = gi + pi•ci
c(i+1) = 1 if gi = 1, or if pi = 1 and ci = 1
c1 = g0+(p0•c0)
c2 = g1+(p1•g0)+(p1•p0•c0)
c3 = g2+(p2•g1)+(p2•p1•g0)+(p2•p1•p0•c0)
c4 = g3+(p3•g2)+(p3•p2•g1)+(p3•p2•p1•g0)+(p3•p2•p1•p0•c0)
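The expanded equations for c1 through c4 can be checked against the recursive form by comparing both on every 4-bit input. The sketch below is a behavioral check only (it says nothing about gate delay), and the function names are hypothetical.

```python
def lookahead_carries(a, b, c0):
    """Return [c1, c2, c3, c4] for 4-bit a, b from the expanded equations."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]  # generate
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]  # propagate
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a, b, c0):
    """Return [c1, c2, c3, c4] using c(i+1) = gi + pi*ci one bit at a time."""
    carries = []
    c = c0
    for i in range(4):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        c = (a_i & b_i) | ((a_i | b_i) & c)
        carries.append(c)
    return carries


def equations_agree():
    """True if both forms give the same carries for all 4-bit inputs."""
    return all(lookahead_carries(a, b, c0) == ripple_carries(a, b, c0)
               for a in range(16) for b in range(16) for c0 in (0, 1))


if __name__ == "__main__":
    print(lookahead_carries(0b1111, 0b0001, 0), equations_agree())
```

The lookahead version computes every carry directly from the g, p and c0 inputs, which is what lets the hardware produce all four carries in parallel.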
Hierarchical Solution I (16 bit)
[Figure: 4-bit carry look-ahead ALUs connected in a chain; ALU0 takes A0-A3, B0-B3 and c0 = carry_in and produces result 0-3; ALU1 takes A4-A7, B4-B7 and c4 = carry_in and produces result 4-7; …]
Delay = 4 * Delay (4-bit carry look-ahead ALU)
Hierarchical Solution II
Hierarchical solution I
- Group 32 bits into 8 4-bit groups
- Within each group, use carry look ahead
- Use the 4-bit group as a building block, and connect the groups in ripple-carry fashion
Hierarchical solution II
- Group 32 bits into 8 4-bit groups
- Within each group, use carry look ahead
- Another level of carry look ahead is used to connect these 4-bit groups
Hierarchical Solution II
[Figure: four 4-bit ALUs (inputs A0-A3/B0-B3, A4-A7/B4-B7, A8-A11/B8-B11, A12-A15/B12-B15; outputs result 0-3, 4-7, 8-11, 12-15) connected to a carry-lookahead unit; signals shown: cin, P0-P3, G0-G3, C1, C2, C3, cout]
Carry-lookahead unit
• input a0-a15, b0-b15
• calculate P0-P3, G0-G3
• calculate C1-C4
• each 4-bit ALU calculates its results
Fast Carry using the second level abstraction
P0 = p3•p2•p1•p0
P1 = p7•p6•p5•p4
P2 = p11•p10•p9•p8
P3 = p15•p14•p13•p12
G0 = g3+(p3•g2)+(p3•p2•g1)+(p3•p2•p1•g0)
G1 = g7+(p7•g6)+(p7•p6•g5)+(p7•p6•p5•g4)
G2 = g11+(p11•g10)+(p11•p10•g9)+(p11•p10•p9•g8)
G3 = g15+(p15•g14)+(p15•p14•g13)+(p15•p14•p13•g12)
C1 = G0+(P0•c0)
C2 = G1+(P1•G0)+(P1•P0•c0)
C3 = G2+(P2•G1)+(P2•P1•G0)+(P2•P1•P0•c0)
C4 = G3+(P3•G2)+(P3•P2•G1)+(P3•P2•P1•G0)+(P3•P2•P1•P0•c0)
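The second-level equations can be exercised on 16-bit operands. The sketch below computes the group signals P and G for each 4-bit group, derives C1 through C4 from them, and compares the result with the carries obtained by ordinary addition. It assumes G3 uses g13 and p13 and that C1 uses G0, and the function names are hypothetical.

```python
def group_pg(a4, b4):
    """Return (P, G) for one 4-bit group of operand bits."""
    g = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    p = [((a4 >> i) & 1) | ((b4 >> i) & 1) for i in range(4)]
    big_p = p[3] & p[2] & p[1] & p[0]
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))
    return big_p, big_g


def second_level_carries(a, b, c0):
    """Return [C1, C2, C3, C4]: the carries out of bits 3, 7, 11 and 15."""
    pg = [group_pg((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF)
          for k in range(4)]
    P = [x[0] for x in pg]
    G = [x[1] for x in pg]
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & c0))
    return [c1, c2, c3, c4]


def carries_by_addition(a, b, c0):
    """Reference: carry out of the low n bits for n = 4, 8, 12, 16."""
    out = []
    for n in (4, 8, 12, 16):
        m = (1 << n) - 1
        out.append(((a & m) + (b & m) + c0) >> n)
    return out


if __name__ == "__main__":
    print(second_level_carries(0x00FF, 0x0001, 0))
    print(carries_by_addition(0x00FF, 0x0001, 0))
```

Each group carry depends only on the P, G signals and c0, so the four 4-bit ALUs do not have to wait for a carry to ripple through their neighbors.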
Shift Operations
Also need operations to pack and unpack 8-bit characters into 32-bit words
Shifts move all the bits in a word left or right
sll $t2, $s0, 8    # $t2 = $s0 << 8 bits
srl $t2, $s0, 8    # $t2 = $s0 >> 8 bits
Such shifts are logical because they fill with zeros
      op      rs     rt     rd     shamt  funct
sll   000000  00000  10000  01010  01000  000000
srl   000000  00000  10000  01010  01000  000010
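The encodings above can be rebuilt from the field values, and the same shifts can pack and unpack characters. In the sketch below the register numbers ($s0 = 16, $t2 = 10) come from the bit patterns on the slide, while the helper names and the choice to number bytes from the least significant end are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF
S0, T2 = 16, 10                      # register numbers of $s0 and $t2
SLL_FUNCT, SRL_FUNCT = 0b000000, 0b000010


def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Assemble the six R-format fields (6, 5, 5, 5, 5, 6 bits) into a word."""
    return ((op << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


def unpack_byte(word, index):
    """Extract byte `index` (0 = least significant) with a shift and a mask."""
    return (word >> (8 * index)) & 0xFF


def pack_byte(word, index, value):
    """Return `word` with byte `index` replaced by `value`."""
    shift = 8 * index
    cleared = word & ~(0xFF << shift)
    return (cleared | ((value & 0xFF) << shift)) & MASK32


if __name__ == "__main__":
    print(f"{encode_r_type(0, S0, T2, 8, SLL_FUNCT):032b}")  # sll $t2, $s0, 8
    print(f"{encode_r_type(0, S0, T2, 8, SRL_FUNCT):032b}")  # srl $t2, $s0, 8
    print(hex(unpack_byte(0x41424344, 1)))
```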
Shift Operations, cont’d
An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left one bit should be 2 times its original value)
- so sra uses the most significant bit (sign bit) as the bit shifted in
- note that there is no need for a sla when using two’s complement number representation
sra $t2, $s0, 8    # $t2 = $s0 >> 8 bits (sign bit shifted in)
      op      rs     rt     rd     shamt  funct
sra   000000  00000  10000  01010  01000  000011
The shift operation is implemented by hardware (usually a barrel shifter) outside the ALU
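The difference between srl and sra shows up only for words whose sign bit is set. The sketch below models both on 32-bit patterns for shift amounts 0 to 31; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def srl(word, shamt):
    """Logical right shift: zeros are shifted in at the top."""
    return (word & MASK32) >> shamt


def sra(word, shamt):
    """Arithmetic right shift: copies of the sign bit are shifted in."""
    word &= MASK32
    shifted = word >> shamt
    if (word >> 31) & 1:
        # Fill the vacated upper `shamt` bits with ones.
        shifted |= (MASK32 << (32 - shamt)) & MASK32
    return shifted


def to_signed(word):
    """Interpret a 32-bit pattern as a two's complement value."""
    return word - (1 << 32) if (word >> 31) & 1 else word


if __name__ == "__main__":
    x = 0xFFFFFF00                       # -256 as a 32-bit pattern
    print(hex(srl(x, 8)), to_signed(srl(x, 8)))
    print(hex(sra(x, 8)), to_signed(sra(x, 8)))
```

For a negative word, srl produces a large positive value, while sra keeps the sign and divides by a power of two (rounding toward negative infinity).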
Multiplication
More complicated than addition
- accomplished via shifting and addition

        0010    (multiplicand)
      x 1011    (multiplier)
        ----
        0010
       0010     (partial product
      0000       array)
     0010
    --------
    00010110    (product)

Double precision product produced
More time and more area to compute
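The shift-and-add procedure from the example can be written out directly. The sketch below handles unsigned operands only and returns the partial products alongside the product so they can be compared with the array on the slide; the function name is hypothetical.

```python
def shift_add_multiply(multiplicand, multiplier, width=4):
    """Unsigned shift-and-add multiply of two width-bit numbers.

    Returns (product, partial_products); the product needs up to
    2 * width bits (double precision).
    """
    product = 0
    partial_products = []
    for i in range(width):
        bit = (multiplier >> i) & 1
        # Each multiplier bit selects either the shifted multiplicand or 0.
        partial = (multiplicand << i) if bit else 0
        partial_products.append(partial)
        product += partial
    return product, partial_products


if __name__ == "__main__":
    product, partials = shift_add_multiply(0b0010, 0b1011)
    for row in partials:
        print(f"{row:08b}")
    print(f"{product:08b}")
```

One addition per multiplier bit is why multiplication costs more time or more area than a single add.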
MIPS Multiply Instruction
mult $s0, $s1    # hi||lo = $s0 * $s1
       op      rs     rt     rd     shamt  funct
mult   000000  10000  10001  00000  00000  011000
Low-order word of the product is left in processor register lo and the high-order word is left in register hi
Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file
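How a 64-bit product is split across hi and lo can be illustrated with a small model of mult. The sketch below treats both operands as signed 32-bit values, as mult does, and returns the two halves as unsigned 32-bit patterns; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(word):
    """Interpret a 32-bit pattern as a two's complement value."""
    word &= MASK32
    return word - (1 << 32) if (word >> 31) & 1 else word


def mult(rs_value, rt_value):
    """Model of signed mult: returns (hi, lo) as 32-bit patterns."""
    product = (to_signed32(rs_value) * to_signed32(rt_value)) & MASK64
    hi = product >> 32          # high-order word, read with mfhi
    lo = product & MASK32       # low-order word, read with mflo
    return hi, lo


if __name__ == "__main__":
    print(mult(3, 5))
    print(mult(0x10000, 0x10000))     # product needs more than 32 bits
    print(mult(0xFFFFFFFF, 2))        # -1 * 2
```

When the product fits in 32 bits, hi holds only the sign extension (all zeros or all ones), so reading lo with mflo is enough.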
Review: MIPS ISA, so far
Category / Instr        Op Code    Example               Meaning
Arithmetic (R & I format)
  add                   0 and 32   add $s1, $s2, $s3     $s1 = $s2 + $s3
  add unsigned          0 and 33   addu $s1, $s2, $s3    $s1 = $s2 + $s3
  subtract              0 and 34   sub $s1, $s2, $s3     $s1 = $s2 - $s3
  subt unsigned         0 and 35   subu $s1, $s2, $s3    $s1 = $s2 - $s3
  add immediate         8          addi $s1, $s2, 6      $s1 = $s2 + 6
  add imm. unsigned     9          addiu $s1, $s2, 6     $s1 = $s2 + 6
  multiply              0 and 24   mult $s1, $s2         hi || lo = $s1 * $s2
  multiply unsigned     0 and 25   multu $s1, $s2        hi || lo = $s1 * $s2
  divide                0 and 26   div $s1, $s2          lo = $s1/$s2, rem. in hi
  divide unsigned       0 and 27   divu $s1, $s2         lo = $s1/$s2, rem. in hi
Logical (R & I format)
  and                   0 and 36   and $s1, $s2, $s3     $s1 = $s2 & $s3
  or                    0 and 37   or $s1, $s2, $s3      $s1 = $s2 | $s3
  xor                   0 and 38   xor $s1, $s2, $s3     $s1 = $s2 xor $s3
  nor                   0 and 39   nor $s1, $s2, $s3     $s1 = !($s2 | $s3)
  and immediate         12         andi $s1, $s2, 6      $s1 = $s2 & 6
  or immediate          13         ori $s1, $s2, 6       $s1 = $s2 | 6
  xor immediate         14         xori $s1, $s2, 6      $s1 = $s2 xor 6
Review: MIPS ISA, so far cont’d
Category / Instr        Op Code    Example               Meaning
Shift (R format)
  sll                   0 and 0    sll $s1, $s2, 4       $s1 = $s2 << 4
  srl                   0 and 2    srl $s1, $s2, 4       $s1 = $s2 >> 4
  sra                   0 and 3    sra $s1, $s2, 4       $s1 = $s2 >> 4
Data Transfer (I format)
  load word             35         lw $s1, 24($s2)       $s1 = Memory($s2+24)
  store word            43         sw $s1, 24($s2)       Memory($s2+24) = $s1
  load byte             32         lb $s1, 25($s2)       $s1 = Memory($s2+25)
  load byte unsigned    36         lbu $s1, 25($s2)      $s1 = Memory($s2+25)
  store byte            40         sb $s1, 25($s2)       Memory($s2+25) = $s1
  load upper imm        15         lui $s1, 6            $s1 = 6 * 2^16
  move from hi          0 and 16   mfhi $s1              $s1 = hi
  move to hi            0 and 17   mthi $s1              hi = $s1
  move from lo          0 and 18   mflo $s1              $s1 = lo
  move to lo            0 and 19   mtlo $s1              lo = $s1
Review: MIPS ISA, so far cont’d
Category / Instr                  Op Code    Example               Meaning
Cond. Branch (I & R format)
  br on equal                     4          beq $s1, $s2, L       if ($s1==$s2) go to L
  br on not equal                 5          bne $s1, $s2, L       if ($s1 !=$s2) go to L
  set on less than                0 and 42   slt $s1, $s2, $s3     if ($s2<$s3) $s1=1 else $s1=0
  set on less than unsigned       0 and 43   sltu $s1, $s2, $s3    if ($s2<$s3) $s1=1 else $s1=0
  set on less than immediate      10         slti $s1, $s2, 6      if ($s2<6) $s1=1 else $s1=0
  set on less than imm. unsigned  11         sltiu $s1, $s2, 6     if ($s2<6) $s1=1 else $s1=0
Uncond. Jump (J & R format)
  jump                            2          j 2500                go to 10000
  jump and link                   3          jal 2500              go to 10000; $ra=PC+4
  jump register                   0 and 8    jr $s1                go to $s1
  jump and link reg               0 and 9    jalr $s1, $s2         go to $s1, $s2=PC+4