CSE 331. Computer Organizationyyzhang/fall03/notes/331-week7.pdf · 2003. 10. 19. · 331 W07.4...

331 W07.1 Fall 2003

14:332:331 Computer Architecture and Assembly Language

Fall 2003

Week 7

[Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]

331 W07.3 Fall 2003

Heads Up

This week’s material

MIPS logic and multiply instructions
- Reading assignment – PH 4.4

MIPS ALU design
- Reading assignment – PH 4.5

Next week’s material

Building a MIPS datapath
- Reading assignment – PH 5.1-5.2

331 W07.4 Fall 2003

Review: MIPS Arithmetic Instructions

R-type:   op | Rs | Rt | Rd | funct

I-Type:   op | Rs | Rt | Immed 16

(field boundaries marked at bits 31, 25, 20, 15, 5, 0)

expand immediates to 32 bits before ALU

10 operations so can encode in 4 bits

[Figure: ALU with 32-bit inputs A and B, 4-bit control input m (operation), 32-bit result, and 1-bit outputs zero and ovf]

m (operation) encodings:

0 add     1 addu    2 sub     3 subu
4 and     5 or      6 xor     7 nor
a slt     b sltu

Type   op   funct
ADD    00   100000
ADDU   00   100001
SUB    00   100010
SUBU   00   100011
AND    00   100100
OR     00   100101
XOR    00   100110
NOR    00   100111
       00   101000
       00   101001
SLT    00   101010
SLTU   00   101011
       00   101100

331 W07.5 Fall 2003

Review: A 32-bit Adder/Subtractor

Built out of 32 full adders (FAs)

[Figure: 32 1-bit FAs in a ripple-carry chain; FA i takes Ai, Bi and ci and produces Si and ci+1, with c0 = carry_in, c32 = carry_out, and an add/subt control input]

1-bit FA (inputs A, B, carry_in; outputs S, carry_out):

S = A xor B xor carry_in

carry_out = A∧B v A∧carry_in v B∧carry_in (majority function)

Small but slow!
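The ripple-carry structure can be sketched in Python as a behavioral model. The function names are illustrative, and the way add/subt is wired (invert each B bit and set c0 = 1 for subtraction) is an assumption based on the usual two's complement trick, not something the slide spells out.

```python
def full_adder(a, b, carry_in):
    """1-bit full adder using the slide's equations; returns (S, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)  # majority function
    return s, carry_out


def ripple_add_sub(a, b, subtract=0, width=32):
    """Chain `width` full adders; returns (result, carry_out).

    Assumption: subtract=1 inverts every B bit and feeds 1 into c0,
    so the chain computes A + (not B) + 1 = A - B.
    """
    carry = subtract  # c0 = carry_in
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract
        s_i, carry = full_adder(a_i, b_i, carry)  # carry ripples to bit i+1
        result |= s_i << i
    return result, carry
```

For example, `ripple_add_sub(7, 3)` gives `(10, 0)` and `ripple_add_sub(7, 3, subtract=1)` gives `(4, 1)`. The loop makes the "slow" part visible: bit i cannot finish until the carry from bit i-1 arrives.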

331 W07.6 Fall 2003

Minimal Implementation of a Full Adder

Gate library: inverters, 2-input nands, or-and-inverters

architecture concurrent_behavior of full_adder is
  signal t1, t2, t3, t4, t5: std_logic;
begin
  t1   <= not A after 1 ns;                           -- inverter
  t2   <= not cin after 1 ns;                         -- inverter
  t4   <= not((A or cin) and B) after 2 ns;           -- or-and-inverter
  t3   <= not((t1 or t2) and (A or cin)) after 2 ns;  -- or-and-inverter: A xnor cin
  t5   <= t3 nand B after 2 ns;                       -- 2-input nand
  S    <= not((B or t3) and t5) after 2 ns;           -- A xor B xor cin
  cout <= not((t1 or t2) and t4) after 2 ns;          -- majority of A, B, cin
end concurrent_behavior;

Can you create the equivalent schematic? Can you determine worst case delay (the worst case timing path through the circuit)?
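One way to check an answer to the delay question is a small Python model of the seven signal assignments. This is a sketch under assumptions: the `cout` line is read as `not((t1 or t2) and t4)`, all inputs arrive at t = 0, and each signal settles at its gate delay plus the latest arrival among its inputs (a topological longest path, which ignores cases where an output happens to settle earlier for particular data). The helper names are hypothetical.

```python
def full_adder_timing(A, B, cin):
    """Evaluate the gate network; return (values, settle_times_in_ns)."""
    val = {"A": A, "B": B, "cin": cin}
    t = {"A": 0, "B": 0, "cin": 0}  # assumption: inputs are stable at t = 0

    def gate(name, value, delay, inputs):
        val[name] = value
        t[name] = delay + max(t[i] for i in inputs)

    def inv(x):
        return 1 - x

    gate("t1", inv(A), 1, ["A"])
    gate("t2", inv(cin), 1, ["cin"])
    gate("t4", inv((A | cin) & B), 2, ["A", "cin", "B"])
    gate("t3", inv((val["t1"] | val["t2"]) & (A | cin)), 2,
         ["t1", "t2", "A", "cin"])
    gate("t5", inv(val["t3"] & B), 2, ["t3", "B"])
    gate("S", inv((B | val["t3"]) & val["t5"]), 2, ["B", "t3", "t5"])
    gate("cout", inv((val["t1"] | val["t2"]) & val["t4"]), 2,
         ["t1", "t2", "t4"])
    return val, t
```

Under these assumptions the longest path is t1 (or t2), t3, t5, S, so `t["S"]` is 7 ns, while `t["cout"]` is 4 ns.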

331 W07.7 Fall 2003

Logic Operations

Logic operations operate on individual bits of the operands.

$t2 = 0…0 0000 1101 0000
$t1 = 0…0 0011 1100 0000

and $t0, $t1, $t2    $t0 =

or  $t0, $t1, $t2    $t0 =

xor $t0, $t1, $t2    $t0 =

nor $t0, $t1, $t2    $t0 =

How do we expand our FA design to handle the logic operations - and, or, xor, nor?
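The four bitwise results for the operands above can be checked with a short Python sketch. The function names are illustrative; the 32-bit mask is needed because Python integers are unbounded, so `nor` has to be trimmed to a register width explicitly.

```python
MASK32 = 0xFFFFFFFF  # MIPS registers are 32 bits wide

t1 = 0b0011_1100_0000  # $t1 from the slide (upper bits are 0)
t2 = 0b0000_1101_0000  # $t2 from the slide (upper bits are 0)


def mips_and(a, b):
    return a & b


def mips_or(a, b):
    return a | b


def mips_xor(a, b):
    return a ^ b


def mips_nor(a, b):
    # Python's ~ yields a negative int, so mask back to 32 bits
    return ~(a | b) & MASK32


for name, op in [("and", mips_and), ("or", mips_or),
                 ("xor", mips_xor), ("nor", mips_nor)]:
    print(f"{name:3s} $t0 = {op(t1, t2):032b}")
```

Note that only `nor` sets the upper bits of `$t0`, since every position where both operands are 0 becomes 1.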

331 W07.8 Fall 2003

A Simple ALU Cell

[Figure: ALU cell built around a 1-bit FA; signals A, B, carry_in, carry_out, add/subt, op, result]

331 W07.9 Fall 2003

An Alternative ALU Cell

[Figure: ALU cell built around a 1-bit FA; signals A, B, carry_in, carry_out, select inputs s2, s1, s0, result]

331 W07.10 Fall 2003

The Alternative ALU Cell’s Control Codes

s2  s1  s0  c_in   result       function
0   0   0   0      A            transfer A
0   0   0   1      A + 1        increment A
0   0   1   0      A + B        add
0   0   1   1      A + B + 1    add with carry
0   1   0   0      A – B – 1    subt with borrow
0   1   0   1      A – B        subtract
0   1   1   0      A – 1        decrement A
0   1   1   1      A            transfer A
1   0   0   x      A or B       or
1   0   1   x      A xor B      xor
1   1   0   x      A and B      and
1   1   1   x      !A           complement A
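The control codes can be modeled at word level in Python. This is a sketch, not the slide's circuit: it assumes that for s2 = 0 the pair (s1, s0) selects what is added to A (0, B, not B, or all ones), which reproduces every arithmetic row of the table, and that for s2 = 1 the carry input is ignored. The function name and the 4-bit default width are illustrative.

```python
def alt_alu(a, b, s2, s1, s0, c_in, width=4):
    """Word-level model of the alternative ALU cell's control codes."""
    mask = (1 << width) - 1
    sel = (s1 << 1) | s0
    if s2 == 0:
        # Arithmetic rows. Assumed second operand: 0, B, not B, all ones
        operand = [0, b, ~b & mask, mask][sel]
        return (a + operand + c_in) & mask
    # Logic rows (c_in is a don't-care): or, xor, and, complement A
    return [a | b, a ^ b, a & b, ~a & mask][sel]
```

For instance, with A = 5 and B = 3, code 0101 (s2 s1 s0 c_in) computes 5 + not(3) + 1 = 2, the "subtract" row, and code 0110 computes 5 + 1111 = 4, the "decrement A" row.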

331 W07.11 Fall 2003

Tailoring the ALU to the MIPS ISA

Need to support the set-on-less-than instruction (slt)

remember: slt is an arithmetic instruction

produces a 1 if rs < rt and 0 otherwise

use subtraction: (a - b) < 0 implies a < b

Need to support test for equality (beq)

use subtraction: (a - b) = 0 implies a = b

Need to add the overflow detection hardware
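The two subtraction tricks can be sketched in Python on 32-bit register patterns. The function names are illustrative. One assumption goes beyond the slide: the sign of (a - b) alone is only reliable when the subtraction does not overflow, so this sketch XORs the sign bit with an overflow flag to stay correct for all inputs.

```python
MASK32 = 0xFFFFFFFF


def slt(a, b):
    """Return 1 if a < b as signed 32-bit values, else 0.

    a and b are unsigned 32-bit patterns (0 .. 0xFFFFFFFF).
    """
    diff = (a - b) & MASK32
    sign = diff >> 31
    a_sign, b_sign = a >> 31, b >> 31
    # a - b overflows when the operand signs differ and the
    # result's sign differs from a's sign
    overflow = (a_sign ^ b_sign) & (a_sign ^ sign)
    return sign ^ overflow


def beq_taken(a, b):
    """beq compares by subtracting and testing for an all-zero result."""
    return ((a - b) & MASK32) == 0
```

For example, `slt(3, 5)` is 1, `slt(0xFFFFFFFF, 1)` is 1 because the first pattern is -1, and `beq_taken(7, 7)` is `True`.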

331 W07.12 Fall 2003

Modifying the ALU Cell for slt

[Figure: ALU cell built around a 1-bit FA with an added less input; signals A, B, less, carry_in, carry_out, add/subt, op, result]

331 W07.13 Fall 2003

Modifying the ALU for slt

First perform a subtraction

Make the result 1 if the subtraction yields a negative result

Make the result 0 if the subtraction yields a non-negative result

[Figure: chain of ALU cells for bits 0 to 31, each with inputs Ai, Bi and less, producing result0 … result31]

331 W07.14 Fall 2003

Modifying the ALU for Zero

First perform subtraction

Insert additional logic to detect when all result bits are zero

[Figure: chain of ALU cells for bits 0 to 31 with signals Ai, Bi, less, set, add/subt, op, result0 … result31]

331 W07.15 Fall 2003

Review: Overflow Detection

Overflow: the result is too large to represent in the number of bits allocated

Overflow occurs when
adding two positives yields a negative
or, adding two negatives gives a positive
or, subtract a negative from a positive gives a negative
or, subtract a positive from a negative gives a positive

On your own: Prove you can detect overflow by: Carry into MSB xor Carry out of MSB

    0 1 1 1       7            1 1 0 0      –4
  + 0 0 1 1     + 3          + 1 0 1 1     – 5
    -------                    -------
    1 0 1 0     – 6            0 1 1 1       7

  carry into MSB = 1           carry into MSB = 0
  carry out of MSB = 0         carry out of MSB = 1
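The carry-based overflow rule can be checked exhaustively for 4-bit operands with a short Python sketch. It is a check by enumeration, not the proof the slide asks for, and the function names are illustrative.

```python
def add_with_carries(a, b, width=4):
    """Return (result, carry_into_msb, carry_out_of_msb) for a + b."""
    mask = (1 << width) - 1
    low_mask = mask >> 1  # every bit below the MSB
    carry_into_msb = ((a & low_mask) + (b & low_mask)) >> (width - 1)
    total = (a & mask) + (b & mask)
    return total & mask, carry_into_msb, total >> width


def to_signed(x, width=4):
    """Interpret a width-bit pattern as a two's complement number."""
    return x - (1 << width) if x & (1 << (width - 1)) else x


def overflow(a, b, width=4):
    """Overflow flag = carry into MSB xor carry out of MSB."""
    _, carry_in, carry_out = add_with_carries(a, b, width)
    return carry_in ^ carry_out


# Compare the flag with the true signed sum for all 4-bit pairs
for a in range(16):
    for b in range(16):
        true_sum = to_signed(a) + to_signed(b)
        assert overflow(a, b) == int(not -8 <= true_sum <= 7)
```

The two slide examples come out as expected: `add_with_carries(0b0111, 0b0011)` returns `(0b1010, 1, 0)` and `add_with_carries(0b1100, 0b1011)` returns `(0b0111, 0, 1)`, so both raise the overflow flag.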

331 W07.16 Fall 2003

Modifying the ALU for Overflow

Modify the most significant cell to determine overflow output setting

Disable overflow bit setting for unsigned arithmetic

[Figure: chain of ALU cells for bits 0 to 31 with signals Ai, Bi, less, set, add/subt, op, result0 … result31, plus the zero and overflow outputs]

331 W07.17 Fall 2003

Example: When do the result outputs settle at their final values for the inputs:

add/subt = 0
op = 000
A = 1111
B = 0001

[Figure: 4-bit ALU built from four cells, with signals A0-A3, B0-B3, less, set, zero, overflow, add/subt, op and result0-result3, shown with timing annotations]

331 W07.18 Fall 2003

Example: cont’d

When do the result outputs settle at their final values for the inputs:

add/subt = 0
op = 100
A = 1111
B = 0001

[Figure: 4-bit ALU built from four cells, with signals A0-A3, B0-B3, less, set, zero, overflow, add/subt, op and result0-result3, shown with timing annotations]

331 W07.19 Fall 2003

Example: cont’d

When do the result outputs settle at their final values for the inputs:

add/subt = 1
op = 101
A = 1111
B = 0001

What is the zero output of these inputs?

[Figure: 4-bit ALU built from four cells, with signals A0-A3, B0-B3, less, set, zero, overflow, add/subt, op and result0-result3, shown with timing annotations]

331 W07.20 Fall 2003

Example: cont’d

With the ALU design described in class, we assumed that a subtraction operation had to be performed as part of the beq instruction. When do the outputs settle?

Is there a faster alternative?

[Figure: 4-bit ALU built from four cells, with signals A0-A3, B0-B3, less, set, zero, overflow, add/subt, op and result0-result3, shown with timing annotations]

331 W07.21 Fall 2003

But What about Performance?

Critical path of n-bit ripple-carry adder is n*CP

Design trick – throw hardware at it (Carry Lookahead)

[Figure: four 1-bit ALUs in a ripple-carry chain; ALU i has inputs Ai, Bi, CarryIni and outputs Resulti, CarryOuti, for i = 0 to 3]

331 W07.22 Fall 2003

Fast carry using “infinite” hardware (Parallel)

cout = b • cin + a • cin + a • b

c1 = (b0+a0)•c0 + a0•b0
   = a0•b0 + a0•c0 + b0•c0

c2 = (b1+a1)•c1 + a1•b1
   = (b1+a1)•((b0+a0)•c0 + a0•b0) + a1•b1
   = a1•a0•b0 + a1•a0•c0 + b1•a0•c0 + b1•a0•b0 + a1•b0•c0 + b1•b0•c0 + b1•a1

c3 = a2•a1•a0•b0 + a2•a1•a0•c0 + a2•b1•a0•c0 + a2•b1•a0•b0 + a2•a1•b0•c0 + a2•b1•b0•c0 + a2•b1•a1 + …

…

Outputs settle much faster

D_c3 = 2* D_and + D_or (best case)
…
D_c31 = 5 *D_and + D_or (best case)

Problem: Prohibitively expensive

331 W07.23 Fall 2003

Hierarchical Solution I

Group 32 bits into 8 4-bit groups
Within each group, use carry look ahead
Use the 4-bit groups as building blocks, and connect them in ripple carry fashion.

331 W07.24 Fall 2003

First Level: Propagate and generate

ci+1 = (ai•bi)+(ai+bi)•ci
gi = ai•bi
pi = (ai+bi)

ci+1 = 1 if
  gi = 1, or
  pi and ci = 1

c1 = g0+(p0•c0)
c2 = g1+(p1•g0)+(p1•p0•c0)
c3 = g2+(p2•g1)+(p2•p1•g0)+(p2•p1•p0•c0)
c4 = g3+(p3•g2)+(p3•p2•g1)+(p3•p2•p1•g0)+(p3•p2•p1•p0•c0)

ci+1 = gi + pi•ci
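The propagate/generate equations can be checked against a plain ripple computation with a Python sketch. The function names are illustrative. `cla4_carries` evaluates c1 to c4 directly from g, p and c0, as the lookahead logic would, while `ripple_carries` applies ci+1 = gi + pi•ci one bit at a time.

```python
def cla4_carries(a, b, c0):
    """Return [c1, c2, c3, c4] for 4-bit a, b from the lookahead equations."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # gi = ai•bi
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]  # pi = ai+bi
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a, b, c0):
    """Same carries, computed serially with ci+1 = gi + pi•ci."""
    carries, c = [], c0
    for i in range(4):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        c = (a_i & b_i) | ((a_i | b_i) & c)
        carries.append(c)
    return carries
```

Both functions agree for every 4-bit input. The difference is in hardware depth: each lookahead carry is a two-level AND/OR expression of the inputs, whereas the ripple version depends on the previous carry.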

331 W07.25 Fall 2003

Hierarchical Solution I (16 bit)

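[Figure: 4-bit carry look-ahead ALUs connected in a ripple-carry chain; ALU0 takes A0-A3, B0-B3 and c0 = carry_in and produces result 0-3; ALU1 takes A4-A7, B4-B7 and c4 = carry_in and produces result 4-7; …]

Delay = 4 * Delay (4-bit carry look-ahead ALU)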

331 W07.26 Fall 2003

Hierarchical Solution II

Hierarchical solution I
  Group 32 bits into 8 4-bit groups
  Within each group, use carry look ahead
  Use 4-bit as a building block, and connect them in ripple carry fashion.

Hierarchical solution II
  Group 32 bits into 8 4-bit groups
  Within each group, use carry look ahead
  Another level of carry look ahead is used to connect these 4-bit groups

331 W07.27 Fall 2003

Hierarchical Solution II

•input a0-a15, b0-b15

•calculate P0-P3, G0-G3

•Calculate C1-C4

•each 4-bit ALU calculates its results

[Figure: four 4-bit ALUs (inputs A0-A3/B0-B3 through A12-A15/B12-B15, outputs result 0-3 through result 12-15) pass P0-P3 and G0-G3 to a carry-lookahead unit, which takes cin and produces C1, C2, C3 and cout]

331 W07.28 Fall 2003

Fast Carry using the second level abstraction

P0 = p3.p2.p1.p0
P1 = p7.p6.p5.p4
P2 = p11.p10.p9.p8
P3 = p15.p14.p13.p12

G0 = g3+(p3.g2)+(p3.p2.g1)+(p3.p2.p1.g0)
G1 = g7+(p7.g6)+(p7.p6.g5)+(p7.p6.p5.g4)
G2 = g11+(p11.g10)+(p11.p10.g9)+(p11.p10.p9.g8)
G3 = g15+(p15.g14)+(p15.p14.g13)+(p15.p14.p13.g12)

C1 = G0+(P0•c0)
C2 = G1+(P1•G0)+(P1•P0•c0)
C3 = G2+(P2•G1)+(P2•P1•G0)+(P2•P1•P0•c0)
C4 = G3+(P3•G2)+(P3•P2•G1)+(P3•P2•P1•G0)+(P3•P2•P1•P0•c0)
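The second-level equations can be exercised with a Python sketch for a 16-bit adder. The function names are illustrative, and the code follows the regular pattern of the equations: G3 is taken over bits 12 to 15 and C1 uses G0. The group carries C1 to C4 are the carries into bits 4, 8, 12 and 16.

```python
def group_pg(a4, b4):
    """Group propagate P and generate G for one 4-bit slice."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    return P, G


def cla16_group_carries(a, b, c0):
    """Return [C1, C2, C3, C4] for 16-bit a, b using group P and G."""
    P, G = zip(*(group_pg((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF)
                 for k in range(4)))
    C1 = G[0] | (P[0] & c0)
    C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    C3 = (G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
          | (P[2] & P[1] & P[0] & c0))
    C4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & c0))
    return [C1, C2, C3, C4]
```

For example, `cla16_group_carries(0xFFFF, 0x0001, 0)` returns `[1, 1, 1, 1]`: group 0 generates a carry and groups 1 to 3 propagate it, with no group waiting on a rippled carry from its neighbor.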

331 W07.29 Fall 2003

Shift Operations

Also need operations to pack and unpack 8-bit characters into 32-bit words

Shifts move all the bits in a word left or right

sll $t2, $s0, 8 #$t2 = $s0 << 8 bits

srl $t2, $s0, 8 #$t2 = $s0 >> 8 bits

Such shifts are logical because they fill with zeros

      op      rs     rt     rd     shamt  funct
sll   000000  00000  10000  01010  01000  000000
srl   000000  00000  10000  01010  01000  000010
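The field layout can be made concrete with a small Python encoder. The function and constant names are hypothetical; the field widths (6, 5, 5, 5, 5, 6 bits) and the register numbers ($s0 = 16, $t2 = 10) match the bit patterns shown above.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields into one 32-bit instruction word."""
    return ((op << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


S0, T2 = 16, 10  # register numbers of $s0 and $t2

# sll $t2, $s0, 8: rs is unused (0), rt is the source, rd the destination
sll_word = encode_r_type(0, 0, S0, T2, 8, 0b000000)
# srl $t2, $s0, 8: same fields, funct = 2
srl_word = encode_r_type(0, 0, S0, T2, 8, 0b000010)

print(f"{sll_word:032b}")  # the six fields of the sll row, concatenated
print(f"{srl_word:032b}")
```

Read as hexadecimal, the two words are `0x00105200` and `0x00105202`; they differ only in the funct field.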

331 W07.30 Fall 2003

Shift Operations, cont’d

An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value)

so sra uses the most significant bit (sign bit) as the bit shifted in

note that there is no need for a sla when using two’s complement number representation

sra $t2, $s0, 8 #$t2 = $s0 >> 8 bits

The shift operation is implemented by hardware (usually a barrel shifter) outside the ALU

      op      rs     rt     rd     shamt  funct
sra   000000  00000  10000  01010  01000  000011
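The difference between logical and arithmetic shifts, and the character packing use case, can be sketched in Python. The function names mirror the MIPS mnemonics but are otherwise illustrative; values are 32-bit patterns held in ordinary Python integers, so every result is masked to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def sll(x, n):
    """Shift left logical: zeros enter on the right, high bits fall off."""
    return (x << n) & MASK32


def srl(x, n):
    """Shift right logical: zeros enter on the left."""
    return (x & MASK32) >> n


def sra(x, n):
    """Shift right arithmetic: copies of the sign bit enter on the left."""
    x &= MASK32
    if x & 0x80000000:
        return ((x >> n) | (MASK32 << (32 - n))) & MASK32
    return x >> n


# Pack four 8-bit characters into one word, then unpack the second one
word = 0
for ch in b"ABCD":
    word = sll(word, 8) | ch
second = srl(word, 16) & 0xFF
```

After the loop, `word` is `0x41424344` and `second` is `0x42` (the character B). For a negative value, `sra(0xFFFFFFF8, 1)` gives `0xFFFFFFFC`, that is -8 becomes -4, whereas `srl` on the same input would give the large positive pattern `0x7FFFFFFC`.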

331 W07.31 Fall 2003

Multiplication

More complicated than addition
accomplished via shifting and addition

        0010   (multiplicand)
      x 1011   (multiplier)
        ----
        0010
       0010    (partial product
      0000      array)
     0010
    --------
    00010110   (product)

Double precision product produced

More time and more area to compute
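The partial product array corresponds to a shift-and-add loop, sketched here in Python for unsigned operands. The function name is illustrative, and signed operands would need extra handling that this sketch leaves out.

```python
def shift_add_multiply(multiplicand, multiplier, width=4):
    """Unsigned multiply: add a shifted multiplicand for each 1 bit
    of the multiplier. The product needs up to 2 * width bits."""
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:
            product += multiplicand << i  # partial product for bit i
        # a 0 bit contributes an all-zero partial product
    return product
```

With the slide's operands, `shift_add_multiply(0b0010, 0b1011)` returns `0b00010110`, that is 2 x 11 = 22. The largest 4-bit case, 15 x 15 = 225, needs all 8 product bits, which is why the product is double precision.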

331 W07.32 Fall 2003

MIPS Multiply Instruction

mult $s0, $s1 # hi||lo = $s0 * $s1

Low-order word of the product is left in processor register lo and the high-order word is left in register hi

Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file

op      rs     rt     rd     shamt  funct
000000  10000  10001  00000  00000  011000
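How the 64-bit product splits across hi and lo can be sketched in Python. The function names are illustrative, and the sketch models the signed `mult` only: operands are 32-bit patterns interpreted as two's complement values.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement value."""
    return x - (1 << 32) if x & 0x80000000 else x


def mult(rs, rt):
    """Return (hi, lo): the two 32-bit halves of the signed product."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64
    return product >> 32, product & MASK32
```

For small operands hi is just the sign extension: `mult(2, 11)` returns `(0, 22)` and `mult(0xFFFFFFFF, 2)`, which is -1 x 2, returns `(0xFFFFFFFF, 0xFFFFFFFE)`. When the product no longer fits in 32 bits, hi carries real data: `mult(0x10000, 0x10000)` returns `(1, 0)`.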

331 W07.33 Fall 2003

Review: MIPS ISA, so far

Category         Instr               Op Code    Example               Meaning
Arithmetic       add                 0 and 32   add $s1, $s2, $s3     $s1 = $s2 + $s3
(R & I format)   add unsigned        0 and 33   addu $s1, $s2, $s3    $s1 = $s2 + $s3
                 subtract            0 and 34   sub $s1, $s2, $s3     $s1 = $s2 - $s3
                 subt unsigned       0 and 35   subu $s1, $s2, $s3    $s1 = $s2 - $s3
                 add immediate       8          addi $s1, $s2, 6      $s1 = $s2 + 6
                 add imm. unsigned   9          addiu $s1, $s2, 6     $s1 = $s2 + 6

Logical          and                 0 and 36   and $s1, $s2, $s3     $s1 = $s2 & $s3
(R & I format)   and immediate       12         andi $s1, $s2, 6      $s1 = $s2 & 6
                 or                  0 and 37   or $s1, $s2, $s3      $s1 = $s2 | $s3
                 or immediate        13         ori $s1, $s2, 6       $s1 = $s2 | 6
                 nor                 0 and 39   nor $s1, $s2, $s3     $s1 = !($s2 | $s3)
                 xor                 0 and 38   xor $s1, $s2, $s3     $s1 = $s2 xor $s3
                 xor immediate       14         xori $s1, $s2, 6      $s1 = $s2 xor 6

                 multiply            0 and 24   mult $s1, $s2         hi || lo = $s1 * $s2
                 multiply unsigned   0 and 25   multu $s1, $s2        hi || lo = $s1 * $s2
                 divide              0 and 26   div $s1, $s2          lo = $s1/$s2, rem. in hi
                 divide unsigned     0 and 27   divu $s1, $s2         lo = $s1/$s2, rem. in hi

331 W07.34 Fall 2003

Review: MIPS ISA, so far cont’d

Category         Instr                Op Code    Example             Meaning
Shift            sll                  0 and 0    sll $s1, $s2, 4     $s1 = $s2 << 4
(R format)       srl                  0 and 2    srl $s1, $s2, 4     $s1 = $s2 >> 4
                 sra                  0 and 3    sra $s1, $s2, 4     $s1 = $s2 >> 4

                 move from hi         0 and 16   mfhi $s1            $s1 = hi
                 move to hi           0 and 17   mthi $s1            hi = $s1
                 move from lo         0 and 18   mflo $s1            $s1 = lo
                 move to lo           0 and 19   mtlo $s1            lo = $s1

Data Transfer    load word            35         lw $s1, 24($s2)     $s1 = Memory($s2+24)
(I format)       store word           43         sw $s1, 24($s2)     Memory($s2+24) = $s1
                 load byte            32         lb $s1, 25($s2)     $s1 = Memory($s2+25)
                 load byte unsigned   36         lbu $s1, 25($s2)    $s1 = Memory($s2+25)
                 store byte           40         sb $s1, 25($s2)     Memory($s2+25) = $s1
                 load upper imm       15         lui $s1, 6          $s1 = 6 * 2^16

331 W07.35 Fall 2003

Review: MIPS ISA, so far cont’d

Category         Instr                            Op Code    Example              Meaning
Cond. Branch     br on equal                      4          beq $s1, $s2, L      if ($s1==$s2) go to L
(I & R format)   br on not equal                  5          bne $s1, $s2, L      if ($s1 !=$s2) go to L
                 set on less than                 0 and 42   slt $s1, $s2, $s3    if ($s2<$s3) $s1=1 else $s1=0
                 set on less than immediate       10         slti $s1, $s2, 6     if ($s2<6) $s1=1 else $s1=0
                 set on less than unsigned        0 and 43   sltu $s1, $s2, $s3   if ($s2<$s3) $s1=1 else $s1=0
                 set on less than imm. unsigned   11         sltiu $s1, $s2, 6    if ($s2<6) $s1=1 else $s1=0

Uncond. Jump     jump                             2          j 2500               go to 10000
(J & R format)   jump register                    0 and 8    jr $s1               go to $s1
                 jump and link                    3          jal 2500             go to 10000; $ra=PC+4
                 jump and link reg                0 and 9    jalr $s1, $s2        go to $s1, $s2=PC+4
