1998 Morgan Kaufmann Publishers
• How to measure, report, and summarize performance?
• What factors determine the performance of a computer?
• Critical to purchase and design decisions
– best performance?
– least cost?
– best performance/cost?
• Questions:
Why is some hardware better than others for different programs?
What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
Performance
• Response Time (execution time)
— The time between the start and completion of a task
• Throughput
— The total amount of work done in a given time
• Q: If we replace the processor with a faster one, what do we increase?
A: Response time and throughput
• Q: If we add an additional processor to a system, what do we increase?
A: Throughput
Computer Performance
• For some program running on machine X,
PerformanceX = 1 / Execution timeX
• "X is n times faster than Y"
n = PerformanceX / PerformanceY
• Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?
Answer: n = PerformanceA / PerformanceB
= Execution timeB/Execution timeA = 15/10 = 1.5
A is 1.5 times faster than B.
Book's Definition of Performance
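The definitions above can be checked with a short sketch (Python, illustrative only):

```python
def performance(execution_time):
    # Performance is the reciprocal of execution time.
    return 1.0 / execution_time

def speedup(time_x, time_y):
    # "X is n times faster than Y": n = Performance_X / Performance_Y
    # which equals time_Y / time_X.
    return performance(time_x) / performance(time_y)

# Machine A runs the program in 10 s, machine B in 15 s.
n = speedup(10.0, 15.0)
print(n)  # 1.5 -> A is 1.5 times faster than B
```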
• Elapsed Time, wall-clock time or response time
– counts everything (disk and memory accesses, I/O , etc.)
– a useful number, but often not good for comparison purposes
• CPU time
– doesn't count I/O or time spent running other programs
– can be broken up into system time, and user time
• Our focus: user CPU time
– time spent executing the lines of code that are "in" our program
Execution Time
Clock Cycles
• Instead of reporting execution time in seconds, we often use cycles
• Execution time = # of clock cycles • cycle time
• Clock “ticks” indicate when to start activities (one abstraction):
• cycle time (period) = time between ticks = seconds per cycle
• clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
• A 200 MHz clock has a cycle time of 1 / (200 × 10^6) s = 5 ns

Execution time: seconds/program = cycles/program × seconds/cycle
So, to improve performance (everything else being equal) you can either
– reduce the # of required clock cycles for a program
– decrease the clock period or, said another way,
increase the clock frequency.
How to Improve Performance
seconds/program = cycles/program × seconds/cycle
• Multiplication takes more time than addition
• Floating point operations take longer than integer ones
• Accessing memory takes more time than accessing registers
• Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)
• Another point: the same instruction might require a different number of cycles on a different machine
time
Different numbers of cycles for different instructions
• A program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new technology to substantially increase the clock rate, but this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A. What clock rate should we tell the designer to target?
• Clock cyclesA = 10 s × 400 MHz = 4 × 10^9 cycles
Clock cyclesB = 1.2 × 4 × 10^9 cycles = 4.8 × 10^9 cycles
Execution time = # of clock cycles × cycle time
Clock rateB = Clock cyclesB / Execution timeB
= 4.8 × 10^9 cycles / 6 s = 800 MHz
Example
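The calculation above, worked through in Python (illustrative only):

```python
# Find the clock rate machine B must reach to run the program in 6 s.
time_a = 10.0                 # seconds on machine A
rate_a = 400e6                # 400 MHz clock on machine A
cycles_a = time_a * rate_a    # 4 x 10^9 cycles
cycles_b = 1.2 * cycles_a     # B needs 1.2 times as many cycles
time_b = 6.0                  # target execution time for B
rate_b = cycles_b / time_b    # required clock rate
print(rate_b / 1e6)           # 800.0 (MHz)
```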
• A given program will require
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
• We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– CPI (cycles per instruction) AVERAGE VALUE!
a floating point intensive application might have a higher CPI
– MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
Now that we understand cycles
Performance
• Performance is determined by execution time
• Related variables
– # of cycles to execute program
– # of instructions in program
– # of cycles per second
– average # of cycles per instruction
– average # of instructions per second
• Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.
• Suppose we have two implementations of the same instruction set architecture (ISA). For some program,
Machine A has a clock cycle time of 10 ns and a CPI of 2.0
Machine B has a clock cycle time of 20 ns and a CPI of 1.2
Which machine is faster for this program, and by how much?
• Time per instruction: A: 2.0 × 10 ns = 20 ns
                        B: 1.2 × 20 ns = 24 ns
A is 24/20 = 1.2 times faster
• If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
Answer: # of instructions
CPI Example
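The comparison above as a Python sketch (illustrative only):

```python
def time_per_instruction(cpi, cycle_time_ns):
    # Average execution time per instruction = CPI x clock cycle time.
    return cpi * cycle_time_ns

t_a = time_per_instruction(2.0, 10)   # machine A: 20 ns
t_b = time_per_instruction(1.2, 20)   # machine B: 24 ns
print(t_b / t_a)  # 1.2 -> A is 1.2 times faster for this program
```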
• A compiler designer has two alternatives for a certain code sequence. There are three different classes of instructions: A, B, and C, and they require one, two, and three cycles, respectively.
The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? What are the CPI values?
• Sequence 1: 2*1+1*2+2*3 = 10 cycles; CPI1 = 10 / 5 = 2
• Sequence 2: 4*1+1*2+1*3 = 9 cycles; CPI2 = 9 / 6 = 1.5
• Sequence 2 is faster.
# of Instructions Example
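The cycle counts and CPI values above can be reproduced with a small sketch (Python, illustrative only):

```python
# Cycles per instruction class: A = 1, B = 2, C = 3.
cycles_per_class = {'A': 1, 'B': 2, 'C': 3}

def total_cycles(counts):
    # Sum instruction count x cycles for each class.
    return sum(n * cycles_per_class[c] for c, n in counts.items())

seq1 = {'A': 2, 'B': 1, 'C': 2}   # 5 instructions
seq2 = {'A': 4, 'B': 1, 'C': 1}   # 6 instructions

cycles1, cycles2 = total_cycles(seq1), total_cycles(seq2)
cpi1 = cycles1 / sum(seq1.values())   # 10 / 5 = 2.0
cpi2 = cycles2 / sum(seq2.values())   # 9 / 6 = 1.5
print(cycles1, cycles2, cpi1, cpi2)   # 10 9 2.0 1.5
```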
MIPS
• Million Instructions Per Second
• MIPS = instruction count / (execution time × 10^6)
• MIPS is easy to understand but
– does not take into account the capabilities of the instructions; the instruction counts of different instruction sets differ
– varies between programs even on the same computer
– can vary inversely with performance!
• Two compilers are being tested for a 100 MHz machine with three different classes of instructions: A, B, and C, which require one, two, and three cycles, respectively.
Compiler 1: Compiled code uses 5 million Class A, 1 million Class B, and 1 million Class C instructions.
Compiler 2: Compiled code uses 10 million Class A, 1 million Class B, and 1 million Class C instructions.
• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution time?
MIPS example
• Cycles and instructions
1: 10 million cycles, 7 million instructions
2: 15 million cycles, 12 million instructions
• Execution time = Clock cycles/Clock rate
• Execution time1 = 10 × 10^6 / (100 × 10^6) = 0.1 s
• Execution time2 = 15 × 10^6 / (100 × 10^6) = 0.15 s
• MIPS = Instruction count / (Execution time × 10^6)
• MIPS1 = 7 × 10^6 / (0.1 × 10^6) = 70
• MIPS2 = 12 × 10^6 / (0.15 × 10^6) = 80
MIPS example
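The pitfall in this example, that the higher-MIPS code is also the slower code, can be reproduced directly (Python, illustrative only):

```python
clock_rate = 100e6                          # 100 MHz machine
cycles_per_class = {'A': 1, 'B': 2, 'C': 3}

def stats(millions):
    # millions: instruction counts per class, in millions of instructions
    instructions = sum(millions.values()) * 1e6
    cycles = sum(n * cycles_per_class[c] for c, n in millions.items()) * 1e6
    exec_time = cycles / clock_rate
    mips = instructions / (exec_time * 1e6)
    return exec_time, mips

t1, mips1 = stats({'A': 5, 'B': 1, 'C': 1})    # compiler 1
t2, mips2 = stats({'A': 10, 'B': 1, 'C': 1})   # compiler 2
print(t1, mips1)  # 0.1 s, 70 MIPS
print(t2, mips2)  # 0.15 s, 80 MIPS -> higher MIPS rating, but slower!
```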
• Performance best determined by running a real application
– Use programs typical of expected workload
– Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused
• SPEC (System Performance Evaluation Cooperative)
– companies have agreed on a set of real programs and inputs
– can still be abused
– valuable indicator of performance (and compiler technology)
Benchmarks
SPEC ‘95
Benchmark Description
go        Artificial intelligence; plays the game of Go
m88ksim   Motorola 88k chip simulator; runs test program
gcc       The Gnu C compiler generating SPARC code
compress  Compresses and decompresses file in memory
li        Lisp interpreter
ijpeg     Graphic compression and decompression
perl      Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex    A database program
tomcatv   A mesh generation program
swim      Shallow water model with 513 x 513 grid
su2cor    Quantum physics; Monte Carlo simulation
hydro2d   Astrophysics; hydrodynamic Navier-Stokes equations
mgrid     Multigrid solver in 3-D potential field
applu     Parabolic/elliptic partial differential equations
turb3d    Simulates isotropic, homogeneous turbulence in a cube
apsi      Solves problems regarding temperature, wind velocity, and distribution of pollutant
fpppp     Quantum chemistry
wave5     Plasma physics; electromagnetic particle simulation
SPEC ‘89
• Compiler effects on performance depend on applications.
[Figure: SPEC performance ratio of the base compiler vs. an enhanced compiler on the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv]
SPEC ‘95
Organisational enhancements improve performance.
Doubling the clock rate does not double the performance.
[Figure: SPECint and SPECfp ratings of the Pentium and Pentium Pro at clock rates from 50 to 250 MHz]
Version 1
Execution Time After Improvement =
Execution Time Unaffected +
Execution Time Affected / Amount of Improvement
Version 2
Speedup
= Performance after improvement / Performance before improvement
= Execution time before improvement/ Execution time after improvement
Execution time: before = n + a
                after = n + a/p
(n = unaffected time, a = affected time, p = amount of improvement)
Principle: Make the common case fast
Amdahl's Law
Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
100 s/4 = 80 s/n + 20 s
5 s = 80s/n
n= 80 s/ 5 s = 16
Amdahl's Law
Example:A benchmark program spends half of the time executing floating point instructions.
We improve the performance of the floating point unit by a factor of four.
What is the speedup?
Time before 10s
Time after = 5s + 5s/4 = 6.25 s
Speedup = 10/6.25 = 1.6
Amdahl's Law
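Both Amdahl's Law examples above follow from the same formula, sketched here in Python (illustrative only):

```python
def amdahl_time(time_before, affected, improvement):
    # Execution time after improvement =
    #   unaffected time + affected time / amount of improvement
    return (time_before - affected) + affected / improvement

# Example 1: 100 s total, multiply takes 80 s; speeding multiply up
# by 16x gives 20 + 80/16 = 25 s, i.e. the required 4x overall.
print(amdahl_time(100, 80, 16))       # 25.0

# Example 2: 10 s total, 5 s floating point, FP unit improved 4x.
t_after = amdahl_time(10, 5, 4)       # 5 + 5/4 = 6.25 s
print(10 / t_after)                   # 1.6 overall speedup
```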
Machine Instructions:
• Language of the Machine
• Lowest level of programming; controls the hardware directly
• Assembly instructions are symbolic versions of machine instructions
• More primitive than higher level languages
• Very restrictive
• Programs are stored in memory; instructions are fetched and executed one at a time
• We’ll be working with the MIPS instruction set architecture
MIPS instruction set:
• Load from memory
• Store to memory
• Logic operations
– and, or, negation, shift, ...
• Arithmetic operations
– addition, subtraction, ...
• Branch
Instruction types:
• 1 operand
Jump #address
Jump $register number
• 2 operands
Multiply $reg1, $reg2
• 3 operands
Add $reg1, $reg2, $reg3
MIPS arithmetic
• Instructions have 3 operands
• Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2
$s0, etc. are registers
(associated with variables by compiler)
MIPS arithmetic
• Design Principle 1: simplicity favours regularity.
• Of course this complicates some things...
C code:    A = B + C + D;
           E = F - A;
MIPS code: add $t0, $s1, $s2
           add $s0, $t0, $s3
           sub $s4, $s5, $s0
• Operands must be registers, 32 registers provided
• Design Principle 2: smaller is faster.
Registers vs. Memory
[Figure: the classic components of a computer: the Processor (Control and Datapath), Memory, Input, Output, and the I/O connections between them]
• Arithmetic instructions operands are registers
• Compiler associates variables with registers
• What about programs with lots of variables?
Memory Organization
• Viewed as a large, single-dimension array, with an address.
• A memory address is an index into the array
• "Byte addressing" means that the index points to a byte of memory.
Address   Contents
0         8 bits of data
1         8 bits of data
2         8 bits of data
3         8 bits of data
4         8 bits of data
5         8 bits of data
6         8 bits of data
...
Memory Organization
• Bytes are nice, but most data items use larger "words"
• For MIPS, a word is 32 bits or 4 bytes.
• 2^32 bytes with byte addresses from 0 to 2^32 - 1
• 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
• Words are aligned, i.e., the 2 least significant bits of a word address are equal to 0.
Address   Contents
0         32 bits of data
4         32 bits of data
8         32 bits of data
12        32 bits of data
...
Registers hold 32 bits of data
Load and store instructions
• Example:
C code: A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
           add $t0, $s2, $t0
           sw $t0, 32($s3)
• word offset 8 equals byte offset 32
• Store word has destination last
• Remember arithmetic operands are registers, not memory!
So far we’ve learned:
• MIPS
— loading and storing words but addressing bytes
— arithmetic on registers only
• Instruction Meaning
add $s1, $s2, $s3   $s1 = $s2 + $s3
sub $s1, $s2, $s3   $s1 = $s2 – $s3
lw $s1, 100($s2)    $s1 = Memory[$s2+100]
sw $s1, 100($s2)    Memory[$s2+100] = $s1
• Instructions, like registers and words of data, are also 32 bits long
• Example: add $t0, $s1, $s2
R-type instruction Format:
000000 10001 10010 01000 00000 100000
op rs rt rd shamt funct
op opcode, basic operation
rs 1st source reg.
rt 2nd source reg.
rd destination reg
shamt shift amount
funct function, selects the specific variant of the operation
Machine Language
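The R-type field packing above can be mirrored in Python (illustrative sketch; field widths 6/5/5/5/5/6 bits as shown):

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    # Pack the six R-type fields into one 32-bit instruction word.
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2: op=0, rs=17 ($s1), rt=18 ($s2), rd=8 ($t0),
# shamt=0, funct=32
word = encode_r(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
```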
• Introduce a new type of instruction format
– I-type for data transfer instructions
Example: lw $t0, 32($s2)
35 18 8 32
op rs rt 16 bit number
($t0 is register 8, so rt = 8)
rt destination register
new instruction format but fields 1…3 are the same
• Design principle 3: Good design demands good compromises
Machine Language
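The I-type packing can be sketched the same way (Python, illustrative only):

```python
def encode_i(op, rs, rt, imm):
    # Pack the I-type fields (6/5/5/16 bits) into a 32-bit instruction word.
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 32($s2): op=35, rs=18 ($s2), rt=8 ($t0), 16-bit number=32
word = encode_i(35, 18, 8, 32)
print(hex(word))  # 0x8e480020
```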
• Instructions are groups of bits
• Programs are stored in memory
— to be read or written just like data
• Fetch & Execute Cycle
– Instructions are fetched and put into a special register
– Bits in the register "control" the subsequent actions
– Fetch the “next” instruction and continue
Processor Memory
memory for data, programs, compilers, editors, etc.
Stored Program Concept
• Decision making instructions
– alter the control flow,
– i.e., change the "next" instruction to be executed
• MIPS conditional branch instructions:
bne $t0, $t1, Label # branch if not equal
beq $t0, $t1, Label # branch if equal
• Example (if): if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label: ....
Control
• MIPS unconditional branch instructions:
j Label
• Example (if - then - else):
C code:          MIPS code:
if (i!=j)            beq $s4, $s5, Label1
  h=i+j;             add $s3, $s4, $s5
else                 j Label2
  h=i-j;         Label1: sub $s3, $s4, $s5
                 Label2: ...
Control
• Example (loop):
Loop: ----
i=i+j; if(i!=h) go to Loop
---
• Loop: ---
add $s1, $s1, $s2 #i=i+j
bne $s1, $s3, Loop
---
Control
So far:
• Instruction Meaning
add $s1,$s2,$s3    $s1 = $s2 + $s3
sub $s1,$s2,$s3    $s1 = $s2 – $s3
lw $s1,100($s2)    $s1 = Memory[$s2+100]
sw $s1,100($s2)    Memory[$s2+100] = $s1
bne $s4,$s5,Label  Next instr. is at Label if $s4 ≠ $s5
beq $s4,$s5,Label  Next instr. is at Label if $s4 = $s5
j Label            Next instr. is at Label
• Formats:
R: op | rs | rt | rd | shamt | funct
I: op | rs | rt | 16 bit address
J: op | 26 bit address
• We have: beq, bne, what about Branch-if-less-than?
• New instruction: set on less than
slt $t0, $s1, $s2   # if $s1 < $s2 then $t0 = 1 else $t0 = 0
• slt and bne can be used to implement branch on less than
slt $t0, $s0, $s1
bne $t0, $zero, Less
• Note that the assembler needs a register to do this;
there are register conventions for the MIPS assembly language
• we can now build general control structures
Control Flow
Name       Register number  Usage
$zero      0                the constant value 0
$v0-$v1    2-3              values for results and expression evaluation
$a0-$a3    4-7              arguments
$t0-$t7    8-15             temporaries
$s0-$s7    16-23            saved
$t8-$t9    24-25            more temporaries
$gp        28               global pointer
$sp        29               stack pointer
$fp        30               frame pointer
$ra        31               return address
MIPS Register Convention
• $at, 1 reserved for assembler
• $k0, $k1, 26-27 reserved for operating system
• Procedures and subroutines allow reuse and structuring of code
• Steps
– Place parameters in a place where the procedure can access them
– Transfer control to the procedure
– Acquire the storage needed for the procedure
– Perform the desired task
– Place the results in a place where the calling program can access them
– Return control to the point of origin
Procedure calls
• $a0...$a3 four argument registers for passing parameters
• $v0...$v1 two return value registers
• $ra return address register
• the compiler handles use of the argument and return value registers
• the machine handles the control passing mechanism
• jump and link instruction: jal ProcAddress
– saves return address (PC+4) in $ra (Program Counter holds the address of the current instruction)
– loads ProcAddress in PC
• return jump: jr $ra
– loads return address in PC
Register assignments for procedure calls
• Used if four argument registers and two return value registers are not enough or if nested subroutines (a subroutine calls another one) are used
• Can also contain temporary data
• The stack is a last-in-first-out structure in the memory
• Stack pointer ($sp) points at the top of the stack
• Push and pop instructions
• MIPS stack grows from higher addresses to lower addresses
Stack
[Figure: a stack in memory; elements are pushed in and popped out at the top, SP points to the top element, and the stack grows from the bottom toward lower addresses]
Stack and Stack Pointer
• Small constants are used quite frequently, e.g., A = A + 5;
B = B - 1;
• Solution 1: put constants in memory and load them
To add a constant to a register:
lw $t0, AddrConstant($zero)
add $sp,$sp,$t0
• Solution 2: to avoid extra instructions keep the constant inside the instruction itself
addi $29, $29, 4   # i means immediate
slti $8, $18, 10
andi $29, $29, 6
• Design principle 4: Make the common case fast.
Constants
• We'd like to be able to load a 32 bit constant into a register
• Must use two instructions, new "load upper immediate" instruction
lui $t0, 1010101010101010

$t0: 1010101010101010 0000000000000000   (lower 16 bits filled with zeros)

• Then must get the lower order bits right, i.e.,

ori $t0, $t0, 1010101010101010

      1010101010101010 0000000000000000
ori   0000000000000000 1010101010101010
    = 1010101010101010 1010101010101010
How about larger constants?
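The two-step lui/ori construction can be mirrored in Python (illustrative sketch of the bit manipulation, not a real assembler):

```python
def lui(value16):
    # lui: place a 16-bit constant in the upper half; lower half is zeros.
    return (value16 & 0xFFFF) << 16

def ori(reg, value16):
    # ori: OR a zero-extended 16-bit constant into the register.
    return reg | (value16 & 0xFFFF)

pattern = 0b1010101010101010
reg = lui(pattern)          # 1010...1010 0000...0000
reg = ori(reg, pattern)     # 1010...1010 1010...1010
print(f"{reg:032b}")  # 10101010101010101010101010101010
```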
• simple instructions all 32 bits wide
• very structured, no unnecessary baggage
• only three instruction formats
• rely on compiler to achieve performance— what are the compiler's goals?
• help compiler where we can
R: op | rs | rt | rd | shamt | funct
I: op | rs | rt | 16 bit address
J: op | 26 bit address
Overview of MIPS
• Instructions:
bne $t4,$t5,Label   Next instruction is at Label if $t4 ≠ $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
j Label Next instruction is at Label
• Formats:
• Addresses are not 32 bits — How do we handle this with load and store instructions?
I: op | rs | rt | 16 bit address
J: op | 26 bit address
Addresses in Branches and Jumps
• Instructions:
bne $t4,$t5,Label   Next instruction is at Label if $t4 ≠ $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
• Formats:
• Could specify a register (like lw and sw) and add it to address– use Instruction Address Register (PC = program counter)– most branches are local (principle of locality)
• Jump instructions just use high order bits of PC – address boundaries of 256 MB
I: op | rs | rt | 16 bit address
Addresses in Branches
• Register addressing
– operand in a register
• Base or displacement addressing
– operand in the memory
– address is the sum of a register and a constant in the instruction
• Immediate addressing
– operand is a constant within the instruction
• PC-relative addressing
– address is the sum of the PC and a constant in the instruction
– used e.g. in branch instructions
• Pseudodirect addressing
– jump address is the 26 bits of the instruction concatenated with the upper bits of the PC
• Additional addressing modes in other computers
MIPS addressing mode summary
[Figure: the five MIPS addressing modes]
1. Immediate addressing: op | rs | rt | Immediate; the operand is a field of the instruction
2. Register addressing: op | rs | rt | rd | ... | funct; the operand is a register
3. Base addressing: op | rs | rt | Address; Address + register selects a byte, halfword, or word in memory
4. PC-relative addressing: op | rs | rt | Address; Address + PC selects a word in memory
5. Pseudodirect addressing: op | Address; Address concatenated with the upper bits of PC selects a word in memory
MIPS addressing mode summary
To summarize: MIPS operands

Name          Example                              Comments
32 registers  $s0-$s7, $t0-$t9, $zero, $a0-$a3,    Fast locations for data. In MIPS, data must be in registers to perform
              $v0-$v1, $gp, $fp, $sp, $ra, $at     arithmetic. MIPS register $zero always equals 0. Register $at is
                                                   reserved for the assembler to handle large constants.
2^30 memory   Memory[0], Memory[4], ...,           Accessed only by data transfer instructions. MIPS uses byte addresses, so
words         Memory[4294967292]                   sequential words differ by 4. Memory holds data structures, such as arrays,
                                                   and spilled registers, such as those saved on procedure calls.
MIPS assembly language

Category       Instruction              Example               Meaning                                Comments
Arithmetic     add                      add $s1, $s2, $s3     $s1 = $s2 + $s3                        Three operands; data in registers
               subtract                 sub $s1, $s2, $s3     $s1 = $s2 - $s3                        Three operands; data in registers
               add immediate            addi $s1, $s2, 100    $s1 = $s2 + 100                        Used to add constants
Data transfer  load word                lw $s1, 100($s2)      $s1 = Memory[$s2 + 100]                Word from memory to register
               store word               sw $s1, 100($s2)      Memory[$s2 + 100] = $s1                Word from register to memory
               load byte                lb $s1, 100($s2)      $s1 = Memory[$s2 + 100]                Byte from memory to register
               store byte               sb $s1, 100($s2)      Memory[$s2 + 100] = $s1                Byte from register to memory
               load upper immediate     lui $s1, 100          $s1 = 100 × 2^16                       Loads constant in upper 16 bits
Conditional    branch on equal          beq $s1, $s2, 25      if ($s1 == $s2) go to PC + 4 + 100     Equal test; PC-relative branch
branch         branch on not equal      bne $s1, $s2, 25      if ($s1 != $s2) go to PC + 4 + 100     Not equal test; PC-relative
               set on less than         slt $s1, $s2, $s3     if ($s2 < $s3) $s1 = 1; else $s1 = 0   Compare less than; for beq, bne
               set less than immediate  slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1; else $s1 = 0   Compare less than constant
Unconditional  jump                     j 2500                go to 10000                            Jump to target address
jump           jump register            jr $ra                go to $ra                              For switch, procedure return
               jump and link            jal 2500              $ra = PC + 4; go to 10000              For procedure call
• Assembly provides convenient symbolic representation
– much easier than writing down numbers
– e.g., destination first
• Machine language is the underlying reality
– e.g., destination is no longer first
• Assembly can provide 'pseudoinstructions'
– e.g., “move $t0, $t1” exists only in Assembly
– would be implemented using “add $t0,$t1,$zero”
• When considering performance you should count real instructions
Assembly Language vs. Machine Language
• Design alternative:
– provide more powerful operations than found in MIPS
– goal is to reduce number of instructions executed
– danger is a slower cycle time and/or a higher CPI
• Sometimes referred to as “RISC vs. CISC”
– Reduced Instruction Set Computers
– Complex Instruction Set Computers
– virtually all new instruction sets since 1982 have been RISC
Alternative Architectures
Reduced Instruction Set Computers
• Common characteristics of all RISCs
– Single cycle issue
– Small number of fixed length instruction formats
– Load/store architecture
– Large number of registers
• Additional characteristics of most RISCs
– Small number of instructions
– Small number of addressing modes
– Fast control unit
An alternative architecture: 80x86
• 1978: The Intel 8086 is announced (16 bit architecture)
• 1980: The 8087 floating point coprocessor is added
• 1982: The 80286 increases address space to 24 bits, +instructions
• 1985: The 80386 extends to 32 bits, new addressing modes
• 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions(mostly designed for higher performance)
• 1997: MMX is added
• Intel had a 16-bit microprocessor two years before its competitors’ more elegant architectures, which led to the selection of the 8086 as the CPU for the IBM PC
• “This history illustrates the impact of the “golden handcuffs” of compatibility”
“an architecture that is difficult to explain and impossible to love”
A dominant architecture: 80x86
• See your textbook for a more detailed description
• Complexity:
– Instructions from 1 to 17 bytes long
– one operand must act as both a source and destination
– one operand can come from memory
– complex addressing modes, e.g., “base or scaled index with 8 or 32 bit displacement”
• Saving grace:
– the most frequently used architectural components are not too difficult to implement
– compilers avoid the portions of the architecture that are slow
• Instruction complexity is only one variable
– lower instruction count vs. higher CPI / lower clock rate
• Design Principles:
– simplicity favours regularity
– smaller is faster
– good design demands good compromises
– make the common case fast
• Instruction set architecture
– a very important abstraction indeed!
Summary
Arithmetic
• Where we've been:
– Performance (seconds, cycles, instructions)
– Abstractions: Instruction Set Architecture Assembly Language and Machine Language
• What's up ahead:
– Implementing the Architecture
Arithmetic
• We start with the Arithmetic Logic Unit
[Figure: an ALU with two 32-bit inputs a and b, an operation control input, and a 32-bit result output]
• Bits are just bits (no inherent meaning)
— conventions define relationship between bits and numbers
• Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ...
decimal: 0 ... 2^n - 1
• Of course it gets more complicated:
numbers are finite (overflow)
fractions and real numbers
negative numbers
• How do we represent negative numbers?
i.e., which bit patterns will represent which numbers?
• Octal and hexadecimal numbers
• Floating-point numbers
Numbers
Sign Magnitude    One's Complement    Two's Complement
000 = +0          000 = +0            000 = +0
001 = +1          001 = +1            001 = +1
010 = +2          010 = +2            010 = +2
011 = +3          011 = +3            011 = +3
100 = -0          100 = -3            100 = -4
101 = -1          101 = -2            101 = -3
110 = -2          110 = -1            110 = -2
111 = -3          111 = -0            111 = -1
• Issues: balance, number of zeros, ease of operations.
• Two’s complement is best.
Possible Representations of Signed Numbers
• 32 bit signed numbers:
0000 0000 0000 0000 0000 0000 0000 0000 (two) = 0 (ten)
0000 0000 0000 0000 0000 0000 0000 0001 (two) = +1 (ten)
0000 0000 0000 0000 0000 0000 0000 0010 (two) = +2 (ten)
...
0111 1111 1111 1111 1111 1111 1111 1110 (two) = +2,147,483,646 (ten)
0111 1111 1111 1111 1111 1111 1111 1111 (two) = +2,147,483,647 (ten)   <- maxint
1000 0000 0000 0000 0000 0000 0000 0000 (two) = –2,147,483,648 (ten)   <- minint
1000 0000 0000 0000 0000 0000 0000 0001 (two) = –2,147,483,647 (ten)
1000 0000 0000 0000 0000 0000 0000 0010 (two) = –2,147,483,646 (ten)
...
1111 1111 1111 1111 1111 1111 1111 1101 (two) = –3 (ten)
1111 1111 1111 1111 1111 1111 1111 1110 (two) = –2 (ten)
1111 1111 1111 1111 1111 1111 1111 1111 (two) = –1 (ten)
MIPS
• Negating a two's complement number: invert all bits and add 1
– Remember: “Negate” and “invert” are different operations.
You negate a number but invert a bit.
• Converting n bit numbers into numbers with more than n bits:
– MIPS 16 bit immediate gets converted to 32 bits for arithmetic
– copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
– "sign extension"
– MIPS load byte instructions
lbu: no sign extension
lb: sign extension
Two's Complement Operations
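Sign extension as described above can be sketched in Python (illustrative helper; the generalized bit widths are an assumption, the slide's examples extend 4 bits to 8):

```python
def sign_extend(value, from_bits, to_bits):
    # Copy the most significant (sign) bit into the extra high-order bits.
    sign = value & (1 << (from_bits - 1))
    if sign:  # negative: fill the new upper bits with ones
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value

print(f"{sign_extend(0b0010, 4, 8):08b}")  # 00000010
print(f"{sign_extend(0b1010, 4, 8):08b}")  # 11111010
```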
• Just like in grade school (carry/borrow 1s):
    0111        0111        0110
  + 0110      - 0110      - 0101
  ------      ------      ------
    1101        0001        0001
• Two's complement operations easy
– subtraction using addition of negative numbers:
    0111
  + 1010
  ------
   10001   (the carry out is discarded, leaving 0001)
• Overflow (result too large for finite computer word):
– e.g., adding two n-bit numbers does not yield an n-bit number:
    0111
  + 0001
  ------
    1000
Addition & Subtraction
• No overflow when adding a positive and a negative number
• No overflow when signs are the same for subtraction
• Overflow occurs when the value affects the sign:
– overflow when adding two positives yields a negative
– or, adding two negatives gives a positive
– or, subtract a negative from a positive and get a negative
– or, subtract a positive from a negative and get a positive
• Consider the operations A + B, and A – B
– Can overflow occur if B is 0 ? No.
– Can overflow occur if A is 0 ? Yes.
Detecting Overflow
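The sign-based overflow rule above can be sketched as (Python, illustrative only):

```python
def add_with_overflow(a, b, bits=32):
    # Two's complement add; overflow occurs iff the operands share a sign
    # and the result's sign differs from it.
    mask = (1 << bits) - 1
    result = (a + b) & mask
    sign = 1 << (bits - 1)
    overflow = (a & sign) == (b & sign) and (a & sign) != (result & sign)
    return result, overflow

# 4-bit example from the previous slide: 0111 + 0001 = 1000 (reads as -8).
res, ovf = add_with_overflow(0b0111, 0b0001, bits=4)
print(f"{res:04b}", ovf)  # 1000 True
```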
• An exception (interrupt) occurs
– Control jumps to predefined address for exception
– Interrupted address is saved for possible resumption
• Details based on software system / language
– example: flight control vs. homework assignment
• Don't always want to detect overflow
— new MIPS instructions: addu, addiu, subu
note: addiu still sign-extends!
note: sltu, sltiu for unsigned comparisons
Effects of Overflow
• and, andi: bit-by-bit AND
• or, ori: bit-by-bit OR
• sll: shift left logical
• srl: shift right logical
• 0101 1010
shifting left two steps gives 0110 1000
• 0110 1010
shifting right three bits gives 0000 1101
Logical Operations
• Let's build a logical unit to support the and and or instructions
– we'll just build a 1 bit unit, and use 32 of them
– op=0: and; op=1: or
• Possible Implementation (sum-of-products):
res = a • b + a • op + b • op
b
a
operation
result
Logical unit
op  a  b  res
0   0  0  0
0   0  1  0
0   1  0  0
0   1  1  1
1   0  0  0
1   0  1  1
1   1  0  1
1   1  1  1
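A quick exhaustive check (Python, illustrative only) that the sum-of-products expression implements the truth table:

```python
def res(op, a, b):
    # res = a.b + a.op + b.op; op=0 selects AND, op=1 selects OR.
    return (a & b) | (a & op) | (b & op)

for op in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            expected = (a & b) if op == 0 else (a | b)
            assert res(op, a, b) == expected
print("sum-of-products matches AND/OR")
```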
• Selects one of the inputs to be the output, based on a control input
• IEC symbol of a 4-input MUX: [figure omitted]
• Let's build our logical unit using a MUX:
[Figure: a 2-input MUX whose Operation line selects (a AND b) as input 0 or (a OR b) as input 1 to produce Result]
Review: The Multiplexor
• Not easy to decide the “best” way to build something
– Don't want too many inputs to a single gate
– Don’t want to have to go through too many gates
– For our purposes, ease of comprehension is important
– We use multiplexors
• Let's look at a 1-bit ALU for addition:
• How could we build a 1-bit ALU for AND, OR and ADD?
• How could we build a 32-bit ALU?
Different Implementations
cout = a·b + a·cin + b·cin
sum = a xor b xor cin
[Figure: a 1-bit adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
Building a 32 bit ALU for AND, OR and ADD
[Figure: a 1-bit ALU whose Operation line selects AND (0), OR (1), or the adder result (2), and a 32-bit ALU built from 32 of them, ALU0 through ALU31, with the CarryOut of each stage feeding the CarryIn of the next]
We need a 4-input MUX.
• Two's complement approach: just negate b and add.
• A clever solution:
• In a multiple bit ALU the least significant CarryIn has to be equal to 1 for subtraction.
What about subtraction (a – b) ?
[Figure: a 1-bit ALU extended with a Binvert control line; a MUX selects either b or its inverse as the adder's second input]
• Need to support the set-on-less-than instruction (slt)
– remember: slt is an arithmetic instruction
– produces a 1 if rs < rt and 0 otherwise
– use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, $t7)
– use subtraction: (a-b) = 0 implies a = b
Tailoring the ALU to the MIPS
Supporting slt
• Other ALUs (a.): each 1-bit ALU gains a fourth MUX data input, Less, selected when Operation = 3
• Most significant ALU (b.): additionally produces Set (the adder output of the top bit, i.e. the sign bit) and an Overflow output from overflow detection
[Figure: the two 1-bit ALU variants, each with Binvert, CarryIn, Operation, inputs a and b, and the Less input on MUX position 3]
[Figure: 32 cascaded 1-bit ALUs (ALU0..ALU31); the Set output of ALU31 is routed back to the Less input of ALU0, the Less inputs of ALU1..ALU31 are tied to 0, ALU31 also produces Overflow, and Binvert with the initial CarryIn controls subtraction]
32-bit ALU supporting slt
a < b if and only if a - b < 0; thus Set is the sign bit of the result.
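The feedback path can be expressed as a Python sketch (function name invented; note that, as in the raw hardware, the plain sign-bit test misbehaves when a - b overflows):

```python
def slt(a, b, bits=32):
    # subtract, then route the sign bit (Set) back to Less of bit 0
    diff = (a - b) % (1 << bits)         # two's complement a - b
    return (diff >> (bits - 1)) & 1      # Set = sign bit of the result

assert slt(3, 5) == 1    # 3 < 5, so result is 1
assert slt(5, 3) == 0
assert slt(7, 7) == 0
```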
Final ALU including test for equality
• Notice control lines:
000 = and
001 = or
010 = add
110 = subtract
111 = slt
• Note: Zero is 1 when the result is zero!
[Figure: the final 32-bit ALU. Control lines: Bnegate (driving Binvert and the initial CarryIn) and the 2-bit Operation. ALU31 produces Set (fed back to Less of ALU0) and Overflow; a Zero output is asserted when all Result bits are 0]
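The control encoding above can be sketched behaviourally in Python (names invented; Bnegate is the top control bit, Operation the low two):

```python
def alu(control, a, b, bits=32):
    # control: 000 and, 001 or, 010 add, 110 subtract, 111 slt
    mask = (1 << bits) - 1
    bnegate, op = control >> 2, control & 0b11
    bv = (~b & mask) if bnegate else (b & mask)
    if op == 0b00:
        result = a & bv
    elif op == 0b01:
        result = a | bv
    else:
        s = (a + bv + bnegate) & mask            # CarryIn = Bnegate
        result = s if op == 0b10 else (s >> (bits - 1)) & 1   # slt: sign bit
    zero = int(result == 0)                      # Zero = 1 when result is zero
    return result, zero

assert alu(0b010, 5, 7) == (12, 0)   # add
assert alu(0b110, 9, 9) == (0, 1)    # subtract; Zero = 1 on equality (beq)
assert alu(0b111, 2, 6) == (1, 0)    # slt: 2 < 6
```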
Conclusion
• We can build an ALU to support the MIPS instruction set
– key idea: use multiplexor to select the output we want
– we can efficiently perform subtraction using two’s complement
– we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
– all of the gates are always working
– the speed of a gate is affected by the number of inputs to the gate
– the speed of a circuit is affected by the number of gates in series (on the “critical path” or the “deepest level of logic”)
• Our primary focus: comprehension, however,
– clever changes to organization can improve performance
(similar to using better algorithms in software)
– we'll look at examples for addition, multiplication and division
• A 32-bit ALU is much slower than a 1-bit ALU.
• There is more than one way to do addition.
– the two extremes: ripple carry and sum-of-products
Can you see the ripple? How could you get rid of it?
c1 = b0c0 + a0c0 + a0b0
c2 = b1c1 + a1c1 + a1b1 c2 = c2(a0,b0,c0,a1,b1)
c3 = b2c2 + a2c2 + a2b2 c3 = c3(a0,b0,c0,a1,b1,a2,b2)
c4 = b3c3 + a3c3 + a3b3 c4 = c4(a0,b0,c0,a1,b1,a2,b2,a3,b3)
Not feasible! Too many inputs to the gates.
Problem: ripple carry adder is slow
• An approach in-between the two extremes
• Motivation:
– If we didn't know the value of carry-in, what could we do?
– When would we always generate a carry? gi = ai bi
– When would we propagate the carry? pi = ai + bi
– Look at the truth table!
• Did we get rid of the ripple?
c1 = g0 + p0c0
c2 = g1 + p1c1 c2 = g1+p1g0+p1p0c0
c3 = g2 + p2c2 c3 = g2+p2g1+p2p1g0+p2p1p0c0
c4 = g3 + p3c3 c4 = ...
Feasible! A smaller number of inputs to the gates.
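The generate/propagate recurrence can be sketched in Python (function name invented; in hardware each ci+1 is expanded into the flat two-level form shown above rather than computed iteratively):

```python
def cla_carries(a, b, c0, n=4):
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # gi = ai.bi
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]   # pi = ai + bi
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))                  # ci+1 = gi + pi.ci
    return c

# exhaustive cross-check: c4 must equal the true carry out of a + b + c0
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            assert cla_carries(a, b, c0)[4] == (a + b + c0) >> 4
```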
Carry-lookahead adder
1-bit adder
a b cin cout sum
a b cin | cout sum
0 0  0  |  0    0
0 0  1  |  0    1
0 1  0  |  0    1
0 1  1  |  1    0
1 0  0  |  0    1
1 0  1  |  1    0
1 1  0  |  1    0
1 1  1  |  1    1
• Can’t build a 16-bit CLA adder (too big)
• Could use ripple carry of 4-bit CLA adders
• Better: use the CLA principle again!
Principle shown in the figure. See textbook for details.
Use principle to build bigger adders
[Figure: a 16-bit adder built from four 4-bit ALUs (ALU0..ALU3 producing Result0-3, Result4-7, Result8-11, Result12-15 from a0/b0..a15/b15); each block supplies its Pi and Gi to a second-level carry-lookahead unit that computes C1..C4 from the block generate/propagate signals]
• More complicated than addition
– can be accomplished via shifting and addition
• More time and more area
• Let's look at 2 versions based on the grammar school algorithm
    0010   (multiplicand)
  x 1011   (multiplier)
    ----
    0010
   0010
  0000
 0010
 -------
 0010110   (product = 22)
• Negative numbers: easy way: convert to positive and multiply
– there are better techniques
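The grammar-school method above translates directly into shift-and-add code (a Python sketch; operands assumed unsigned, function name invented):

```python
def multiply(multiplicand, multiplier, bits=32):
    product = 0
    for _ in range(bits):
        if multiplier & 1:               # test the low multiplier bit
            product += multiplicand      # add the (shifted) multiplicand
        multiplicand <<= 1               # shift multiplicand left
        multiplier >>= 1                 # shift multiplier right
    return product

assert multiply(0b0010, 0b1011) == 0b0010110   # the worked example: 2 x 11 = 22
```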
Multiplication
Multiplication, First Version
[Flowchart: Start → 1. Test Multiplier0. If Multiplier0 = 1: 1a. Add multiplicand to product and place the result in the Product register. 2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit. Repeat until the 32nd repetition, then Done.]
[Datapath: 64-bit Multiplicand register (shift left), 64-bit ALU, 64-bit Product register (write), 32-bit Multiplier register (shift right), control test]
Multiplication, Final Version
[Datapath: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), control test]
[Flowchart: Start → 1. Test Product0. If Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register. 2. Shift the Product register right 1 bit. Repeat until the 32nd repetition, then Done.]
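The final version keeps everything in one double-width product register whose right half initially holds the multiplier; a Python sketch (names invented, unsigned operands assumed):

```python
def multiply_final(multiplicand, multiplier, bits=32):
    mask = (1 << 2 * bits) - 1
    product = multiplier                     # right half holds the multiplier
    for _ in range(bits):
        if product & 1:                      # test Product0
            product += multiplicand << bits  # add into the left half
        product = (product >> 1) & mask      # shift the whole register right
    return product

assert multiply_final(0b0010, 0b1011) == 22
assert multiply_final(123456, 7890) == 123456 * 7890
```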
Booth’s Algorithm
• The grammar school method was implemented using addition and shifting
• Booth’s algorithm also uses subtraction
• Based on two bits of the multiplier either add, subtract or do nothing; always shift
• Handles two’s complement numbers
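The rule "based on two bits of the multiplier either add, subtract or do nothing" can be sketched in Python (function name invented; the multiplicand is a signed Python int, the multiplier a bits-wide two's complement pattern):

```python
def booth_multiply(a, b, bits=8):
    # Inspect multiplier bit pairs (b[i], b[i-1]):
    #   1,0 -> subtract the shifted multiplicand (start of a run of 1s)
    #   0,1 -> add it (end of a run of 1s)
    #   0,0 or 1,1 -> do nothing; always shift
    product, prev = 0, 0
    for i in range(bits):
        cur = (b >> i) & 1
        if cur == 1 and prev == 0:
            product -= a << i
        elif cur == 0 and prev == 1:
            product += a << i
        prev = cur
    # if the top bit is 1, the final add is omitted, which is exactly
    # the two's complement (negative) interpretation of b
    return product

assert booth_multiply(2, 0b1011, bits=4) == -10   # 2 x (-5)
assert booth_multiply(3, 0b0110, bits=4) == 18    # 3 x 6
```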
Fast multipliers
• Combinational implementations
– Conventional multiplier algorithm
• partial products with AND gates
• adders
– Lots of modifications
• Sequential implementations
– Pipelined multiplier
• registers between levels of logic
• result delayed
• effective speed of multiple multiplications increased
Four-Bit Binary Multiplication
Multiplicand                                  B3   B2   B1   B0
Multiplier                                  x A3   A2   A1   A0
1st partial product                         A0B3 A0B2 A0B1 A0B0
2nd partial product                    A1B3 A1B2 A1B1 A1B0
3rd partial product               A2B3 A2B2 A2B1 A2B0
4th partial product        + A3B3 A3B2 A3B1 A3B0
Final product           P7   P6   P5   P4   P3   P2   P1   P0
Classical Implementation
[Figure: classical combinational implementation: AND gates form the four 4-bit partial products PP1..PP4, which a tree of adders (4-bit and 6-bit buses) sums into the product P7:0]
Pipelined Multiplier
[Figure: pipelined multiplier with registers, clocked by Clk, inserted between the levels of adder logic]
Division
• Simple method:
– Initialise the remainder with the dividend
– Start from most significant end
– Subtract divisor from the remainder if possible (quotient bit 1)
– Shift divisor to the right and repeat
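The simple method above is restoring division; a Python sketch (function name invented, unsigned operands assumed, divisor nonzero):

```python
def divide(dividend, divisor, bits=32):
    # Start from the most significant end; subtract the shifted divisor
    # from the remainder when possible, recording a quotient bit of 1.
    quotient, remainder = 0, dividend
    for i in range(bits - 1, -1, -1):
        if remainder >= divisor << i:
            remainder -= divisor << i
            quotient |= 1 << i
    return quotient, remainder

assert divide(22, 5) == (4, 2)       # 22 = 4 x 5 + 2
assert divide(1000, 7) == (142, 6)
```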
Division, First Version
[Datapath: 64-bit Divisor register (shift right), 64-bit ALU, 64-bit Remainder register (write), 32-bit Quotient register (shift left), control test]
Division, Final Version
• Same hardware can be used for multiply and divide.
[Datapath: 32-bit Divisor register, 32-bit ALU, 64-bit Remainder register (shift left, shift right, write), control test]
Floating Point (a brief look)
• We need a way to represent
– numbers with fractions, e.g., 3.1416
– very small numbers, e.g., .000000001
– very large numbers, e.g., 3.15576 × 10^9
• Representation:
– sign, exponent, significand: (–1)^sign × significand × 2^exponent
– more bits for significand gives more accuracy
– more bits for exponent increases range
• IEEE 754 floating point standard:
– single precision: 8 bit exponent, 23 bit significand
– double precision: 11 bit exponent, 52 bit significand
IEEE 754 floating-point standard
• Leading “1” bit of significand is implicit
• Exponent is “biased” to make sorting easier
– all 0s is smallest exponent, all 1s is largest
– bias of 127 for single precision and 1023 for double precision
– summary: (–1)^sign × (1 + significand) × 2^(exponent - bias)
• Example:
– decimal: -.75 = -3/4 = -3/2^2
– binary: -.11 = -1.1 × 2^-1
– floating point: exponent = 126 = 01111110
– IEEE single precision: 1 01111110 10000000000000000000000
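The example can be cross-checked against the machine's own IEEE 754 single-precision encoding via Python's struct module:

```python
import struct

# encode -0.75 as a big-endian single and read the raw bit pattern back
bits = struct.unpack(">I", struct.pack(">f", -0.75))[0]

assert bits == 0b10111111010000000000000000000000
assert bits >> 31 == 1                    # sign
assert (bits >> 23) & 0xFF == 126         # biased exponent: -1 + 127
assert bits & 0x7FFFFF == 1 << 22         # fraction .100...0 (leading 1 implicit)
```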
Floating-point addition
1. Shift the significand of the number with the lesser exponent right until the exponents match
2. Add the significands
3. Normalise the sum, checking for overflow or underflow
4. Round the sum
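The four steps can be illustrated with a toy Python model (everything here is invented for the sketch: positive operands only, 4-bit significands of the form 1.fff stored as integers 8..15, truncation standing in for rounding):

```python
def fp_add(exp_a, sig_a, exp_b, sig_b, p=4):
    # 1. shift the significand with the lesser exponent right until they match
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    sig_b >>= exp_a - exp_b            # bits shifted out are dropped (step 4)
    # 2. add the significands
    s, exp = sig_a + sig_b, exp_a
    # 3. normalise the sum: keep the significand in [2**(p-1), 2**p)
    while s >= 1 << p:
        s >>= 1
        exp += 1
    return exp, s

# 1.100 x 2^1 (= 3.0) + 1.000 x 2^0 (= 1.0) = 1.000 x 2^2 (= 4.0)
assert fp_add(1, 0b1100, 0, 0b1000) == (2, 0b1000)
```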
Floating-point multiplication
1. Add the exponents
2. Multiply the significands
3. Normalise the product, checking for overflow or underflow
4. Round the product
5. Determine the sign of the product
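The same toy model (invented representation: positive operands, 4-bit significands 1.fff stored as integers 8..15) sketches the multiplication steps:

```python
def fp_mul(exp_a, sig_a, exp_b, sig_b, p=4):
    exp = exp_a + exp_b                  # 1. add the exponents
    s = (sig_a * sig_b) >> (p - 1)       # 2. multiply significands, rescale
    while s >= 1 << p:                   # 3. normalise the product
        s >>= 1
        exp += 1
    return exp, s                        # 4. rounding here is truncation;
                                         # 5. sign would be sign_a XOR sign_b

# 1.100 x 2^1 (= 3.0) times 1.100 x 2^0 (= 1.5) = 1.001 x 2^2 (= 4.5)
assert fp_mul(1, 0b1100, 0, 0b1100) == (2, 0b1001)
```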
Floating Point Complexities
• Operations are somewhat more complicated (see text)
• In addition to overflow we can have “underflow”
• Accuracy can be a big problem
– IEEE 754 keeps two extra bits during intermediate calculations,
guard and round
– four rounding modes
– positive divided by zero yields “infinity”
– zero divided by zero yields “not a number”
– other complexities
• Implementing the standard can be tricky
Chapter Four Summary
• Computer arithmetic is constrained by limited precision
• Bit patterns have no inherent meaning but standards do exist
– two’s complement
– IEEE 754 floating point
• Computer instructions determine “meaning” of the bit patterns
• Performance and accuracy are important so there are many complexities in real machines (i.e., algorithms and implementation).
• We are ready to move on (and implement the processor)