CO&ISA, NLT 2013
Computer Organization and Instruction Set Architecture
Ngo Lam Trung Department of Computer Engineering
School of Information and Communication Technology (SoICT)
Hanoi University of Science and Technology
E-mail: [email protected]
Course administration
Instructor: Ngo Lam Trung 505 B1, SoICT, HUST
TA: Pham Ngoc Hung
Text: [Required] Computer Organization and Design, 4th edition revised printing Patterson & Hennessy 2012.
[Optional] Computer Organization and Architecture, 8th Edition, William Stallings
Slides: hard copy (how to print?); PDF on class website (URL TBA)
Schedule: as in timetable
Self introduction
Ngo Lam Trung
Current position
2004~: Lecturer, Department of Computer Engineering, School of Information and Communication Technology, Hanoi University of Science and Technology.
2012~: Visiting researcher, Shibaura Institute of Technology, Tokyo, Japan.
2013~: Head of Computer Systems Laboratory, School of Information and Communication Technology, Hanoi University of Science and Technology.
Educational background
MSc in Information Processing and Communication, Hanoi University of Technology, 2006.
PhD in Functional Control Systems, Shibaura Institute of Technology, Japan, 2012.
CA & ISA 2013-2014 in USTH
CA & ISA 2013-2014 characteristics
Class of approx. 80 students (big class)
Students are all freshmen (little background in ICT)
Students are from all 6 undergrad. programs (different majors)
The lack of traditional prerequisites: digital logic, HDL
Grading information
Grading criteria
Attendance/Attitude 10%
Practical exercise(s) 10%
Assignment(s) 10%
Mid-term test (Jan 14th) 20%
Final exam (TBA) 50%
Assignments (no late submission)
Dec 27th due Dec 29th
Jan 9th due Jan 12th
Jan 21st due Jan 24th
Other requirements
No music/video/surfing in class
No copy and paste from unreferenced sources
Why do you need this course?
Of course, it's in the USTH curriculum!
More than that, this course is helpful for you
Software developer: to write better programs
Hardware designer: to make better computers
User: to know which computer is best suited to your work
Expected outcomes: understanding of
Organization and architecture of modern computer.
Relationship between hardware and software.
Factors for computer performance evaluation.
Course content
Chapter 1: Introduction
1.1 Background
1.2 Computer Abstraction and Technology
1.3 Performance Evaluation
Chapter 2: Instruction Set Architecture
2.1 Overview
2.2 MIPS instruction set
2.3 MIPS organization
Practice and assignment 1
Course content
Chapter 3: Computer Arithmetic
3.1 Integer arithmetic
3.2 Floating point arithmetic
Chapter 4: CPU
4.1 Introduction
4.2 Simple CPU implementation
4.3 Enhancing performance with pipelining
Practice and assignment 2
Course content
Chapter 5: Memory
5.1 Memory hierarchy
5.2 Cache
5.3 Virtual memory
Chapter 6: I/O System
Practice and assignment 3
Textbook and references
1. [Textbook] Computer Organization and Design, 4th Edition, Patterson & Hennessy, MK Pub, 2008.
2. Computer Architecture: From Microprocessors to Supercomputers, Behrooz Parhami, Oxford Univ. Press, New York, 2005.
3. Computer Organization and Architecture, 7th Edition, William Stallings, Prentice Hall International, 2006.
Acknowledgement: These slides use material from the three books above
Prerequisites - What You Should Know
Basic PC organization
What do computer parts look like?
Basic logic design & machine organization
logical minimization, FSMs, component design
Create, compile, and run C/C++ programs
Create, run, debug programs in an assembly language
What if all you know is MS Word, VLC, NFS, FB?
Try harder!
Chapter 1: Introduction
1. Background: Basic logic design
2. Computer Abstraction and Technology
3. Performance Evaluation
1. Background: Basic logic design
What's inside a computer?
What is the most basic building block of a computer?
The first basic thing: computer systems are built from digital logic circuits
Background: Basic logic design
Reviewing topics
Signal, logic operators, Boolean functions
Combinational circuits (memory-less)
Sequential circuits (with memory)
Signal, logic operators, Boolean function
Signal:
Physical: electric signal (usually a voltage) running in electronic circuits
Logic: states of the corresponding physical signal
Computers use binary logic: 2 logic levels
Logical high: binary 1 (bit 1)
Logical low: binary 0 (bit 0)
Optional: high impedance state (output device is not controlled)
Example:
TTL: logical low = 0-0.8 V, logical high = 2-5 V
Signal, logic operators, Boolean function
Logic gates: accept one or several logic inputs and produce one logic output
Some basic logic gates
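As a rough illustration of what the basic gates compute, the sketch below models each gate as a Python function on 0/1 values and lists its truth table. The function names (`and_gate`, `truth_table`, and so on) are illustrative choices, not taken from the slides.

```python
from itertools import product

# Each gate maps 0/1 inputs to a single 0/1 output.
def not_gate(a):
    return 1 - a

def and_gate(a, b):
    return a & b

def or_gate(a, b):
    return a | b

def xor_gate(a, b):
    return a ^ b

def nand_gate(a, b):
    return not_gate(and_gate(a, b))

def nor_gate(a, b):
    return not_gate(or_gate(a, b))

def truth_table(gate, n_inputs):
    """Return a list of (inputs, output) rows covering all 2^n input combinations."""
    return [(bits, gate(*bits)) for bits in product((0, 1), repeat=n_inputs)]

if __name__ == "__main__":
    for name, gate in [("AND", and_gate), ("OR", or_gate),
                       ("XOR", xor_gate), ("NAND", nand_gate)]:
        print(name, truth_table(gate, 2))
```

A two-input gate always has 4 rows in its table, which matches the 2^n rule for truth tables.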
Variation of logic gates
Gates with more than two inputs, and with inverted inputs or outputs
OR NOR NAND AND XNOR
Simplify logic circuit design
Array of logic gates and equivalent symbol
Equivalent single gate symbol
Example
What do these gates do?
Controlled data transfer
Enable = 0: Output = 0
Enable = 1: Output = Input
1's complement circuit
Compl = 0: Output = Input
Compl = 1: Output = NOT Input (each bit inverted, i.e. 11..11 XOR Input)
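The two gate arrays in this example can be mimicked with bitwise operators. The sketch below assumes an 8-bit word (the width is an arbitrary choice for illustration) and assumes the transfer circuit is an array of AND gates and the complement circuit an array of XOR gates; the function names are hypothetical.

```python
WIDTH = 8                     # assumed word width, for illustration only
MASK = (1 << WIDTH) - 1       # the all-ones pattern 11..11

def controlled_transfer(data, enable):
    """Array of AND gates: every data bit is ANDed with the enable bit."""
    return data & (MASK if enable else 0)

def controlled_complement(data, compl):
    """Array of XOR gates: every data bit is XORed with the compl bit."""
    return data ^ (MASK if compl else 0)

if __name__ == "__main__":
    x = 0b10110010
    print(format(controlled_transfer(x, 1), "08b"))    # passes x through
    print(format(controlled_complement(x, 1), "08b"))  # inverts every bit
```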
Combinational logic circuit
Represents logic functions; the output depends only on the current inputs
Ways to represent logic functions
Truth table: 2^n rows, where n is the number of input signals
Logic expression
Logic circuit diagram
[Figure: truth table and logic expressions for Y1, Y2, Y3]
Exercise:
Draw the corresponding logic circuit diagram
Draw a logic circuit to produce Y1, Y2, Y3
[Logic expressions for Y1, Y2, Y3]
Basic rules
Function minimization
Minimize the function below
Function minimization: Karnaugh map
Construct the Karnaugh map
Adjacent 1s are grouped into rectangular or square boxes of 2^n tiles
CD \ AB | 00 | 01 | 11 | 10
00 | 0 | 0 | 1 | 1
01 | 0 | 0 | 1 | 1
11 | 0 | 0 | 0 | 1
10 | 0 | 1 | 1 | 1
F = BCD' + AB' + AC' (' denotes complement)
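A minimized expression can be checked against its Karnaugh map by brute force over all 16 input combinations. The sketch below reads the map with AB across the columns and CD down the rows, and assumes the minimized form is F = BCD' + AB' + AC' (the complement bars appear to have been lost in the slide text, so this reading is an assumption). The names `f_map` and `f_min` are illustrative.

```python
from itertools import product

# Karnaugh map: keys are CD (rows), list positions follow AB in Gray order.
GRAY = ["00", "01", "11", "10"]
KMAP = {
    "00": [0, 0, 1, 1],
    "01": [0, 0, 1, 1],
    "11": [0, 0, 0, 1],
    "10": [0, 1, 1, 1],
}

def f_map(a, b, c, d):
    """Look the function value up in the Karnaugh map."""
    row = KMAP[f"{c}{d}"]
    return row[GRAY.index(f"{a}{b}")]

def f_min(a, b, c, d):
    """Assumed minimized form: F = B C D' + A B' + A C'."""
    return (b & c & (1 - d)) | (a & (1 - b)) | (a & (1 - c))

# Collect every input combination where the two disagree.
mismatches = [v for v in product((0, 1), repeat=4) if f_map(*v) != f_min(*v)]

if __name__ == "__main__":
    print("mismatches:", mismatches)
```

An empty mismatch list means the three product terms cover exactly the 1-cells of the map under this reading.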
Useful combinational circuits
Arithmetic components: adder, multiplier, ALU (next chapters).
Multiplexer, demultiplexer, encoder, decoder
What does this circuit do?
Multiplexer
Decoder/demultiplexer
[Figure: (a) 2-to-4 decoder with inputs y1, y0 and outputs x0 to x3; (b) decoder symbol; (c) demultiplexer, or decoder with enable input e]
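The behavior of a 2-to-4 decoder with enable can be sketched as follows. This is a behavioral model only, assuming active-high signals; the function name `decoder_2to4` is illustrative. Used as a demultiplexer, the enable input carries the data bit and (y1, y0) select which output receives it.

```python
def decoder_2to4(y1, y0, enable=1):
    """Return [x0, x1, x2, x3]; exactly one output is 1 when enabled."""
    selected = 2 * y1 + y0
    return [1 if (enable and i == selected) else 0 for i in range(4)]

if __name__ == "__main__":
    for y1 in (0, 1):
        for y0 in (0, 1):
            print(y1, y0, decoder_2to4(y1, y0))
```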
Sequential circuits
Output depends on the current inputs and on the previous output (stored state)
Basic circuit: flip-flop, latch
Output Q=D
Flip-flop operation
Synchronized with external clock signal
[Figure: D flip-flop symbols with inputs D and CLK and outputs Q and Q'; one samples the input on the rising edge, the other on the falling edge]
Sample timing diagram of positive edge triggered D flip-flop
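A positive-edge-triggered D flip-flop can be modeled in a few lines. The sketch below is a behavioral approximation that ignores setup and hold times; the class name and the sample waveforms are made up for illustration.

```python
class DFlipFlop:
    """Behavioral model of a positive-edge-triggered D flip-flop."""

    def __init__(self):
        self.q = 0            # stored state
        self._prev_clk = 0    # last clock level seen

    def tick(self, d, clk):
        # D is sampled only on a 0 -> 1 transition of the clock.
        if self._prev_clk == 0 and clk == 1:
            self.q = d
        self._prev_clk = clk
        return self.q

def simulate(d_wave, clk_wave):
    """Feed two equal-length 0/1 waveforms and return the Q waveform."""
    ff = DFlipFlop()
    return [ff.tick(d, clk) for d, clk in zip(d_wave, clk_wave)]

if __name__ == "__main__":
    clk = [0, 1, 0, 1, 0, 1]
    d = [1, 1, 0, 0, 1, 0]
    print(simulate(d, clk))
```

Between rising edges Q holds its value regardless of D, which is the "memory" that distinguishes sequential from combinational circuits.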
Flip-flop operation
What if we have this? (homework)
Sequential circuits
Useful circuits: Register file, RAM, Flash
[Figure: (a) Register file with random access: 2^h k-bit registers built from D flip-flops, a decoder on the h-bit write address, and muxes that select the k-bit read data 0 and read data 1 using the h-bit read addresses 0 and 1; control inputs are write enable and read enable. (b) Graphic symbol for register file. (c) FIFO symbol: k-bit input and output, push and pop controls, full and empty flags.]
Exercise
Design a 1-bit 4-to-1 multiplexer without enable signal, using basic logic gates
2. Computer Abstraction and Technology
Computer architecture: attributes of a system that are
Visible to a programmer
Or have a direct impact on the logical execution of a program.
Computer organization
The operational units and their interconnections that realize the architectural specifications.
Includes hardware details that are transparent to the programmer:
- control signals;
- interfaces between the computer and peripherals;
- the memory technology used.
Which is more hardware-related, and which more software-related?
Who are we?
What do you want to do in the future?
[Figure: Subfields or views in computer system engineering. From the high-level view (software) to the low-level view (hardware): application domains, application designer, system designer, computer designer, logic designer, circuit designer, electronic components; spans labeled Computer architecture and Computer organization.]
Classes of Computers
Supercomputers
Super fast + expensive for high-end applications
Server computers
Network based
High capacity, performance, reliability
Range from small servers to building sized
Desktop computers
General purpose, variety of software
Subject to cost/performance tradeoff
Embedded computers
Hidden as components of systems
Stringent power/performance/cost constraints
Dominant look and feel of computer classes
Embedded
PC
Server
Supercomputer
Price/performance of computer classes
Class | Typical price
Super | $Millions
Mainframe | $100s Ks
Server | $10s Ks
Workstation | $1000s
Personal | $100s
Embedded | $10s
Differences in scale, not in substance
The Processor Market
embedded growth >> desktop growth
Generations of Progress
Generation (begun) | Processor technology | Memory innovations | I/O devices introduced | Dominant look & feel
0 (1600s) | (Electro-)mechanical | Wheel, card | Lever, dial, punched card | Factory equipment
1 (1950s) | Vacuum tube | Magnetic drum | Paper tape, magnetic tape | Hall-size cabinet
2 (1960s) | Transistor | Magnetic core | Drum, printer, text terminal | Room-size mainframe
3 (1970s) | SSI/MSI | RAM/ROM chip | Disk, keyboard, video monitor | Desk-size mini
4 (1980s) | LSI/VLSI | SRAM/DRAM | Network, CD, mouse, sound | Desktop/laptop micro
5 (1990s) | ULSI/GSI/WSI, SOC | SDRAM, flash | Sensor/actuator, point/click | Invisible, embedded
[Ref. 2]
Technology Trends
Electronics technology continues to evolve
Increased capacity and performance
Reduced cost
Year Technology Relative performance/cost
1951 Vacuum tube 1
1965 Transistor 35
1975 Integrated circuit (IC) 900
1995 Very large scale IC (VLSI) 2,400,000
2005 Ultra large scale IC 6,200,000,000
[Figure: DRAM capacity over time]
[Textbook]
IC making
[Figure: The manufacturing process for an IC part. Silicon crystal ingot (15-30 cm, 30-60 cm) -> slicer -> blank wafer with defects (0.2 cm) -> processing: 20-30 steps -> patterned wafer (100s of simple or scores of complex processors) -> dicer -> die (~1 cm) -> die tester -> good die (~1 cm) -> mounting -> microchip or other part -> part tester -> usable part to ship]
Video: How an IC is made
Moore's Law
[Figure: processor performance (kIPS, MIPS, GIPS, TIPS) and memory chip capacity (kb, Mb, Gb, Tb) versus calendar year, 1980 to 2010. Processor points: 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, R10000; growth about x1.6 / yr (x2 / 18 mos, x10 / 5 yrs). Memory points: 64kb, 256kb, 1Mb, 4Mb, 16Mb, 64Mb, 256Mb, 1Gb; growth x4 / 3 yrs.]
How do we benefit from this?
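The growth rates annotated on the chart (about 1.6 per year, 2 per 18 months, and 10 per 5 years for processors; 4 per 3 years for memory chips) can be converted to a common per-year factor to see that they are roughly the same statement. A minimal sketch, with an illustrative helper name:

```python
def annual_factor(total_factor, years):
    """Per-year growth factor equivalent to total_factor over `years` years."""
    return total_factor ** (1.0 / years)

proc_from_doubling = annual_factor(2, 1.5)   # x2 every 18 months
proc_from_tenfold = annual_factor(10, 5)     # x10 every 5 years
mem_per_year = annual_factor(4, 3)           # x4 every 3 years

if __name__ == "__main__":
    print(round(proc_from_doubling, 3),
          round(proc_from_tenfold, 3),
          round(mem_per_year, 3))
```

All three come out close to 1.6 per year, so the processor and memory curves have nearly the same slope on the chart's logarithmic scale.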
Dual Core Itanium with 1.7B transistors (courtesy of Intel)
Feature size and die size
Cost/performance
[Figure: computer cost ($1, $1 K, $1 M, $1 G) versus calendar year, 1960 to 2020]
Homework
Why does CPU clock speed not exceed 3.5-4 GHz?
Computer Organization
Five classic components of a computer: input, output, memory, datapath, and control
datapath + control = processor (CPU)
A similar view
Usually, the Link unit is hidden
[Figure: Memory, Link, and Input/Output (to/from network) around the Processor; Control + Datapath = CPU; Input + Output = I/O]
Opening the box: anatomy of a computer
Opening the box: anatomy of a computer
The story of each component is worth a separate course!
Inside the Processor (CPU)
Datapath: performs operations on data
Control: sequences datapath, memory, ...
Cache memory
Small fast SRAM memory for immediate access to data
Inside the Processor
AMD Barcelona: 4 processor cores
http://www.techwarelabs.com/reviews/processors/barcelona/
Four out-of-order cores on one chip
1.9 GHz clock rate
65nm technology
Three levels of caches (L1, L2, L3) on chip
Integrated Northbridge
AMD's Barcelona Multicore Chip
[Figure: die layout with Core 1 to Core 4, a 512KB L2 cache per core, a 2MB shared L3 cache, and the Northbridge]
Hardware/software interface: below your program
Application software
Written in high-level language (HLL)
System software
Compiler: translates HLL code to machine code
Operating System: service code
- Handling input/output
- Managing memory and storage
- Scheduling tasks & sharing resources
Hardware
Processor, memory, I/O controllers
Below your program
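High-level language program (in C)
    void swap(int v[], int k)
    {
        int temp;          /* holds v[k] while the two elements trade places */
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }
C compiler (one-to-many translation)
Assembly language program (for MIPS)
    swap: sll $2, $5, 2      # $2 = k * 4 (byte offset of v[k])
          add $2, $4, $2     # $2 = address of v[k]
          lw  $15, 0($2)     # $15 = v[k]
          lw  $16, 4($2)     # $16 = v[k+1]
          sw  $16, 0($2)     # v[k] = $16
          sw  $15, 4($2)     # v[k+1] = $15
          jr  $31            # return to caller
Assembler (one-to-one translation)
Machine (object, binary) code (for MIPS)
    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    . . .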
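The one-to-one step from assembly to machine code can be illustrated by packing the fields of a MIPS R-type instruction by hand. The sketch below assumes the standard R-type layout (opcode 6 bits, rs 5, rt 5, rd 5, shamt 5, funct 6) with funct codes 0x00 for sll and 0x20 for add; the helper names are illustrative. It reproduces the two machine words shown for the first two instructions of swap.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack the six R-type fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)

def as_fields(word):
    """Format a word the way the slide does: opcode, rs, rt, remaining 16 bits."""
    bits = format(word, "032b")
    return f"{bits[:6]} {bits[6:11]} {bits[11:16]} {bits[16:]}"

SLL_FUNCT = 0x00   # funct code for sll
ADD_FUNCT = 0x20   # funct code for add

sll_word = encode_r_type(rs=0, rt=5, rd=2, shamt=2, funct=SLL_FUNCT)  # sll $2, $5, 2
add_word = encode_r_type(rs=4, rt=2, rd=2, shamt=0, funct=ADD_FUNCT)  # add $2, $4, $2

if __name__ == "__main__":
    print(as_fields(sll_word))
    print(as_fields(add_word))
```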
Levels of Program Code
High-level language
Level of abstraction closer to problem domain
Provides for productivity and portability
Assembly language
Textual representation of instructions
Hardware representation
Binary digits (bits)
Encoded instructions and data
Advantages of Higher-Level Languages?
Higher-level languages
Allow the programmer to think in a more natural language suited to the intended use (Fortran for scientific computation, Cobol for business programming, Lisp for symbol manipulation, Java for web programming, ...)
Improve programmer productivity: more understandable code that is easier to debug and validate
Improve program maintainability
Allow programs to be independent of the computer on which they are developed (compilers and assemblers can translate high-level language programs to the binary instructions of any machine)
Have led to optimizing compilers that produce very efficient assembly code optimized for the target machine
As a result, very little programming is done today at the assembler level
Problem
Computer organization evolves fast
New processors
New hardware
every day/month/year
How can a program run on different computers?
Instruction Set Architecture
Abstraction
Computer performance
To maximize performance, we need to minimize execution time
\[ \text{performance}_X = \frac{1}{\text{execution\_time}_X} \]
If computer X is n times faster than Y, then
\[ \frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n \]
Relative Performance Example
If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?
We know that A is n times faster than B if
\[ \frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution\_time}_B}{\text{execution\_time}_A} = n \]
The performance ratio is
\[ \frac{15}{10} = 1.5 \]
So A is 1.5 times faster than B
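The same ratio can be computed directly from the two execution times. A minimal sketch, with an illustrative function name:

```python
def relative_performance(time_x, time_y):
    """How many times faster X is than Y, given their execution times."""
    return time_y / time_x

# Computer A takes 10 s, computer B takes 15 s for the same program.
n = relative_performance(10, 15)

if __name__ == "__main__":
    print(f"A is {n} times faster than B")
```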
Performance Factors
CPU execution time (CPU time): the time the CPU spends working on a task
Does not include time waiting for I/O or running other programs
\[ \text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}} \]
Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
Review: Machine Clock Rate
Clock rate (clock cycles per second, in MHz or GHz) is the inverse of clock cycle time (clock period)
\[ CC = 1 / CR \]
[Figure: clock waveform with one clock period marked]
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10^-9 s) clock cycle => 1 GHz (10^9 Hz) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
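The conversions in this list all follow from CC = 1 / CR. A small sketch (the function name is illustrative) that reproduces a few of them:

```python
import math

def clock_rate_hz(cycle_time_s):
    """Clock rate in Hz for a given clock cycle time in seconds."""
    return 1.0 / cycle_time_s

if __name__ == "__main__":
    for period in (10e-9, 1e-9, 250e-12):
        print(f"{period:.3e} s -> {clock_rate_hz(period) / 1e6:.0f} MHz")
```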
Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B run at to run this program in 6 seconds? Assume that computer B will require 1.2 times as many clock cycles as computer A to run the program.
\[ \text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{clock rate}_A} \]
\[ \text{CPU clock cycles}_A = 10\ \text{sec} \times 2 \times 10^9\ \text{cycles/sec} = 20 \times 10^9\ \text{cycles} \]
\[ \text{CPU time}_B = \frac{1.2 \times 20 \times 10^9\ \text{cycles}}{\text{clock rate}_B} \]
\[ \text{clock rate}_B = \frac{1.2 \times 20 \times 10^9\ \text{cycles}}{6\ \text{seconds}} = 4\ \text{GHz} \]
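The steps of this example can be replayed numerically. The sketch below uses illustrative parameter names and follows the same two steps: recover A's cycle count from its time and clock rate, then scale it for B and divide by the target time.

```python
import math

def required_clock_rate(time_a_s, rate_a_hz, cycle_ratio, target_time_s):
    """Clock rate B needs, given that B uses cycle_ratio times A's cycles."""
    cycles_a = time_a_s * rate_a_hz        # cycles = time x clock rate
    cycles_b = cycle_ratio * cycles_a
    return cycles_b / target_time_s        # rate = cycles / time

rate_b = required_clock_rate(time_a_s=10, rate_a_hz=2e9,
                             cycle_ratio=1.2, target_time_s=6)

if __name__ == "__main__":
    print(f"{rate_b / 1e9:.1f} GHz")
```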
Clock Cycles per Instruction
Not all instructions take the same amount of time to execute
Average execution time ~ average clock cycles per instruction
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA
\[ \text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction} \]
CPI for this instruction class
Instruction class | A | B | C
CPI | 1 | 2 | 3
Using the Performance Equation
Computers A and B implement the same ISA. Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program and computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program. Which computer is faster and by how much?
Each computer executes the same number of instructions, I, so
\[ \text{CPU time}_A = I \times 2.0 \times 250\ \text{ps} = 500 \times I\ \text{ps} \]
\[ \text{CPU time}_B = I \times 1.2 \times 500\ \text{ps} = 600 \times I\ \text{ps} \]
Clearly, A is faster, by the ratio of execution times
\[ \frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution\_time}_B}{\text{execution\_time}_A} = \frac{600 \times I\ \text{ps}}{500 \times I\ \text{ps}} = 1.2 \]
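A quick numerical check of this comparison is sketched below. The instruction count is an arbitrary assumed value, since it cancels in the ratio; the function name is illustrative.

```python
import math

def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s

I = 1_000_000_000                      # assumed instruction count; cancels out
time_a = cpu_time(I, 2.0, 250e-12)     # computer A: CPI 2.0, 250 ps cycle
time_b = cpu_time(I, 1.2, 500e-12)     # computer B: CPI 1.2, 500 ps cycle
ratio = time_b / time_a

if __name__ == "__main__":
    print(f"A: {time_a:.2f} s, B: {time_b:.2f} s, A is {ratio:.2f}x faster")
```

Computer A wins despite its higher CPI because its clock cycle is half as long.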
The Performance Equation
Our basic performance equation is then
\[ \text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}} \]
Key factors that affect performance (CPU execution time)
The clock rate: CPU specification
CPI: varies by instruction type and ISA implementation
Instruction count: measured using profilers/simulators
Improving performance by CPI
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
What if a branch instruction took only one cycle?
What if two ALU instructions could be executed at once?
Op | Freq | CPI_i | Freq x CPI_i | Load = 2 cycles | Branch = 1 cycle | Two ALU at once
ALU | 50% | 1 | .5 | .5 | .5 | .25
Load | 20% | 5 | 1.0 | .4 | 1.0 | 1.0
Store | 10% | 3 | .3 | .3 | .3 | .3
Branch | 20% | 2 | .4 | .4 | .2 | .4
Sum | | | 2.2 | 1.6 | 2.0 | 1.95
Better data cache: CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
One-cycle branch: CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
Two ALU instructions at once: CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
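The three what-if questions can be recomputed from the instruction mix. In the sketch below, the dictionary layout and helper names are illustrative, and executing two ALU instructions at once is modeled as an effective ALU CPI of 0.5, which is an assumption consistent with the .25 entry.

```python
# Instruction mix: op -> (frequency, cycles per instruction)
BASE_MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

def average_cpi(mix):
    """Weighted average CPI: sum of frequency x CPI over all classes."""
    return sum(freq * cpi for freq, cpi in mix.values())

def with_cpi(mix, op, new_cpi):
    """Copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed

def percent_faster(old_cpi, new_cpi):
    """Speedup in percent, for equal instruction count and clock cycle."""
    return (old_cpi / new_cpi - 1) * 100

base = average_cpi(BASE_MIX)
better_cache = average_cpi(with_cpi(BASE_MIX, "Load", 2))
fast_branch = average_cpi(with_cpi(BASE_MIX, "Branch", 1))
dual_alu = average_cpi(with_cpi(BASE_MIX, "ALU", 0.5))

if __name__ == "__main__":
    for label, cpi in [("better cache", better_cache),
                       ("one-cycle branch", fast_branch),
                       ("two ALU at once", dual_alu)]:
        print(f"{label}: CPI {cpi:.2f}, {percent_faster(base, cpi):.1f}% faster")
```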
Dynamic Instruction Count
How many instructions are executed in this program fragment?
    250 instructions
    for i = 1, 100 do
        20 instructions
        for j = 1, 100 do
            40 instructions
            for k = 1, 100 do
                10 instructions
            endfor
        endfor
    endfor
Each "for" consists of two instructions: increment index, check exit condition
k loop: (2 + 10) instructions x 100 iterations = 1,200 instructions in all
j loop: (2 + 40 + 1,200) instructions x 100 iterations = 124,200 instructions in all
i loop: (2 + 20 + 124,200) instructions x 100 iterations = 12,422,200 instructions in all
Total: 250 + 12,422,200 = 12,422,450 instructions
Static count = 326
For loops such as "for i = 1, n" or "while x > 0", the iteration count is known only at run time
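The count can be reproduced by working from the innermost loop outward. The sketch below assumes, as the slide states, that each "for" costs two instructions per iteration; the helper name is illustrative.

```python
def loop_count(body, iterations, overhead=2):
    """Instructions executed by a loop: (overhead + body) per iteration."""
    return (overhead + body) * iterations

k_loop = loop_count(10, 100)             # innermost loop
j_loop = loop_count(40 + k_loop, 100)    # middle loop contains the k loop
i_loop = loop_count(20 + j_loop, 100)    # outer loop contains the j loop
dynamic_total = 250 + i_loop             # instructions actually executed

# Static count: each instruction in the program text counted once.
static_total = 250 + (2 + 20) + (2 + 40) + (2 + 10)

if __name__ == "__main__":
    print(k_loop, j_loop, i_loop, dynamic_total, static_total)
```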
How to improve performance?
Shorter clock cycle = faster clock rate
latest CPU technology
Smaller CPI
optimizing Instruction Set Architecture
Smaller instruction count
optimizing algorithm and compiler
To get best performance, multiple criteria are combined and considered at design time
a specific CPU for a specific class of computation problems
Faster Clock ≠ Shorter Running Time
Suppose addition takes 1 ns
Clock period = 1 ns; 1 cycle
Clock period = 0.5 ns; 2 cycles
In this example, addition time does not improve in going from 1 GHz to 2 GHz clock
[Figure: reaching the solution in 4 steps (1 GHz) versus 20 steps (2 GHz). Faster steps do not necessarily mean shorter travel time.]
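The point can be checked with a small calculation: an operation with a fixed circuit latency needs a whole number of clock cycles, so halving the period may simply double the cycle count. The sketch below works in picoseconds to keep the arithmetic exact; the function name is illustrative.

```python
def operation_time_ps(latency_ps, period_ps):
    """Return (cycles, total time in ps) for an operation of fixed latency."""
    cycles = -(-latency_ps // period_ps)   # ceiling division on integers
    return cycles, cycles * period_ps

at_1ghz = operation_time_ps(1000, 1000)   # 1 ns addition, 1 ns period
at_2ghz = operation_time_ps(1000, 500)    # 1 ns addition, 0.5 ns period

if __name__ == "__main__":
    print("1 GHz:", at_1ghz)
    print("2 GHz:", at_2ghz)
```

Both clocks complete the addition in 1000 ps; only the number of cycles changes.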
Review
Digital logic design
Computer technology
Computer performance evaluation
Next week: MIPS instruction set