Computer Architecture & Related Topics
"The architecture of a computer is the
interface between the machine and the software"
- Andris Padegs
IBM 360/370 Architect
Presentation Topics
Computer Architecture History
Single CPU Design
GPU Design (Brief)
Memory Architecture
Communications Architecture
Dual Processor Design
Parallel & Supercomputing Design
What is “Computer Architecture”?
[Figure: levels of abstraction, top to bottom]
  Application (Netscape)
  Operating System (Unix; Windows 9x)
  Compiler / Assembler              (Software)
  Instruction Set Architecture
  Processor / Memory / I/O system   (Hardware)
  Datapath & Control
  Digital Design
  Circuit Design
  transistors, IC layout

Key Idea: levels of abstraction
  hide unnecessary implementation details
  help us cope with the enormous complexity of real systems
What is "Computer Architecture"?
Computer Architecture =
Instruction Set Architecture (ISA)
  the one "true" language of a machine
  the boundary between hardware and software
  the hardware's specification; defines "what" a machine does
+ Machine Organization
  the "guts" of the machine; "how" the hardware works
  the implementation; must obey the ISA abstraction
Part 1: History and Single CPU
HISTORY!!!
One of the first computing devices to come about was...
The ABACUS!
The ENIAC : 1946
• Completed: 1946
• Programmed: plug board and switches
• Speed: 5,000 operations per second
• Input/output: cards, lights, switches, plugs
• Floor space: 1,000 square feet
The EDSAC (1949) and The UNIVAC I (1951)
EDSAC
Technology: vacuum tubes
Memory: 1K words
Speed: 714 operations per second
First practical stored-program computer
UNIVAC
Speed: 1,905 operations per second
Input/output: magnetic tape, unityper, printer
Memory size: 1,000 12-digit words in delay lines
Memory type: delay lines, magnetic tape
Technology: serial vacuum tubes, delay lines, magnetic tape
Floor space: 943 cubic feet
Cost: F.O.B. factory $750,000 plus $185,000 for a high-speed printer
Intel 4004 - 1971
• The first microprocessor
• 2,300 transistors
• 108 KHz
• 10 µm process
Intel Pentium IV - 2001
• “State of the art”
• 42 million transistors
• 2 GHz
• 0.13 µm process
• Could fit ~15,000 4004s on this chip!
Progression of The Architecture
Vacuum tubes -- 1940 – 1950
Transistors -- 1950 – 1964
Integrated circuits -- 1964 – 1971
Microprocessor chips -- 1971 – present
Intel 4004 1971
Growth in Microprocessor Performance
Current CPU Architecture
•Basic CPU Overview
Single Bus
Slow Performance
Example of Triple
Bus Architecture
Cost of Microprocessors
Intel microprocessor die
Moore’s Law
Technology Scaling
[Figure: Moore's Law. Transistor counts (10^3 to 10^8, log scale) vs. year, 1965-2005, for the i80x86 family (i4004, i8086, i80286, i80386, i80486, Pentium), M68K, MIPS (SU MIPS, R3010, R4400, R10000), and Alpha.]
° In ~1985 the single-chip 32-bit processor and the single-board computer emerged
° In the 2002+ timeframe, these may well look like mainframes compared to the single-chip computer (maybe 2 chips)
Microprocessor logic density / DRAM chip capacity:
Year  DRAM size
1980  64 Kb
1983  256 Kb
1986  1 Mb
1989  4 Mb
1992  16 Mb
1996  64 Mb
1999  256 Mb
2002  1 Gb
Technology Trends
Smaller feature sizes – higher speed, density
ECE/CS 752; copyright J. E. Smith, 2002 (Univ. of Wisconsin)
Technology Trends
Number of transistors doubles every 18 months
(amended to 24 months)
Motherboards / Chipsets / Sockets
•Chipset
In charge of:
•Memory Controller
•EIDE Controller
•PCI Bridge
•Real Time Clock
•DMA Controller
•IrDA Controller
•Keyboard
•Mouse
•Secondary Cache
•Low-Power CMOS SRAM
Sockets
•Socket 4 & 5
•Socket 7
•Socket 8
•Slot 1
•Slot A
[Picture: DX4-100 processor]
•Allows for real-time rendering of graphics on a small PC
•GPUs are true processing units
•Geforce3 contains 57 million transistors on a 0.15 micron manufacturing process
•Pentium 4 contains 42 million transistors on a 0.18 micron process
More GPU
Sources
DX4100 picture: Oneironaut, http://oneironaut.tripod.com/dx4100.jpg
Computer architecture overview picture: http://www.eecs.tulane.edu/courses/cpen201/slides/201Intro.pdf
CPU overview, single-bus and triple-bus architecture pictures: Roy M. Wnek, Virginia Tech CS5515 Lecture 5, http://www.nvc.cs.vt.edu/~wnek/cs5515/slide/Grad_Arch_5.PDF
Historical data and pictures: The Computer Museum History Center, http://www.computerhistory.org/
Intel motherboard diagram / Pentium 4 picture: Intel Corporation, http://www.intel.com
The abacus: Abacus-Online-Museum, http://www.hh.schule.de/metalltechnik-didaktik/users/luetjens/abakus/china/china.htm
Additional information: Clint Fleri, http://www.geocities.com/cfleri/
Memory functionality: Dana Angluin, http://zoo.cs.yale.edu/classes/cs201/Fall_2001/handouts/lecture-13/node4.html
Benchmark graphics: Digital Life, http://www.digit-life.com/articles/pentium4/index3.html
Chipset and socket information: Motherboards.org, http://www.motherboards.org/articlesd/tech-planations/17_2.html
AMD processor pictures: Tom's Hardware, http://www6.tomshardware.com/search/search.html?category=all&words=Athlon
GPU info: 4th Wave Inc., http://www.wave-report.com/tutorials/gpu.htm
NV20 design pictures: Digital Life, http://www.digit-life.com/articles/nv20/
Main Memory
Memory Hierarchy
DRAM vs. SRAM
•DRAM is short for Dynamic Random Access Memory
•SRAM is short for Static Random Access Memory
DRAM is dynamic in that, unlike SRAM, it needs its storage cells refreshed (given a new electronic charge) every few milliseconds. SRAM does not need refreshing because each bit is held by a circuit that is switched into one of two stable states, rather than by a storage cell that holds a charge in place.
Parity vs. Non-Parity
Parity is error detection that was developed to notify the user of data errors. A single bit is added to each byte of data; this bit checks the integrity of the other 8 bits while the byte is moved or stored.
Since memory errors are so rare, much of today's memory is non-parity.
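As a sketch of how the added bit works, assuming even parity (the helper names below are illustrative, not taken from any real memory controller):

```python
def even_parity_bit(byte):
    """Return the bit that makes the total number of 1s (data + parity) even."""
    return bin(byte).count("1") % 2

def passes_check(byte, parity):
    """Re-derive the parity bit and compare it with the stored one."""
    return even_parity_bit(byte) == parity

b = 0b10110100                         # four 1-bits, so the parity bit is 0
p = even_parity_bit(b)
assert passes_check(b, p)
assert not passes_check(b ^ 0b100, p)  # a single flipped bit is detected
```

Note that parity detects any odd number of flipped bits but cannot correct them, which is why it can only notify the user of the error.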
Six Generations of DRAMs
Year  Size
1980  64 Kb
1983  256 Kb
1986  1 Mb
1989  4 Mb
1992  16 Mb
1996  64 Mb
1999  256 Mb
2002  1 Gb
SIMM vs. DIMM vs. RIMM?
SIMM - Single In-line Memory Module
DIMM - Dual In-line Memory Module
RIMM - Rambus In-line Memory Module
SIMMs offer a 32-bit data path, while DIMMs offer a 64-bit data path. SIMMs have to be used in pairs on Pentiums and more recent processors.
RIMM is one of the latest designs. Because of the fast data transfer rate of these modules, a heat spreader (aluminum plate covering) is used on each module.
Evolution of Memory
1970       RAM / DRAM    4.77 MHz
1987       FPM           20 MHz
1995       EDO           20 MHz
1997       PC66 SDRAM    66 MHz
1998       PC100 SDRAM   100 MHz
1999       RDRAM         800 MHz
1999/2000  PC133 SDRAM   133 MHz
2000       DDR SDRAM     266 MHz
2001       EDRAM         450 MHz
Updated Technology Trends (Summary)
                     Capacity        Speed (latency)
Logic                4x in 4 years   2x in 3 years
DRAM                 4x in 3 years   2x in 10 years
Disk                 4x in 2 years   2x in 10 years
Network (bandwidth)  10x in 5 years
• Updates during your study period?? BS (4 yrs), MS (2 yrs), PhD (5 yrs)
• FPM - Fast Page Mode DRAM: traditional DRAM
• EDO - Extended Data Output: increases the read cycle between memory and the CPU
• SDRAM - Synchronous DRAM: synchronizes itself with the CPU bus and runs at higher clock speeds
• RDRAM - Rambus DRAM: DRAM with a very high bandwidth (1.6 GB/s)
• EDRAM - Enhanced DRAM: dynamic (power-refreshed) RAM that includes a small amount of static RAM (SRAM) inside a larger amount of DRAM, so that many memory accesses are to the faster SRAM. EDRAM is sometimes used as L1 and L2 memory and, together with Enhanced Synchronous DRAM, is known as cached DRAM.
Read Operation
• On a read, the CPU first tries to find the data in the cache; if it is not there, the cache is updated from main memory and then returns the data to the CPU.
Write Operation
• On a write, the CPU writes the information into both the cache and main memory.
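The read and write behavior above can be sketched as a toy direct-mapped, write-through cache (the class and its layout are illustrative assumptions, not a real hardware model):

```python
class DirectMappedCache:
    """Toy direct-mapped, write-through cache over a dict standing in for main memory."""

    def __init__(self, memory, num_lines=4):
        self.memory = memory
        self.num_lines = num_lines
        self.lines = {}                       # index -> (tag, value)

    def _split(self, addr):
        return divmod(addr, self.num_lines)   # (tag, index)

    def read(self, addr):
        tag, index = self._split(addr)
        line = self.lines.get(index)
        if line is not None and line[0] == tag:
            return line[1]                    # hit: serve from the cache
        value = self.memory[addr]             # miss: update cache from main memory
        self.lines[index] = (tag, value)
        return value

    def write(self, addr, value):
        tag, index = self._split(addr)
        self.lines[index] = (tag, value)      # write into the cache...
        self.memory[addr] = value             # ...and into main memory (write-through)

mem = {a: 0 for a in range(16)}
cache = DirectMappedCache(mem)
cache.write(5, 99)
assert cache.read(5) == 99 and mem[5] == 99
```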
References
http://www-ece.ucsd.edu/~weathers/ece30/downloads/Ch7_memory(4x).pdf
http://home.cfl.rr.com/bjp/eric/ComputerMemory.html
http://aggregate.org/EE380/JEL/ch1.pdf
Defining a Bus
A parallel circuit that connects the major components of a computer, allowing the transfer of electrical signals from one connected component to any other.
VESA - Video Electronics Standards Association
32-bit bus
Found mostly on 486 machines
Relied on the 486 processor to function; people started switching to the PCI bus because of this
Otherwise known as VLB
ISA - Industry Standard Architecture
Very old technology
Bus speed: 8 MHz
Maximum speed of 42.4 Mb/s
Very few ISA ports are found in modern machines
MCA - Micro Channel Bus
IBM's attempt to compete with the ISA bus
32-bit bus
Automatically configured cards (like Plug and Play)
Not compatible with ISA
EISA - Extended Industry Standard Architecture
Attempt to compete with IBM's MCA bus
Runs at an 8.33 MHz cycle rate
32-bit slots
Backward compatible with ISA
Went the way of MCA
PCI – Peripheral Component Interconnect
Speeds up to 960 Mb/s
Bus speed of 33 MHz
32-bit architecture
Developed by Intel in 1993
Synchronous or asynchronous
PCI popularized Plug and Play
Runs at half of the system bus speed
PCI-X
Up to 133 MHz bus speed
64-bit bandwidth
1 GB/s throughput
Backwards compatible with all PCI
Primarily developed for the increased I/O demands of technologies such as Fibre Channel, Gigabit Ethernet, and Ultra3 SCSI
AGP – Accelerated Graphics Port
Essentially a high-speed PCI port
Capable of running at 4 times the PCI bus speed (133 MHz)
Used for high-speed 3D graphics cards
Considered a port, not a bus: only two devices involved; not expandable
Bus         Width (bits)  Bus Speed (MHz)  Bus Bandwidth (MB/s)
8-bit ISA   8             8.3              7.9
16-bit ISA  16            8.3              15.9
EISA        32            8.3              31.8
VLB         32            33               127.2
PCI         32            33               127.2
AGP         32            66               254.3
AGP (X2)    32            66 x 2           508.6
AGP (X4)    32            66 x 4           1017.3
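The bandwidth column follows from width and clock. A sketch of the arithmetic, assuming the table reports MiB/s and that the AGP multipliers count as extra transfers per clock:

```python
def peak_bandwidth_mib(width_bits, clock_hz, transfers_per_clock=1):
    """Peak bandwidth = bytes per transfer x clock x transfers per clock, in MiB/s."""
    return width_bits / 8 * clock_hz * transfers_per_clock / 2**20

# 8-bit ISA at 8.3 MHz: one byte per transfer.
print(round(peak_bandwidth_mib(8, 8.3e6), 1))   # 7.9, matching the table
```

AGP X2 and X4 double and quadruple the transfers per clock at the same 66 MHz, which is why their rows scale by exactly 2x and 4x.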
IDE - Integrated Drive Electronics
Tons of other names: ATA, ATA/ATAPI, EIDE, ATA-2, Fast ATA, ATA-3, Ultra ATA, Ultra DMA
Good performance at a cheap cost
Most widely used interface for hard disks
SCSI - Small Computer System Interface “skuzzy”
Capable of handling internal/external peripherals
Speed anywhere from 80 – 640 Mb/s
Many types of SCSI
Type              Bus Speed (MB/s, max)  Bus Width (bits)  Max. Devices
SCSI-1            5                      8                 8
Fast SCSI         10                     8                 8
Fast Wide SCSI    20                     16                16
Ultra SCSI        20                     8                 8
Ultra Wide SCSI   40                     16                16
Ultra2 SCSI       40                     8                 8
Wide Ultra2 SCSI  80                     16                16
Ultra3 SCSI       160                    16                16
Ultra320 SCSI     320                    16                16
Serial Port
Uses a DB9 or DB25 connector
Adheres to the RS-232C spec
Capable of speeds up to 115 kb/s
USB 1.0
Hot plug-and-play
Full-speed USB devices signal at 12 Mb/s; low-speed devices use a 1.5 Mb/s subchannel
Up to 127 devices chained together
USB 2.0
Data rate of 480 Mb/s
USB On-The-Go
For portable devices
Limited host capability to communicate with selected other USB peripherals
A small USB connector to fit the mobile form factor
FireWire, i.e. IEEE 1394 and i.LINK
High-speed serial port
400 Mb/s transfer rate, 30 times faster than USB 1.0
Hot plug-and-play
PS/2 Port
Mini-DIN plug with 6 pins
Mouse port and keyboard port
Developed by IBM
Parallel Port, i.e. "printer port"
Old type, plus two "new" types: ECP (extended capabilities port) and EPP (enhanced parallel port)
Ten times faster than the old parallel port
Capable of bi-directional communication
Game Port
Uses a DB15 port
Used for joystick connection to the computer
Parallel Computer Architecture
Need for High Performance Computing
There's a need for tremendous computational capabilities in science, engineering, and business
There are applications that require gigabytes of memory and gigaflops of performance
What is a High Performance Computer?
Definition of a High Performance Computer: an HPC computer can solve large problems in a reasonable amount of time
Characteristics: fast computation, large memory, high-speed interconnect, high-speed input/output
How is an HPC computer made to go fast?
Make the sequential computation faster
Do more things in parallel
Applications
1. Weather prediction
2. Aircraft and automobile design
3. Artificial intelligence
4. Entertainment industry
5. Military applications
6. Financial analysis
7. Seismic exploration
8. Automobile crash testing
Who Makes High Performance Computers?
* SGI/Cray: Power Challenge Array, Origin-2000, T3D/T3E
* HP/Convex: SPP-1200, SPP-2000
* IBM: SP2
* Tandem
Trends in Computer Design
Performance of the fastest computer has grown exponentially from 1945 to the present, averaging a factor of 10 every five years
The growth flattened somewhat in the 1980s but is accelerating again as massively parallel computers become available
Increase in the No of Processors
Real World Sequential Processes
Sequential processes we find in the world. The passage of time is a classic example of a sequential process:
Day breaks as the sun rises in the morning.
Daytime has its sunlight and bright sky.
Dusk sees the sun setting on the horizon.
Nighttime descends with its moonlight, dark sky, and stars.
Music
An orchestra performance, where every instrument plays its own part, and playing together they make beautiful music.
Parallel Processes
Parallel Features of Computers
Various methods available on computers for doing work in parallel are :
Computing environment
Operating system
Memory
Disk
Arithmetic
Computing Environment - Parallel Features
Using a timesharing environment:
The computer's resources are shared among many users who are logged in simultaneously.
Your process uses the CPU for a time slice, and then is rolled out while another user's process is allowed to compute.
The opposite of this is dedicated mode, where yours is the only job running.
The computer overlaps computation and I/O:
While one process is writing to disk, the computer lets another process do some computation.
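A minimal sketch of overlapping computation with I/O, using a Python thread to stand in for the second process (the file path and data sizes are arbitrary choices for illustration):

```python
import os
import tempfile
import threading

def write_to_disk(path, data):
    with open(path, "w") as f:
        f.write(data)                       # the I/O runs on its own thread

path = os.path.join(tempfile.gettempdir(), "overlap_demo.txt")
writer = threading.Thread(target=write_to_disk, args=(path, "x" * 1_000_000))
writer.start()                              # start the disk write...
total = sum(i * i for i in range(100_000))  # ...while we keep computing
writer.join()                               # wait for the I/O to finish
```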
Operating System - Parallel Features
Using the UNIX background processing facility:
    a.out > results &
    man etime
Using the UNIX cron jobs feature:
You submit a job that will run at a later time.
Then you can play tennis while the computer continues to work.
This overlaps your computer work with your personal time.
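For instance, a crontab entry (the script and log paths below are hypothetical) that runs a job at 2:00 AM every day and captures its output:

```shell
# Added via `crontab -e`; fields are: minute hour day-of-month month day-of-week
0 2 * * * /home/user/run_job.sh > /home/user/results.log 2>&1
```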
Memory - Parallel Features
Memory Interleaving
Memory is divided into multiple banks, and consecutive data elements are interleaved among them.
There are multiple ports to memory. When the data elements that are spread across the banks are needed, they can be accessed and fetched in parallel.
Memory interleaving increases the memory bandwidth.
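The bank-selection idea can be sketched as follows (four banks assumed; the low-order address bits pick the bank):

```python
NUM_BANKS = 4

def bank_of(addr):
    # Low-order address bits select the bank, so consecutive addresses
    # land in different banks and can be fetched in parallel.
    return addr % NUM_BANKS

# Consecutive elements cycle through the banks:
assert [bank_of(a) for a in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
```

With one port per bank, a burst over addresses 0..3 touches four distinct banks and can proceed in a single parallel access.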
Memory - Parallel Features (Cont.)
Multiple levels of the memory hierarchy:
Global memory, which any processor can access
Memory local to a partition of the processors
Memory local to a single processor: cache memory, memory elements held in registers
Disk - Parallel Features
RAID disk: Redundant Array of Inexpensive Disks
Striped disk:
When a dataset is written to disk, it is broken into pieces that are written simultaneously to different disks in a RAID disk system.
When the same dataset is read back in, the pieces are read in parallel, and the original dataset is reassembled in memory.
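The striping and reassembly steps can be sketched as follows (chunk size and disk count are arbitrary illustrative choices; real RAID works at the block-device level):

```python
def stripe(data, num_disks, chunk=4):
    """Deal fixed-size chunks of data round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    for n, i in enumerate(range(0, len(data), chunk)):
        disks[n % num_disks].append(data[i:i + chunk])
    return disks

def reassemble(disks):
    """Read the chunks back in round-robin order and rejoin them."""
    total = sum(len(d) for d in disks)
    return b"".join(disks[i % len(disks)][i // len(disks)] for i in range(total))

disks = stripe(b"ABCDEFGHIJ", num_disks=3)
assert reassemble(disks) == b"ABCDEFGHIJ"
```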
Arithmetic - Parallel Features
We will examine the following features that lend themselves to parallel arithmetic:
Multiple functional units
Superscalar arithmetic
Instruction pipelining
Parallel Machine Model (Architectures)
von Neumann Computer
Multicomputer
A multicomputer comprises a number of von Neumann computers, or nodes, linked by an interconnection network.
In an idealized network, the cost of sending a message between two nodes is independent of both node location and other network traffic, but does depend on message length.
Locality
Scalability
Concurrency
Distributed Memory (MIMD)
MIMD means that each processor can execute a separate stream of instructions on its own local data.
Distributed memory means that memory is distributed among the processors rather than placed in a central location.
Difference between the multicomputer model and distributed-memory MIMD machines
In a real distributed-memory machine, the cost of sending a message is not independent of node location and other network traffic.
Examples of MIMD machines
MultiProcessor or Shared Memory MIMD
All processors share access to a common memory via a bus or a hierarchy of buses
Example of shared-memory MIMD: Silicon Graphics Challenge
SIMD Machines
All processors execute the same instruction stream, each on a different piece of data
Example of a SIMD machine: MasPar MP
Use of Cache
Why is cache used on parallel computers?
The advances in memory technology aren't keeping up with processor innovations; memory isn't speeding up as fast as the processors.
One way to alleviate the performance gap between main memory and the processors is to have a local cache.
The cache memory can be accessed faster than the main memory, so it keeps up with the fast processors and keeps them busy with data.
[Figure: processors 1-3, each with its own cache and local memory (Memory 1-3), connected through a network to a shared memory.]
Cache Coherence
What is cache coherence? Keeping a data element found in several caches current with the other copies and with the value in main memory.
Various cache coherence protocols are used:
snoopy protocol
directory-based protocol
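A toy write-invalidate snooping sketch: every cache watches ("snoops") bus writes and drops its stale copy. The class names and the write-through simplification are illustrative assumptions, not a full protocol implementation:

```python
class Bus:
    """Shared bus: carries invalidations to every snooping cache."""
    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, addr, source):
        for c in self.caches:
            if c is not source:
                c.data.pop(addr, None)      # other caches drop their stale copy

class Cache:
    """One per-processor cache snooping the shared bus."""
    def __init__(self, bus):
        self.data = {}                      # addr -> value for valid lines
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:           # miss: fetch the current value
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, source=self)
        self.data[addr] = value
        self.bus.memory[addr] = value       # write-through for simplicity

bus = Bus()
p1, p2 = Cache(bus), Cache(bus)
p1.write(0x10, 7)
assert p2.read(0x10) == 7   # p2 misses and fetches the current value
p2.write(0x10, 8)           # invalidates p1's copy
assert p1.read(0x10) == 8   # p1 re-fetches; the caches stay coherent
```

Directory-based protocols replace the broadcast with a directory that tracks which caches hold each line, which scales better than snooping a single bus.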
Various Other Issues Data Locality Issue Distributed Memory Issue Shared Memory Issue
Thanks
Stack (Bauer 1955)
Data structure with LIFO (last-in, first-out) principle
Two operations: push & pop
Present in basically all architectures
Used for both data and addresses
Stack Pointer (SP) as a special-purpose register
Often special instructions for push & pop
[Figure: stack contents during push/pop. Start: 29, 12, 42 with SP at 42; Pop removes 42 (SP at 12); Pop removes 12 (SP at 29); Push -5 leaves 29, -5 with SP at -5.]
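The push/pop behavior of a stack, in a minimal Python sketch (a list stands in for the memory region the stack pointer walks through):

```python
class Stack:
    """LIFO stack with push and pop, as used for both data and return addresses."""

    def __init__(self):
        self._items = []        # top of stack is the end of the list

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.pop()  # removes and returns the top element

s = Stack()
s.push(29); s.push(12); s.push(42)
assert s.pop() == 42            # last in, first out
s.push(-5)
assert s.pop() == -5
assert s.pop() == 12
```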