Computer Architecture & Related Topics
"The architecture of a computer is the
interface between the machine and the software"
- Andris Padegs
IBM 360/370 Architect
Presentation Topics
Computer Architecture History
Single CPU Design
GPU Design (Brief)
Memory Architecture
Communications Architecture
Dual Processor Design
Parallel & Supercomputing Design
What is “Computer Architecture”?
[Figure: levels of abstraction, top to bottom]
  Application (Netscape)
  Operating System (Unix; Windows 9x)
  Compiler / Assembler              (Software)
  Instruction Set Architecture
  Processor / Memory / I/O system   (Hardware)
  Datapath & Control
  Digital Design
  Circuit Design
  transistors, IC layout

Key Idea: levels of abstraction
  hide unnecessary implementation details
  help us cope with the enormous complexity of real systems
What is "Computer Architecture"?
Computer Architecture =
Instruction Set Architecture (ISA)
  the one "true" language of a machine
  the boundary between hardware and software
  the hardware's specification; defines "what" a machine does
+ Machine Organization
  the "guts" of the machine; "how" the hardware works
  the implementation; must obey the ISA abstraction
Part 1: History and Single CPU
HISTORY!!!
One of the first computing devices to come about was...
The ABACUS!
The ENIAC : 1946
• Completed: 1946
• Programmed: plug board and switches
• Speed: 5,000 operations per second
• Input/output: cards, lights, switches, plugs
• Floor space: 1,000 square feet
The EDSAC (1949) and The UNIVAC I (1951)
EDSAC
Technology: vacuum tubes
Memory: 1K words
Speed: 714 operations per second
First practical stored-program computer
UNIVAC
Speed: 1,905 operations per second
Input/output: magnetic tape, unityper, printer
Memory size: 1,000 12-digit words in delay lines
Memory type: delay lines, magnetic tape
Technology: serial vacuum tubes, delay lines, magnetic tape
Floor space: 943 cubic feet
Cost: F.O.B. factory $750,000 plus $185,000 for a high-speed printer
Intel 4004 - 1971
• The first microprocessor
• 2,300 transistors
• 108 KHz
• 10 µm process
Intel Pentium IV - 2001
• “State of the art”
• 42 million transistors
• 2 GHz
• 0.13 µm process
• Could fit ~15,000 4004s on this chip!
Progression of The Architecture
Vacuum tubes -- 1940 – 1950
Transistors -- 1950 – 1964
Integrated circuits -- 1964 – 1971
Microprocessor chips -- 1971 – present
Intel 4004 1971
Growth in Microprocessor Performance
Current CPU Architecture
•Basic CPU Overview
Single Bus
Slow Performance
Example of Triple
Bus Architecture
Cost of Microprocessors
Intel microprocessor die
Moore’s Law
Technology Scaling
[Figure: Moore's Law. Transistor counts (10^3 to 10^8, log scale) vs. year, 1965-2005, for the i80x86 family (i4004, i8086, i80286, i80386, i80486, Pentium), M68K, MIPS (SU MIPS, R3010, R4400, R10000), and Alpha.]
° In ~1985 the single-chip 32-bit processor and the single-board computer emerged
° In the 2002+ timeframe, these may well look like mainframes compared to the single-chip computer (maybe 2 chips)
Microprocessor logic density / DRAM chip capacity:
Year  DRAM size
1980  64 Kb
1983  256 Kb
1986  1 Mb
1989  4 Mb
1992  16 Mb
1996  64 Mb
1999  256 Mb
2002  1 Gb
Technology Trends
Smaller feature sizes – higher speed, density
ECE/CS 752; copyright J. E. Smith, 2002 (Univ. of Wisconsin)
Technology Trends
Number of transistors doubles every 18 months
(amended to 24 months)
Motherboards / Chipsets / Sockets
•Chipset
In charge of:
•Memory Controller
•EIDE Controller
•PCI Bridge
•Real Time Clock
•DMA Controller
•IrDA Controller
•Keyboard
•Mouse
•Secondary Cache
•Low-Power CMOS SRAM
Sockets
•Socket 4 & 5
•Socket 7
•Socket 8
•Slot 1
•Slot A
[Picture: DX4-100 processor]
•Allows for real-time rendering of graphics on a small PC
•GPUs are true processing units
•Geforce3 contains 57 million transistors on a 0.15 micron manufacturing process
•Pentium 4 contains 42 million transistors on a 0.18 micron process
More GPU
Sources
DX4100 picture: Oneironaut, http://oneironaut.tripod.com/dx4100.jpg
Computer architecture overview picture: http://www.eecs.tulane.edu/courses/cpen201/slides/201Intro.pdf
CPU overview, single-bus and triple-bus architecture pictures: Roy M. Wnek, Virginia Tech CS5515 Lecture 5, http://www.nvc.cs.vt.edu/~wnek/cs5515/slide/Grad_Arch_5.PDF
Historical data and pictures: The Computer Museum History Center, http://www.computerhistory.org/
Intel motherboard diagram / Pentium 4 picture: Intel Corporation, http://www.intel.com
The abacus: Abacus-Online-Museum, http://www.hh.schule.de/metalltechnik-didaktik/users/luetjens/abakus/china/china.htm
Additional information: Clint Fleri, http://www.geocities.com/cfleri/
Memory functionality: Dana Angluin, http://zoo.cs.yale.edu/classes/cs201/Fall_2001/handouts/lecture-13/node4.html
Benchmark graphics: Digital Life, http://www.digit-life.com/articles/pentium4/index3.html
Chipset and socket information: Motherboards.org, http://www.motherboards.org/articlesd/tech-planations/17_2.html
AMD processor pictures: Tom's Hardware, http://www6.tomshardware.com/search/search.html?category=all&words=Athlon
GPU info: 4th Wave Inc., http://www.wave-report.com/tutorials/gpu.htm
NV20 design pictures: Digital Life, http://www.digit-life.com/articles/nv20/
Main Memory
Memory Hierarchy
DRAM vs. SRAM
•DRAM is short for Dynamic Random Access Memory
•SRAM is short for Static Random Access Memory
DRAM is dynamic in that, unlike SRAM, it needs its storage cells refreshed (given a new electronic charge) every few milliseconds. SRAM does not need refreshing because each bit is held by a circuit that is switched into one of two stable states, rather than by a storage cell that holds a charge in place.
Parity vs. Non-Parity
Parity is error detection that was developed to notify the user of data errors. A single bit is added to each byte of data; this bit checks the integrity of the other 8 bits while the byte is moved or stored.
Since memory errors are so rare, much of today's memory is non-parity.
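As a sketch of how the added bit works, assuming even parity (the helper names below are illustrative, not taken from any real memory controller):

```python
def even_parity_bit(byte):
    """Return the bit that makes the total number of 1s (data + parity) even."""
    return bin(byte).count("1") % 2

def passes_check(byte, parity):
    """Re-derive the parity bit and compare it with the stored one."""
    return even_parity_bit(byte) == parity

b = 0b10110100                         # four 1-bits, so the parity bit is 0
p = even_parity_bit(b)
assert passes_check(b, p)
assert not passes_check(b ^ 0b100, p)  # a single flipped bit is detected
```

Note that parity detects any odd number of flipped bits but cannot correct them, which is why it can only notify the user of the error.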
Six Generations of DRAMs
Year  Size
1980  64 Kb
1983  256 Kb
1986  1 Mb
1989  4 Mb
1992  16 Mb
1996  64 Mb
1999  256 Mb
2002  1 Gb
SIMM vs. DIMM vs. RIMM?
SIMM - Single In-line Memory Module
DIMM - Dual In-line Memory Module
RIMM - Rambus In-line Memory Module
SIMMs offer a 32-bit data path, while DIMMs offer a 64-bit data path. SIMMs have to be used in pairs on Pentiums and more recent processors.
RIMM is one of the latest designs. Because of the fast data transfer rate of these modules, a heat spreader (aluminum plate covering) is used on each module.
Evolution of Memory
1970       RAM / DRAM    4.77 MHz
1987       FPM           20 MHz
1995       EDO           20 MHz
1997       PC66 SDRAM    66 MHz
1998       PC100 SDRAM   100 MHz
1999       RDRAM         800 MHz
1999/2000  PC133 SDRAM   133 MHz
2000       DDR SDRAM     266 MHz
2001       EDRAM         450 MHz
Updated Technology Trends (Summary)
                     Capacity        Speed (latency)
Logic                4x in 4 years   2x in 3 years
DRAM                 4x in 3 years   2x in 10 years
Disk                 4x in 2 years   2x in 10 years
Network (bandwidth)  10x in 5 years
• Updates during your study period?? BS (4 yrs), MS (2 yrs), PhD (5 yrs)
• FPM - Fast Page Mode DRAM: traditional DRAM
• EDO - Extended Data Output: increases the read cycle between memory and the CPU
• SDRAM - Synchronous DRAM: synchronizes itself with the CPU bus and runs at higher clock speeds
• RDRAM - Rambus DRAM: DRAM with a very high bandwidth (1.6 GB/s)
• EDRAM - Enhanced DRAM: dynamic (power-refreshed) RAM that includes a small amount of static RAM (SRAM) inside a larger amount of DRAM, so that many memory accesses are to the faster SRAM. EDRAM is sometimes used as L1 and L2 memory and, together with Enhanced Synchronous DRAM, is known as cached DRAM.
Read Operation
• On a read, the CPU first tries to find the data in the cache; if it is not there, the cache is updated from main memory and then returns the data to the CPU.
Write Operation
• On a write, the CPU writes the information into both the cache and main memory.
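The read and write behavior above can be sketched as a toy direct-mapped, write-through cache (the class and its layout are illustrative assumptions, not a real hardware model):

```python
class DirectMappedCache:
    """Toy direct-mapped, write-through cache over a dict standing in for main memory."""

    def __init__(self, memory, num_lines=4):
        self.memory = memory
        self.num_lines = num_lines
        self.lines = {}                       # index -> (tag, value)

    def _split(self, addr):
        return divmod(addr, self.num_lines)   # (tag, index)

    def read(self, addr):
        tag, index = self._split(addr)
        line = self.lines.get(index)
        if line is not None and line[0] == tag:
            return line[1]                    # hit: serve from the cache
        value = self.memory[addr]             # miss: update cache from main memory
        self.lines[index] = (tag, value)
        return value

    def write(self, addr, value):
        tag, index = self._split(addr)
        self.lines[index] = (tag, value)      # write into the cache...
        self.memory[addr] = value             # ...and into main memory (write-through)

mem = {a: 0 for a in range(16)}
cache = DirectMappedCache(mem)
cache.write(5, 99)
assert cache.read(5) == 99 and mem[5] == 99
```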
References
http://www-ece.ucsd.edu/~weathers/ece30/downloads/Ch7_memory(4x).pdf
http://home.cfl.rr.com/bjp/eric/ComputerMemory.html
http://aggregate.org/EE380/JEL/ch1.pdf
Defining a Bus
A parallel circuit that connects the major components of a computer, allowing the transfer of electrical signals from one connected component to any other.
VESA - Video Electronics Standards Association
32-bit bus
Found mostly on 486 machines
Relied on the 486 processor to function; people started switching to the PCI bus because of this
Otherwise known as VLB
ISA - Industry Standard Architecture
Very old technology
Bus speed: 8 MHz
Maximum speed of 42.4 Mb/s
Very few ISA ports are found in modern machines
MCA - Micro Channel Bus
IBM's attempt to compete with the ISA bus
32-bit bus
Automatically configured cards (like Plug and Play)
Not compatible with ISA
EISA - Extended Industry Standard Architecture
Attempt to compete with IBM's MCA bus
Runs at an 8.33 MHz cycle rate
32-bit slots
Backward compatible with ISA
Went the way of MCA
PCI – Peripheral Component Interconnect
Speeds up to 960 Mb/s
Bus speed of 33 MHz
32-bit architecture
Developed by Intel in 1993
Synchronous or asynchronous
PCI popularized Plug and Play
Runs at half of the system bus speed
PCI-X
Up to 133 MHz bus speed
64-bit bandwidth
1 GB/s throughput
Backwards compatible with all PCI
Primarily developed for the increased I/O demands of technologies such as Fibre Channel, Gigabit Ethernet, and Ultra3 SCSI
AGP – Accelerated Graphics Port
Essentially a high-speed PCI port
Capable of running at 4 times the PCI bus speed (133 MHz)
Used for high-speed 3D graphics cards
Considered a port, not a bus: only two devices involved; not expandable
Bus         Width (bits)  Bus Speed (MHz)  Bus Bandwidth (MB/s)
8-bit ISA   8             8.3              7.9
16-bit ISA  16            8.3              15.9
EISA        32            8.3              31.8
VLB         32            33               127.2
PCI         32            33               127.2
AGP         32            66               254.3
AGP (X2)    32            66 x 2           508.6
AGP (X4)    32            66 x 4           1017.3
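The bandwidth column follows from width and clock. A sketch of the arithmetic, assuming the table reports MiB/s and that the AGP multipliers count as extra transfers per clock:

```python
def peak_bandwidth_mib(width_bits, clock_hz, transfers_per_clock=1):
    """Peak bandwidth = bytes per transfer x clock x transfers per clock, in MiB/s."""
    return width_bits / 8 * clock_hz * transfers_per_clock / 2**20

# 8-bit ISA at 8.3 MHz: one byte per transfer.
print(round(peak_bandwidth_mib(8, 8.3e6), 1))   # 7.9, matching the table
```

AGP X2 and X4 double and quadruple the transfers per clock at the same 66 MHz, which is why their rows scale by exactly 2x and 4x.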
IDE - Integrated Drive Electronics
Tons of other names: ATA, ATA/ATAPI, EIDE, ATA-2, Fast ATA, ATA-3, Ultra ATA, Ultra DMA
Good performance at a cheap cost
Most widely used interface for hard disks
SCSI - Small Computer System Interface “skuzzy”
Capable of handling internal/external peripherals
Speed anywhere from 80 – 640 Mb/s
Many types of SCSI
Type              Bus Speed (MB/s, max)  Bus Width (bits)  Max. Devices
SCSI-1            5                      8                 8
Fast SCSI         10                     8                 8
Fast Wide SCSI    20                     16                16
Ultra SCSI        20                     8                 8
Ultra Wide SCSI   40                     16                16
Ultra2 SCSI       40                     8                 8
Wide Ultra2 SCSI  80                     16                16
Ultra3 SCSI       160                    16                16
Ultra320 SCSI     320                    16                16
Serial Port
Uses a DB9 or DB25 connector
Adheres to the RS-232C spec
Capable of speeds up to 115 kb/s
USB 1.0
Hot plug-and-play
Full-speed USB devices signal at 12 Mb/s; low-speed devices use a 1.5 Mb/s subchannel
Up to 127 devices chained together
USB 2.0
Data rate of 480 Mb/s
USB On-The-Go
For portable devices
Limited host capability to communicate with selected other USB peripherals
A small USB connector to fit the mobile form factor
FireWire, i.e. IEEE 1394 and i.LINK
High-speed serial port
400 Mb/s transfer rate, 30 times faster than USB 1.0
Hot plug-and-play
PS/2 Port
Mini-DIN plug with 6 pins
Mouse port and keyboard port
Developed by IBM
Parallel Port, i.e. "printer port"
Old type, plus two "new" types: ECP (extended capabilities port) and EPP (enhanced parallel port)
Ten times faster than the old parallel port
Capable of bi-directional communication
Game Port
Uses a DB15 port
Used for joystick connection to the computer
Parallel Computer Architecture
Need for High Performance Computing
There's a need for tremendous computational capabilities in science, engineering, and business
There are applications that require gigabytes of memory and gigaflops of performance
What is a High Performance Computer?
Definition of a High Performance Computer: an HPC computer can solve large problems in a reasonable amount of time
Characteristics: fast computation, large memory, high-speed interconnect, high-speed input/output
How is an HPC computer made to go fast?
Make the sequential computation faster
Do more things in parallel
Applications
1. Weather prediction
2. Aircraft and automobile design
3. Artificial intelligence
4. Entertainment industry
5. Military applications
6. Financial analysis
7. Seismic exploration
8. Automobile crash testing
Who Makes High Performance Computers?
* SGI/Cray: Power Challenge Array, Origin-2000, T3D/T3E
* HP/Convex: SPP-1200, SPP-2000
* IBM: SP2
* Tandem
Trends in Computer Design
Performance of the fastest computer has grown exponentially from 1945 to the present, averaging a factor of 10 every five years
The growth flattened somewhat in the 1980s but is accelerating again as massively parallel computers become available
Increase in the No of Processors
Real World Sequential Processes
Sequential processes we find in the world. The passage of time is a classic example of a sequential process:
Day breaks as the sun rises in the morning.
Daytime has its sunlight and bright sky.
Dusk sees the sun setting on the horizon.
Nighttime descends with its moonlight, dark sky, and stars.
Music
An orchestra performance, where every instrument plays its own part, and playing together they make beautiful music.
Parallel Processes
Parallel Features of Computers
Various methods available on computers for doing work in parallel are :
Computing environment
Operating system
Memory
Disk
Arithmetic
Computing Environment - Parallel Features
Using a timesharing environment:
The computer's resources are shared among many users who are logged in simultaneously.
Your process uses the CPU for a time slice, and then is rolled out while another user's process is allowed to compute.
The opposite of this is dedicated mode, where yours is the only job running.
The computer overlaps computation and I/O:
While one process is writing to disk, the computer lets another process do some computation.
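A minimal sketch of overlapping computation with I/O, using a Python thread to stand in for the second process (the file path and data sizes are arbitrary choices for illustration):

```python
import os
import tempfile
import threading

def write_to_disk(path, data):
    with open(path, "w") as f:
        f.write(data)                       # the I/O runs on its own thread

path = os.path.join(tempfile.gettempdir(), "overlap_demo.txt")
writer = threading.Thread(target=write_to_disk, args=(path, "x" * 1_000_000))
writer.start()                              # start the disk write...
total = sum(i * i for i in range(100_000))  # ...while we keep computing
writer.join()                               # wait for the I/O to finish
```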
Operating System - Parallel Features
Using the UNIX background processing facility:
    a.out > results &
    man etime
Using the UNIX cron jobs feature:
You submit a job that will run at a later time.
Then you can play tennis while the computer continues to work.
This overlaps your computer work with your personal time.
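For instance, a crontab entry (the script and log paths below are hypothetical) that runs a job at 2:00 AM every day and captures its output:

```shell
# Added via `crontab -e`; fields are: minute hour day-of-month month day-of-week
0 2 * * * /home/user/run_job.sh > /home/user/results.log 2>&1
```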
Memory - Parallel Features
Memory Interleaving
Memory is divided into multiple banks, and consecutive data elements are interleaved among them.
There are multiple ports to memory. When the data elements that are spread across the banks are needed, they can be accessed and fetched in parallel.
Memory interleaving increases the memory bandwidth.
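The bank-selection idea can be sketched as follows (four banks assumed; the low-order address bits pick the bank):

```python
NUM_BANKS = 4

def bank_of(addr):
    # Low-order address bits select the bank, so consecutive addresses
    # land in different banks and can be fetched in parallel.
    return addr % NUM_BANKS

# Consecutive elements cycle through the banks:
assert [bank_of(a) for a in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
```

With one port per bank, a burst over addresses 0..3 touches four distinct banks and can proceed in a single parallel access.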
Memory - Parallel Features (Cont.)
Multiple levels of the memory hierarchy:
Global memory, which any processor can access
Memory local to a partition of the processors
Memory local to a single processor: cache memory, memory elements held in registers
Disk - Parallel Features
RAID disk: Redundant Array of Inexpensive Disks
Striped disk:
When a dataset is written to disk, it is broken into pieces that are written simultaneously to different disks in a RAID disk system.
When the same dataset is read back in, the pieces are read in parallel, and the original dataset is reassembled in memory.
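The striping and reassembly steps can be sketched as follows (chunk size and disk count are arbitrary illustrative choices; real RAID works at the block-device level):

```python
def stripe(data, num_disks, chunk=4):
    """Deal fixed-size chunks of data round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    for n, i in enumerate(range(0, len(data), chunk)):
        disks[n % num_disks].append(data[i:i + chunk])
    return disks

def reassemble(disks):
    """Read the chunks back in round-robin order and rejoin them."""
    total = sum(len(d) for d in disks)
    return b"".join(disks[i % len(disks)][i // len(disks)] for i in range(total))

disks = stripe(b"ABCDEFGHIJ", num_disks=3)
assert reassemble(disks) == b"ABCDEFGHIJ"
```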
Arithmetic - Parallel Features
We will examine the following features that lend themselves to parallel arithmetic:
Multiple functional units
Superscalar arithmetic
Instruction pipelining
Parallel Machine Model (Architectures)
von Neumann Computer
Multicomputer
A multicomputer comprises a number of von Neumann computers, or nodes, linked by an interconnection network.
In an idealized network, the cost of sending a message between two nodes is independent of both node location and other network traffic, but does depend on message length.
Locality
Scalability
Concurrency
Distributed Memory (MIMD)
MIMD means that each processor can execute a separate stream of instructions on its own local data.
Distributed memory means that memory is distributed among the processors rather than placed in a central location.
Difference between the multicomputer model and distributed-memory MIMD machines
In a real distributed-memory machine, the cost of sending a message is not independent of node location and other network traffic.
Examples of MIMD machines
MultiProcessor or Shared Memory MIMD
All processors share access to a common memory via a bus or a hierarchy of buses
Example of shared-memory MIMD: Silicon Graphics Challenge
SIMD Machines
All processors execute the same instruction stream, each on a different piece of data
Example of a SIMD machine: MasPar MP
Use of Cache
Why is cache used on parallel computers?
The advances in memory technology aren't keeping up with processor innovations; memory isn't speeding up as fast as the processors.
One way to alleviate the performance gap between main memory and the processors is to have a local cache.
The cache memory can be accessed faster than the main memory, so it keeps up with the fast processors and keeps them busy with data.
[Figure: processors 1-3, each with its own cache and local memory (Memory 1-3), connected through a network to a shared memory.]
Cache Coherence
What is cache coherence? Keeping a data element found in several caches current with the other copies and with the value in main memory.
Various cache coherence protocols are used:
snoopy protocol
directory-based protocol
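A toy write-invalidate snooping sketch: every cache watches ("snoops") bus writes and drops its stale copy. The class names and the write-through simplification are illustrative assumptions, not a full protocol implementation:

```python
class Bus:
    """Shared bus: carries invalidations to every snooping cache."""
    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, addr, source):
        for c in self.caches:
            if c is not source:
                c.data.pop(addr, None)      # other caches drop their stale copy

class Cache:
    """One per-processor cache snooping the shared bus."""
    def __init__(self, bus):
        self.data = {}                      # addr -> value for valid lines
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:           # miss: fetch the current value
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, source=self)
        self.data[addr] = value
        self.bus.memory[addr] = value       # write-through for simplicity

bus = Bus()
p1, p2 = Cache(bus), Cache(bus)
p1.write(0x10, 7)
assert p2.read(0x10) == 7   # p2 misses and fetches the current value
p2.write(0x10, 8)           # invalidates p1's copy
assert p1.read(0x10) == 8   # p1 re-fetches; the caches stay coherent
```

Directory-based protocols replace the broadcast with a directory that tracks which caches hold each line, which scales better than snooping a single bus.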
Various Other Issues Data Locality Issue Distributed Memory Issue Shared Memory Issue
Thanks
Stack (Bauer 1955)
Data structure with LIFO (last-in, first-out) principle
Two operations: push & pop
Present in basically all architectures
Used for both data and addresses
Stack Pointer (SP) as a special-purpose register
Often special instructions for push & pop
[Figure: stack contents during push/pop. Start: 29, 12, 42 with SP at 42; Pop removes 42 (SP at 12); Pop removes 12 (SP at 29); Push -5 leaves 29, -5 with SP at -5.]
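The push/pop behavior of a stack, in a minimal Python sketch (a list stands in for the memory region the stack pointer walks through):

```python
class Stack:
    """LIFO stack with push and pop, as used for both data and return addresses."""

    def __init__(self):
        self._items = []        # top of stack is the end of the list

    def push(self, x):
        self._items.append(x)

    def pop(self):
        return self._items.pop()  # removes and returns the top element

s = Stack()
s.push(29); s.push(12); s.push(42)
assert s.pop() == 42            # last in, first out
s.push(-5)
assert s.pop() == -5
assert s.pop() == 12
```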