ADVANCED COMPUTER ARCHITECTURE

Parallelism, Scalability, Programmability

Dr Mahmoud Fathy

Iran University of Science and Technology, Computer Faculty

What is Computer Architecture

Forces on Computer Architecture

A Take on Moore’s Law

• Moore’s Law (1965): the number of transistors per square inch doubles every year

• Reality: the number of transistors per square inch doubles every 18 months

• CPU speed increases 54% per year

• DRAM capacity increases 80% per year (quadrupled every 3 years)

Year    Technology            Relative performance
1951    Vacuum tube           1
1965    Transistor            35
1975    Integrated circuit    900
1995    VLSI                  2,400,000
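The growth figures above mix annual rates with doubling periods. As a rough sketch (the helper names `doubling_time_years` and `annual_rate` are illustrative, not from the slides), the two forms can be converted into each other, assuming steady compound growth:

```python
import math


def doubling_time_years(rate):
    """Years needed to double at a compound annual growth rate (0.54 means 54%)."""
    return math.log(2) / math.log(1 + rate)


def annual_rate(doubling_years):
    """Compound annual growth rate implied by a given doubling period in years."""
    return 2 ** (1 / doubling_years) - 1


# CPU speed growing 54% per year doubles roughly every 1.6 years.
print(round(doubling_time_years(0.54), 2))

# Doubling every 18 months (the same pace as quadrupling every 3 years)
# corresponds to roughly 59% growth per year.
print(round(annual_rate(1.5), 3))
```

The conversion only holds for a constant growth rate, which real technology trends follow only approximately.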


Processor Performance

Clever Architecture Design

Processor-Memory Performance Gap

Technology Trend vs. Power Dissipation

“Hot” Computer

Importance of Low Power Processor Design

Computer Food Chain

Computer Engineering Methodology

Measurement and Evaluation

Measurement and Evaluation (contd)

Three components in computer architecture evaluation:

- Simulators

- Benchmarks

- Evaluation Metrics (Performance, Cost, Power)


A Computer Architecture Simulator

A Taxonomy of Simulator Tools

Functional vs. Performance Simulators

Execution vs. Trace-Driven Simulation

Computer Performance

• History of computer performance metrics
• Execution time of a single instruction (such as addition)
• Instruction mix
• MIPS
• MFLOPS (with the introduction of supercomputers, 1970-1980)
• Real programs (difficult to run, and tied to different operating systems)
• Toy programs (System Performance Evaluation Cooperative)
• 1988: SPEC was established by Sun, MIPS, DEC and Apollo


SPEC History

• SPEC89: CPU-intensive (6 floating-point + 4 integer programs)
• SPEC92 (SPECint, SPECfp), deleting programs such as matrix300 from SPEC89
• SPEC95
• SPEC2000 (12 integer programs in CINT2000, 14 floating-point programs in CFP2000)
• SPECviewperf (3D rendering)
• SPECapc
    - Pro/Engineer
    - SolidWorks (3D CAD)
    - Graphic V15 (aircraft design)


Programs to evaluate Processor Performance

Benchmarks

Performance & Measuring

• The execution time of a program is the main measure of computer performance.

• Machine X is n% faster than machine Y if:

$$\frac{\text{execution time}_Y}{\text{execution time}_X} = 1 + \frac{n}{100}$$

$$n = 100 \times \frac{\text{performance}_X - \text{performance}_Y}{\text{performance}_Y}$$


Example:

If machine X executes a program in 10 seconds and machine Y executes the same program in 15 seconds, then machine X is 50% faster than machine Y:

$$\frac{\text{execution time}_Y}{\text{execution time}_X} = \frac{15}{10} = 1.5 = 1 + \frac{n}{100} \quad\Rightarrow\quad n = 50$$
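The "n% faster" definition can be checked numerically. The following is a minimal sketch; the function name `percent_faster` is an illustrative choice, not something defined in the slides:

```python
def percent_faster(time_x, time_y):
    """Return n such that machine X is n% faster than machine Y.

    Rearranges time_y / time_x = 1 + n / 100 for n.
    """
    return 100 * (time_y / time_x - 1)


# The example from the slide: X takes 10 s, Y takes 15 s.
print(percent_faster(10, 15))
```

A negative result would mean that X is slower than Y.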


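Performance & Measuring (Amdahl’s Law)

• Speedup

$$\text{Speedup} = \frac{\text{original execution time}}{\text{execution time after the improvement}}$$

$$\text{Speedup} = \frac{T}{T_s + \dfrac{T - T_s}{P}}$$

T = the total execution time of the program

T_s = the sequential time of the program

P = the degree of parallelism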

$$\text{Maximum speedup} = \frac{T}{T_s}$$

Example: Assume that the processing power of one part of a system has been increased 10 times, but this part accounts for just 40% of the total execution time. What is the speedup?

$$\text{Speedup} = \frac{1}{0.6 + \dfrac{0.4}{10}} = 1.56$$
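Amdahl's Law as used in this example can be sketched in a few lines. The function and parameter names below are illustrative assumptions; the formula is the standard one, with the enhanced fraction expressed as a share of the original execution time:

```python
def amdahl_speedup(enhanced_fraction, enhancement_factor):
    """Overall speedup when a fraction of the run time is sped up by a given factor."""
    unchanged = 1.0 - enhanced_fraction
    return 1.0 / (unchanged + enhanced_fraction / enhancement_factor)


def max_speedup(enhanced_fraction):
    """Upper bound on speedup as the enhancement factor grows without limit."""
    return 1.0 / (1.0 - enhanced_fraction)


# The example from the slide: 40% of the time is made 10 times faster.
print(round(amdahl_speedup(0.4, 10), 2))

# Even an infinitely fast enhancement cannot beat this bound.
print(round(max_speedup(0.4), 2))
```

The second call illustrates why the unenhanced part dominates: with 60% of the time untouched, the speedup can never reach 1.67.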


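Example: The processing power of a CPU has been increased 5 times, but the cost of the new CPU has also increased 5 times. CPU time is 50% of the program's execution time, and the CPU accounts for 1/3 of the cost of the whole computer. Is this upgrade reasonable from a cost/performance point of view?

$$\text{Speedup} = \frac{1}{0.5 + \dfrac{0.5}{5}} = 1.67$$

$$\text{Cost of new machine} = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33$$

The new machine costs 2.33 times as much but is only 1.67 times faster, so the upgrade worsens the cost/performance ratio.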
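The cost/performance comparison in this example can be reproduced with a short calculation. This is a sketch under the example's own assumptions; the helper names `speedup` and `relative_cost` are hypothetical:

```python
def speedup(enhanced_fraction, enhancement_factor):
    """Amdahl's Law: overall speedup from enhancing part of the execution time."""
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor)


def relative_cost(component_share, component_cost_factor):
    """Cost of the upgraded machine relative to the original machine."""
    return (1.0 - component_share) + component_share * component_cost_factor


gain = speedup(0.5, 5)            # CPU is 50% of the run time and 5x faster
cost = relative_cost(1 / 3, 5)    # CPU is 1/3 of the machine cost and 5x dearer

print(round(gain, 2), round(cost, 2))
# The upgrade only pays off if performance grows at least as fast as cost.
print("worth it" if gain >= cost else "not worth it")
```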


Time in Linux

Example:

time output:  90.7   12.9   2:39   65%

• 90.7: user CPU time (seconds)
• 12.9: system CPU time (seconds)
• 2:39: elapsed execution time (minutes:seconds)
• 65%: CPU time / execution time

CPU time = (clock cycles of the program) / (clock rate)

CPU time = (clock cycles of the program) × (clock period)

CPU time = CPI (clock cycles per instruction) × (number of instructions) × (clock period)
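The relationships above can be checked with a small sketch. The first helper reproduces the 65% figure from the sample `time` output; the instruction count, CPI, and clock rate passed to the second helper are hypothetical values chosen only to illustrate the formula:

```python
def cpu_utilization(user_s, system_s, elapsed_s):
    """Fraction of the elapsed time that was spent as CPU time (user + system)."""
    return (user_s + system_s) / elapsed_s


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions x cycles per instruction x clock period."""
    return instruction_count * cpi / clock_rate_hz


elapsed = 2 * 60 + 39  # 2:39 expressed in seconds
print(round(100 * cpu_utilization(90.7, 12.9, elapsed)))  # percentage, as reported by time

# Hypothetical program: 1e9 instructions, CPI of 2, 500 MHz clock.
print(cpu_time(1e9, 2, 500e6))
```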


• RISC processor: CPI (small), number of instructions (large), clock period (small)

• CISC processor: CPI (large), number of instructions (small), clock period (large)

• MIPS = number of instructions / (execution time × 10^6)

• MIPS = clock rate / (CPI × 10^6)


MIPS is not a good metric for performance. For example, the i860 at 50 MHz is rated at 100 MFLOPS and 150 MOPS, while the R3000 (a MIPS-family processor) is rated at 16 MFLOPS and 33 MOPS, yet the R3000 executes the SPEC programs 15% faster than the i860.

EXAMPLE (showing that MIPS is not a good metric for performance evaluation): A computer has 3 types of instructions with different CPI values:

Instruction type    CPI
A                   1
B                   2
C                   3


The compiler designer has 2 choices for translating a high-level language function (instruction counts per type):

            A    B    C
Choice 1    2    1    2
Choice 2    4    1    1

What is the CPI of each choice?

CPI1 = (2×1 + 1×2 + 2×3) / 5 = 10 / 5 = 2

CPI2 = (4×1 + 1×2 + 1×3) / 6 = 9 / 6 = 1.5

Now the compiler designer has two choices for translating a program (instruction counts in millions):

            A     B    C
Choice 1    5     1    1
Choice 2    10    1    1

What are the MIPS rate and execution time of each sequence?


Example: To show how adding a new instruction to a computer affects the performance of the system.

The instruction mix of a computer is as follows:

Operation    Probability    CPI
ALU          0.43           1
Load         0.21           2
Store        0.12           2
Branch       0.24           2

Assume that 25% of ALU operations use a loaded operand only once, that is, the operand is not used again by later instructions. Now we want to add a new register-memory (REG/MEM) ADD instruction that takes two cycles to execute. This change causes branch instructions to take 3 cycles. Is the new machine faster than the old one?


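$$\text{CPI}_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57$$

$$\text{CPI}_{new} = \frac{(0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 + (0.25 \times 0.43) \times 2}{1 - 0.25 \times 0.43} = 1.908$$

Old CPU time = 1.57 × Old_Instruction_Count × clock period

New CPU time = (0.893 × Old_Instruction_Count) × 1.908 × clock period = 1.70 × Old_Instruction_Count × clock period

Therefore the old machine is faster.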
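The comparison between the old and new instruction sets can be recomputed directly from the instruction mix. In this sketch the dictionary layout and the name `average_cpi` are illustrative assumptions; the frequencies and cycle counts are the ones given in the example:

```python
def average_cpi(mix):
    """Average CPI of a mix given as {name: (frequency, cycles)}.

    Frequencies are renormalized, so they need not sum to 1.
    """
    total_freq = sum(freq for freq, _ in mix.values())
    total_cycles = sum(freq * cycles for freq, cycles in mix.values())
    return total_cycles / total_freq


old_mix = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}

# 25% of ALU operations, each paired with one load, become one reg-mem ADD.
replaced = 0.25 * 0.43
new_mix = {
    "ALU": (0.43 - replaced, 1),
    "Load": (0.21 - replaced, 2),
    "Store": (0.12, 2),
    "Branch": (0.24, 3),          # branches now take 3 cycles
    "Reg-mem ADD": (replaced, 2),
}

cpi_old = average_cpi(old_mix)
cpi_new = average_cpi(new_mix)
instruction_ratio = sum(freq for freq, _ in new_mix.values())  # new count / old count

# Ratio of new to old CPU time at an unchanged clock period.
time_ratio = instruction_ratio * cpi_new / cpi_old
print(round(cpi_old, 2), round(cpi_new, 3), round(time_ratio, 3))
```

A `time_ratio` above 1 means the new machine needs more time for the same program, which matches the conclusion on the slide.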


Airplane    Passengers    Speed (mph)    Throughput (passengers × mph)
Boeing      470           610            286,700
Concorde    132           1350           178,200

Relative MIPS = (execution time of reference / execution time of x) × MIPS of reference

Weighted megaflops (weight per operation):

ADD, SUB, MUL    1
DIV, SQR         4
EXP, SIN         8

Benchmarks (Dhrystone), in KD/s:

VAX 11/780    1.7
Intel i860    72.5
Sun 4         16.8
MC 68040      40
Cray X-MP     18.5
VAX 8600      6.4


Benchmark (Whetstone): a FORTRAN floating-point program

Machine       KW/s
DEC 11/780    1.15
IBM 4321      2

TP1 (a database benchmark):

VAX 9000    70 TPS
Sequent     140 TPS

The performance of intelligent computers is measured in KLIPS (kilo logic inferences per second):

400 KLIPS ≈ 40 MIPS




Example:

Two programs A and B are run on two machines X and Y. Which machine is faster?

                   Execution time     Normalized to X    Normalized to Y
                   on X      on Y     X       Y          X       Y
Program A          1         10       1       10         0.1     1
Program B          1000      100      1       0.1        10      1
Arithmetic mean    500.5     55       1       5.05       5.05    1
Geometric mean     31.6      31.6     1       1          1       1
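The table can be regenerated with a few lines of code, which also makes the point of the example visible: the arithmetic mean of normalized times depends on the reference machine, while the geometric mean does not. The data layout and function names below are illustrative assumptions:

```python
import math

# Execution times from the example, indexed by program and then machine.
times = {"A": {"X": 1, "Y": 10}, "B": {"X": 1000, "Y": 100}}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`, per program."""
    return [times[p][machine] / times[p][reference] for p in times]


for reference in ("X", "Y"):
    for machine in ("X", "Y"):
        ratios = normalized(machine, reference)
        print(
            f"{machine} normalized to {reference}: "
            f"arithmetic {arithmetic_mean(ratios):.2f}, "
            f"geometric {geometric_mean(ratios):.2f}"
        )
```

With the arithmetic mean, each machine appears about five times slower than the other depending on which one is used as the reference; the geometric mean rates them as equal in both cases.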


Example:

             A       B      C
Program 1    1       10     20
Program 2    1000    100    20
Total        1001    110    40

$$\text{Weight}_i = \frac{1}{T_i \times \sum_{j=1}^{n} \dfrac{1}{T_j}}$$

Weight_i is the weight for program i on machine A or B, for n programs; it normalizes the execution times.
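The weight formula can be applied to the table as follows. This is a sketch that assumes, as one possible reading of the slide, that the weights are derived from the times on machine A and then used to compute a weighted arithmetic mean on every machine; the function names are illustrative:

```python
def equal_time_weights(reference_times):
    """Weights w_i = 1 / (T_i * sum_j 1/T_j), computed on one reference machine.

    Each program then contributes equally to the weighted time on that machine.
    """
    inverse_sum = sum(1 / t for t in reference_times)
    return [1 / (t * inverse_sum) for t in reference_times]


def weighted_mean(times, weights):
    return sum(w * t for w, t in zip(weights, times))


machine_times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

weights = equal_time_weights(machine_times["A"])
print([round(w, 3) for w in weights])

for name, times in machine_times.items():
    print(name, round(weighted_mean(times, weights), 2))
```

The weights always sum to 1, and choosing a different reference machine generally changes the ranking, which is the weakness of this kind of weighting.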


MIPS = clock rate / (CPI × 10^6)

CPI1 = 1.43, CPI2 = 1.25

MIPS1 = 69.4, MIPS2 = 80, so MIPS2 > MIPS1

CPU time = number of instructions × CPI / clock rate

CPU time1 = 0.1 s, CPU time2 = 0.15 s

The code produced by compiler choice 1 runs faster than the code produced by choice 2, even though it has the lower MIPS rating. So MIPS cannot be a good metric for performance evaluation.
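The figures above can be recomputed from the instruction counts of the two compiler choices (5, 1, 1 and 10, 1, 1 million instructions of types A, B, C with CPI 1, 2, 3). The clock rate is not stated in the surviving text; the sketch below assumes 100 MHz because that value reproduces the quoted CPU times. The function name `evaluate` is hypothetical:

```python
CLOCK_RATE_HZ = 100e6  # assumption: chosen to match the quoted CPU times
CPI_BY_TYPE = {"A": 1, "B": 2, "C": 3}


def evaluate(counts_in_millions):
    """Return (average CPI, MIPS rating, CPU time in seconds) for one code sequence."""
    instructions = sum(counts_in_millions.values()) * 1e6
    cycles = sum(n * 1e6 * CPI_BY_TYPE[t] for t, n in counts_in_millions.items())
    cpu_time = cycles / CLOCK_RATE_HZ
    mips = instructions / (cpu_time * 1e6)
    return cycles / instructions, mips, cpu_time


choice1 = evaluate({"A": 5, "B": 1, "C": 1})
choice2 = evaluate({"A": 10, "B": 1, "C": 1})

for label, (cpi, mips, seconds) in (("choice 1", choice1), ("choice 2", choice2)):
    print(f"{label}: CPI {cpi:.2f}, {mips:.1f} MIPS, {seconds:.2f} s")
```

With unrounded CPI values the first sequence comes out at 70 MIPS rather than 69.4; the conclusion is unchanged, since the sequence with the higher MIPS rating still takes longer to run.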


Parallel Computer Models

Why Parallel Processing?

[Figure: processor performance (KIPS to TIPS) versus calendar year (1980 to 2010), with data points for the 80286, 68000, 80386, 80486, 68040, Pentium, Pentium II, and R10000 along a trend of about 1.6× per year.]

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore’s Law, shown over the past two decades (extrapolated).


• The Semiconductor Technology Roadmap

Calendar year        2001    2004    2007    2010    2013    2016
Halfpitch (nm)       140     90      65      45      32      22
Clock freq. (GHz)    2       4       7       12      20      30
Wiring levels        7       8       9       10      10      10
Power supply (V)     1.1     1.0     0.8     0.7     0.6     0.5
Max. power (W)       130     160     190     220     250     290

From the 2001 edition of the roadmap [Alla02]

Factors contributing to the validity of Moore’s law
  Denser circuits; architectural improvements

Measures of processor performance
  Instructions/second (MIPS, GIPS, TIPS, PIPS)
  Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
  Running time on benchmark suites

[Figure: processor performance versus calendar year, repeated from Fig. 1.1.]


• Why High-Performance Computing?

Higher speed (solve problems faster)
  Important when there are “hard” or “soft” deadlines; e.g., 24-hour weather forecast

Higher throughput (solve more problems)
  Important when there are many similar tasks to perform; e.g., transaction processing

Higher computational power (solve larger problems)
  e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy

Categories of supercomputers
  Uniprocessor; aka vector machine
  Multiprocessor; centralized or distributed shared memory
  Multicomputer; communicating via message passing
  Massively parallel processor (MPP; 1K or more processors)




• The Speed-of-Light Argument

The speed of light is about 30 cm/ns.

Signals travel at a fraction of speed of light (say, 1/3).

If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.
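The bound in this argument is simple arithmetic, sketched below. The function name and the default signal fraction of 1/3 mirror the assumptions stated in the text; nothing here models a real processor:

```python
SPEED_OF_LIGHT_CM_PER_NS = 30.0


def max_instruction_rate_gips(distance_cm, signal_fraction=1 / 3):
    """Upper bound on instructions per nanosecond (GIPS) if every instruction
    requires a signal to travel `distance_cm` at a fraction of light speed."""
    signal_speed = SPEED_OF_LIGHT_CM_PER_NS * signal_fraction  # cm per ns
    min_time_ns = distance_cm / signal_speed
    return 1.0 / min_time_ns


# 1 cm per instruction at one third of light speed: about 10 GIPS.
print(round(max_instruction_rate_gips(1.0), 2))

# Shrinking the distance to 1 mm raises the bound tenfold.
print(round(max_instruction_rate_gips(0.1), 2))
```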

This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.

How does parallel processing help? Wouldn’t multiple processors need to communicate via signals as well?


• The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L (LLNL, California)
   Material science, nuclear stockpile sim
   32,768 proc’s, 8 TB, 28 TB disk storage
   Linux + custom OS
   71 TFLOPS, $100 M
   Dual-proc Power-PC chips (10-15 W power)
   Full system: 130k-proc, 360 TFLOPS (est)

2. SGI Columbia (NASA Ames, California)
   Aerospace/space sim, climate research
   Linux
   52 TFLOPS, $50 M
   20x Altix (512 Itanium2) linked by Infiniband

3. NEC Earth Sim (Earth Sim Ctr, Yokohama)
   Atmospheric, oceanic, and earth sciences
   Unix
   36 TFLOPS*, $400 M?
   Built of custom vector microprocessors
   Volume = 50x IBM, Power = 14x IBM


• Supercomputer Performance Growth

[Figure: supercomputer performance (MFLOPS to PFLOPS) versus calendar year (1980 to 2010), with curves for vector supers (Cray X-MP, Y-MP), MPPs at $30M and $240M (CM-2, CM-5), ASCI goals, and micros (80386, 80860, Alpha).]

The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).


One Reason for Sublinear Speedup: Communication Overhead

[Figure: solution time, split into computation and communication, versus number of processors; and actual speedup compared with ideal speedup.]

Trade-off between communication time and computation time in the data-parallel realization

Another Reason for Sublinear Speedup: Input/Output Overhead

[Figure: solution time, split into computation and I/O time, versus number of processors; and actual speedup compared with ideal speedup.]

Effect of a constant I/O time on the data-parallel realization


Trends in High-Technology Development

[Figure: timelines from 1960 to 2000 for graphics, networking, RISC, and parallelism, each showing phases of government research (GovRes), industrial research (IndRes), industrial development (IndDev), and $1B business, with transfer of ideas/people between phases.]

Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).

Trends in Hi-Tech Development (2003)


Status of Computing Power (circa 2000)

GFLOPS on desktop: Apple Macintosh, with G4 processor

TFLOPS in supercomputer center:
  1152-processor IBM RS/6000 SP (switch-based network)
  Cray T3E, torus-connected

PFLOPS on drawing board: 1M-processor IBM Blue Gene (2005?)
  32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers
  Processor: 8 threads, on-chip memory, no data cache
  Chip: defect-tolerant, row/column rings in a 6 × 6 array
  Board: 8 × 8 chip grid organized as 4 × 4 × 4 cube
  Tower: Boards linked to 4 neighbors in adjacent towers
  System: 32 × 32 × 32 cube of chips, 1.5 MW (water-cooled)
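The Blue Gene packaging numbers are mutually consistent, which a short check makes explicit. The constant names are illustrative; the values are the ones listed in the slide:

```python
PROCS_PER_CHIP = 32
CHIPS_PER_BOARD = 64
BOARDS_PER_TOWER = 8
TOWERS = 64

chips = CHIPS_PER_BOARD * BOARDS_PER_TOWER * TOWERS
processors = PROCS_PER_CHIP * chips

# The chip count matches the 32 x 32 x 32 cube of chips.
print(chips, chips == 32 ** 3)

# The processor count matches the "1M-processor" label (2**20).
print(processors, processors == 2 ** 20)
```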


• Parallel Processing on Single-Processor Computers

• 1- Using multiple functional units

• 2- Parallelism and pipelining inside the CPU

• 3- Overlapping I/O and CPU operations

• 4- Balancing the bandwidth of subsystems

• 4-1- Bandwidth of CPU (high)

• 4-2- Bandwidth of memory (lower)

• 4-3- Bandwidth of I/O (very low)

• 5- Memory hierarchy

• 5-1- Register memory 5-2- Cache memory

• 5-3- Main memory 5-4- Secondary memory

• 6- Multiprogramming and time sharing


Types of Parallelism: A Taxonomy

Flynn’s categories:

                             Single data stream        Multiple data streams
Single instr stream          SISD (uniprocessors)      SIMD (array or vector processors)
Multiple instr streams       MISD (rarely used)        MIMD (multiproc’s or multicomputers)

Johnson’s expansion of MIMD:

                       Shared variables                         Message passing
Global memory          GMSV (shared-memory multiprocessors)     GMMP (rarely used)
Distributed memory     DMSV (distributed shared memory)         DMMP (distrib-memory multicomputers)

The Flynn-Johnson classification of computer systems.
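The Flynn-Johnson classification amounts to a lookup on four yes/no properties. The sketch below encodes it directly; the function names and argument conventions are illustrative assumptions, not part of the taxonomy itself:

```python
def flynn_class(instruction_streams, data_streams):
    """Flynn category from the number of instruction and data streams."""
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


def johnson_class(global_memory, shared_variables):
    """Johnson's subdivision of MIMD by memory organization and communication."""
    memory = "GM" if global_memory else "DM"
    communication = "SV" if shared_variables else "MP"
    return memory + communication


print(flynn_class(1, 1))      # a uniprocessor
print(flynn_class(1, 64))     # an array processor with 64 data streams
print(johnson_class(global_memory=True, shared_variables=True))    # shared-memory multiprocessor
print(johnson_class(global_memory=False, shared_variables=False))  # distributed-memory multicomputer
```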


Parallel Computer Models

Flynn’s classification of computer architectures.

• Flynn’s classification of computer architectures (Contd)


[Figure: the Flynn-Johnson classification drawn along two axes, control stream(s) (single or multiple) and data stream(s) (single or multiple): SISD “Uniprocessor”, SIMD “Array processor”, MISD (rarely used), and MIMD. MIMD is subdivided by memory (global or distributed) and by communication/synchronization (shared variables or message passing) into GMSV “Shared-memory multiprocessor”, GMMP, DMSV “Distributed shared memory”, and DMMP “Distrib-memory multicomputer”. Two distinctions are highlighted: SIMD versus MIMD, and global versus distributed memory.]


SIMD versus MIMD Architectures

Most early parallel machines had SIMD designs
  Attractive to have skeleton processors (PEs)
  Eventually, many processors per chip
  High development cost for custom chips, high cost
  MSIMD and SPMD variants

Most modern parallel machines have MIMD designs
  COTS components (CPU chips and switches)
  MPP: Massively or moderately parallel?
  Tightly coupled versus loosely coupled
  Explicit message passing versus shared memory

Network-based NOWs and COWs
  Networks/Clusters of workstations

Grid computing
  Vision: Plug into wall outlets for computing power

SIMD timeline (1960 to 2010): ILLIAC IV, TMC CM-2, Goodyear MPP, DAP, MasPar MP-1, Clearspeed array coproc


Global versus Distributed Memory

[Figure: processors 0 to p-1 connected through a processor-to-memory network to memory modules 0 to m-1, with a processor-to-processor network and parallel I/O.]

Options for the processor-to-memory network: crossbar (expensive), bus(es) (bottleneck), MIN (complex)

Fig. 4.3 A parallel processor with global memory.


Removing the Processor-to-Memory Bottleneck

[Figure: processors 0 to p-1, each with a cache, connected through a processor-to-memory network to memory modules 0 to m-1, with a processor-to-processor network and parallel I/O.]

Challenge: cache coherence

A parallel processor with global memory and processor caches.


Distributed Shared Memory

[Figure: processors 0 to p-1, each with its own memory, connected by an interconnection network, with parallel I/O.]

Some Terminology:

NUMA: Nonuniform memory access (distributed shared memory)

UMA: Uniform memory access (global shared memory)

COMA: Cache-only memory arch


Interconnection Networks

• Fully connected

• Hypercube

• Mesh

• Ring

• Cube

• Star

• ...


• Multiprocessors and Multicomputers

• Shared-Memory Multiprocessors

• The UMA (Uniform Memory Access) Model

In a UMA multiprocessor model, the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access.

Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.


Symmetric Multiprocessor

When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In such a case, all processors are equally capable of running the executive programs.

In an asymmetric multiprocessor, only one processor or a subset of the processors is executive-capable.

The remaining processors, which have no I/O capability, are called attached processors.


The NUMA (Non-Uniform Memory Access) Model

A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.


Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: the fastest is local memory access, the next is global memory access, and the slowest is access to remote memory, as illustrated in the figure.


The COMA Model

A multiprocessor using cache-only memory assumes the COMA model. This model is depicted in the following figure.

The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node.

Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories.


Representative Multiprocessors

Several commercially available multiprocessors are summarized in the following table:


Distributed-Memory Multicomputers

A distributed-memory multicomputer system is modeled in the following figure. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.

The message-passing network provides point-to-point static connections among the nodes. All local memories are private and accessible only by local processors. For this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines.


Representative Multicomputers

Three message-passing multicomputers are summarized in the following table. With distributed processor/memory nodes, these machines are better at achieving scalable performance. However, message passing imposes a hardship on programmers, who must distribute the computations and data sets over the nodes and establish communication among nodes.


A Taxonomy of MIMD Computers

Parallel computers appear as either SIMD or MIMD configurations. The SIMDs appeal more to special-purpose applications. It is clear that SIMDs are not size-scalable, but unclear whether large SIMDs are generation-scalable. The fact that the CM-5 has an MIMD architecture, moving away from the SIMD architecture of the CM-2, may shed some light on the architectural trend. Furthermore, the boundary between multiprocessors and multicomputers has become blurred in recent years; eventually, the distinction may vanish.


Multivector and SIMD Computers

Here we introduce supercomputers and parallel processors for vector processing and data parallelism. We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

Vector Supercomputers

A vector computer is often built on top of a scalar processor. As shown in the following figure, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation, it will be directly executed by the scalar processor using the scalar functional pipelines.


Representative Supercomputers

Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers.


SIMD Supercomputers

Recall that the abstract model of SIMD computers has a single instruction stream operating over multiple data streams. An operational model of SIMD computers is presented in the following figure.


SIMD Machine Model

An operational model of an SIMD computer is specified by a 5-tuple:

$$M = \{N, C, I, M, R\}$$

where

(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs and the Connection Machine CM-2 uses 65,536 PEs.

(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.

(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.

(4) M is the set of masking schemes, where each mask partitions the set of the PEs into enabled and disabled subsets.

(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

One can describe a particular SIMD machine architecture by specifying the 5-tuple.
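To make the 5-tuple concrete, here is a toy sketch of how its components might be represented and exercised. Everything in it is hypothetical: the class `SIMDMachine`, the instruction names, the "even" mask, and the ring-shift routing function are illustrations of the model, not a description of any real machine:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SIMDMachine:
    n: int                       # N: number of processing elements
    cu_instructions: frozenset   # C: instructions executed by the control unit
    pe_instructions: dict        # I: name -> operation applied to each PE's local data
    masks: dict                  # M: name -> predicate deciding whether a PE is enabled
    routes: dict                 # R: name -> function (source PE, n) -> destination PE

    def broadcast(self, op, mask, data):
        """Apply a broadcast instruction on every PE enabled by the mask."""
        operation = self.pe_instructions[op]
        enabled = self.masks[mask]
        return [operation(x) if enabled(pe) else x for pe, x in enumerate(data)]

    def route(self, name, data):
        """Move each PE's value to the PE chosen by a data-routing function."""
        destination = self.routes[name]
        result = [None] * self.n
        for pe, x in enumerate(data):
            result[destination(pe, self.n)] = x
        return result


toy = SIMDMachine(
    n=8,
    cu_instructions=frozenset({"branch", "halt"}),
    pe_instructions={"inc": lambda x: x + 1},
    masks={"all": lambda pe: True, "even": lambda pe: pe % 2 == 0},
    routes={"shift_right": lambda pe, n: (pe + 1) % n},
)

data = list(range(8))
print(toy.broadcast("inc", "even", data))  # only even-numbered PEs increment
print(toy.route("shift_right", data))      # each value moves to the next PE on a ring
```

The mask is what distinguishes an SIMD broadcast from a plain map: disabled PEs keep their data unchanged while the enabled ones all execute the same instruction.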


Representative SIMD Computers

Three SIMD supercomputers are summarized in the following table. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and the DAP610 are fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs.


Architectural Development Tracks

The architectures of most existing computers follow certain development tracks. Understanding the features of the various tracks provides insights for new architectural development. We look into six tracks to be studied in later chapters. These tracks are distinguished by similarity in computational models and technological bases.

Multiple-Processor Tracks

Generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer.

Message-Passing Track

The Cosmic Cube pioneered the development of message-passing multicomputers.


Shared-Memory Track

The figure shows a track of multiprocessor development employing a single address space in the entire system.


Multivector Track

These are traditional vector supercomputers. The CDC7600 was the first vector dual-processor system.


SIMD Track

The Illiac IV pioneered the construction of SIMD computers, although the array processor concept can be traced back far earlier, to the 1960s.


Multithreaded and Dataflow Tracks

These are two research tracks that have been mainly experimented with in laboratories.

Multithreading Track

The multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which extended the concept of scoreboarding of multiple functional units in the CDC6400.


The Dataflow Track

The key idea is to use a dataflow mechanism, instead of a control-flow mechanism as in von Neumann machines, to direct the program flow. Fine-grain, instruction-level parallelism is exploited in dataflow computers.