
Advanced Computer Architecture: Parallelism, Scalability, Programmability. Dr Mahmoud Fathy...

Date post: 11-Jan-2016
ADVANCED COMPUTER ARCHITECTURE: Parallelism, Scalability, Programmability. Dr Mahmoud Fathy, mahfathy@iust.ac.ir, Iran University of Science and Technology, Computer Faculty
Transcript
Page 1: Advanced Computer Architecture: Parallelism, Scalability, Programmability. Dr Mahmoud Fathy, mahfathy@iust.ac.ir, Iran University of Science and Technology.

ADVANCED COMPUTER ARCHITECTURE

Parallelism, Scalability, Programmability

Dr Mahmoud Fathy
mahfathy@iust.ac.ir

Iran University of Science and Technology
Computer Faculty

Page 2:

Advanced Computer Architecture Dr Fathy


What is Computer Architecture

Page 3:

Forces on Computer Architecture

Page 4:

A Take on Moore’s Law

Page 5:

A Take on Moore’s Law

• Moore’s Law (1965): the number of transistors per square inch doubled every year

• Reality: the number of transistors per square inch doubled every 18 months

• CPU speed increases 54% per year

• DRAM capacity increases 80% per year (quadrupled every 3 years)

Year   Technology           Relative Performance
1951   Vacuum tube          1
1965   Transistor           35
1975   Integrated circuit   900
1995   VLSI                 2,400,000

Page 6:

Processor Performance

Page 7:

Clever Architecture Design

Page 8:

Processor – Memory Performance Gap

Page 9:

Technology Trend vs. Power Dissipation

Page 10:

“Hot” Computer

Importance of Low Power Processor Design

Page 11:

Computer Food Chain

Page 12:

Computer Engineering Methodology

Page 13:

Measurement and Evaluation

Page 14:

Measurement and Evaluation (contd)

Three components in computer architecture evaluation

- Simulators

- Benchmarks

- Evaluation Metrics (Performance, Cost, Power)

Page 15:

A Computer Architecture Simulator

Page 16:

A taxonomy of Simulator Tools

Page 17:

Functional vs. Performance Simulators

Page 18:

Execution-Driven vs. Trace-Driven Simulation

Page 19:

Computer Performance

• History of Computer Performance
• Execution time of a single instruction (such as addition)
• Instruction mix
• MIPS
• Mflops (with the introduction of supercomputers, 1970-1980)
• Real programs (difficult to run across different operating systems)
• Toy programs
• 1988: SPEC (System Performance Evaluation Cooperative) was established by Sun, MIPS, DEC & Apollo

Page 20:

SPEC History

• SPEC History
• SPEC 89: CPU intensive (6 floating-point + 4 integer programs)
• SPEC 92 (SPECint, SPECfp), deleting programs such as matrix300 from SPEC 89
• SPEC 95
• SPEC 2000 (11 integer: CINT2000; 14 floating-point: CFP2000)
• SPECviewperf (3D rendering)
• SPECapc
  » Pro/Engineer
  » SolidWorks (3D CAD)
  » Graphic V15 (aircraft design)

Page 21:

Programs to evaluate Processor Performance

Page 22:

Benchmarks

Page 23:

Performance & Measuring

• The execution time of a program is the main measure of computer performance

• Machine X is n% faster than machine Y if:

$$\frac{\text{Execution time of } Y}{\text{Execution time of } X} = 1 + \frac{n}{100}$$

$$n = 100 \times \frac{\text{Performance}_X - \text{Performance}_Y}{\text{Performance}_Y}$$

Page 24:

Performance & Measuring

Example:

If machine X executes a program in 10 seconds and machine Y executes the same program in 15 seconds, then machine X is 50% faster than machine Y:

$$\frac{\text{Execution time of } Y}{\text{Execution time of } X} = \frac{15}{10} = 1.5 = 1 + \frac{n}{100} \quad\Rightarrow\quad n = 50$$
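The "n% faster" relation on this slide can be checked with a few lines of Python. This is only a sketch; the function name `percent_faster` is an illustrative choice, not something from the slides.

```python
def percent_faster(time_x: float, time_y: float) -> float:
    """Return n such that machine X is n% faster than machine Y.

    Uses time_y / time_x = 1 + n / 100, solved for n.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return 100.0 * (time_y / time_x - 1.0)


# The slide's example: X needs 10 s, Y needs 15 s.
n = percent_faster(time_x=10.0, time_y=15.0)
print(f"X is {n:.0f}% faster than Y")
```

Note that the relation is not symmetric: swapping the two machines gives a negative n of different magnitude, because the baseline changes.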

Page 25:

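Performance & Measuring (Amdahl’s Law)

• Speedup

$$\text{Speedup} = \frac{\text{Original execution time}}{\text{Execution time after improvement}}$$

$$\text{Speedup} = \frac{T}{T_s + \dfrac{T - T_s}{P}}$$

T = the total execution time of the program before the improvement

T_s = the sequential time of the program

P = the degree of parallelism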

Page 26:

Performance & Measuring (Amdahl’s Law)

$$\text{Maximum speedup} = \frac{T}{T_s}$$

Example: Assume that the processing power of one part of a system has been increased 10 times, but this part accounts for just 40% of the total execution time. What is the speedup?

$$\text{Speedup} = \frac{1}{0.6 + \dfrac{0.4}{10}} = 1.56$$
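As a quick check of the example above, Amdahl's Law can be written as a small function. This is a sketch under the usual formulation: `fraction_enhanced` is the share of execution time that benefits and `enhancement` is the factor by which that share is sped up. Both names are illustrative assumptions.

```python
import math


def amdahl_speedup(fraction_enhanced: float, enhancement: float) -> float:
    """Overall speedup when a fraction of the run time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if enhancement <= 0:
        raise ValueError("enhancement must be positive")
    unaffected = 1.0 - fraction_enhanced
    return 1.0 / (unaffected + fraction_enhanced / enhancement)


def max_speedup(fraction_enhanced: float) -> float:
    """Upper bound as the enhancement grows without limit."""
    if fraction_enhanced >= 1.0:
        return math.inf
    return 1.0 / (1.0 - fraction_enhanced)


# The slide's example: 40% of the time is made 10 times faster.
print(round(amdahl_speedup(0.4, 10.0), 2))  # 1.56
print(round(max_speedup(0.4), 2))           # 1.67
```

The second value is the ceiling set by the 60% of the program that is not improved, no matter how large the enhancement becomes.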

Page 27:

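Performance & Measuring (Amdahl’s Law)

Example: The processing power of a CPU has been increased 5 times, but the cost of the new CPU has also increased 5 times. The CPU accounts for 50% of the program's execution time, and the CPU cost is 1/3 of the whole computer cost. Is this upgrade reasonable from a cost/performance point of view?

$$\text{Speedup} = \frac{1}{0.5 + \dfrac{0.5}{5}} = 1.67$$

$$\text{Cost of new machine} = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33$$

The cost grows by a factor of 2.33 while performance grows by only 1.67, so the upgrade is not reasonable in terms of cost/performance.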
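The cost/performance comparison in this example can be sketched as follows. The function names and the simple rule "worthwhile only if speedup exceeds the cost factor" are assumptions made for illustration.

```python
def overall_speedup(cpu_time_share: float, cpu_speedup: float) -> float:
    """Amdahl-style speedup when only the CPU-bound share gets faster."""
    return 1.0 / ((1.0 - cpu_time_share) + cpu_time_share / cpu_speedup)


def relative_cost(cpu_cost_share: float, cpu_cost_factor: float) -> float:
    """New machine cost relative to the old one (old cost = 1)."""
    return (1.0 - cpu_cost_share) + cpu_cost_share * cpu_cost_factor


# The slide's numbers: CPU is 5x faster and 5x more expensive,
# CPU time is 50% of the run, CPU cost is 1/3 of the machine.
speedup = overall_speedup(cpu_time_share=0.5, cpu_speedup=5.0)
cost = relative_cost(cpu_cost_share=1.0 / 3.0, cpu_cost_factor=5.0)
worthwhile = speedup > cost

print(f"speedup = {speedup:.2f}, cost factor = {cost:.2f}, worthwhile = {worthwhile}")
```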

Page 28:

Performance & Measuring

Time in Linux

Example:

time output: 90.7 12.9 2:39 65%
- 90.7: user CPU time (seconds)
- 12.9: system CPU time (seconds)
- 2:39: elapsed execution time (minutes:seconds)
- 65%: CPU time / execution time

CPU Time = (Cycles of the Program) / (Clock Rate)

CPU Time = (Cycles of the Program) × (Period of each Clock)

CPU Time = CPI (Clocks Per Instruction) × (Number of Instructions) × (Period of each Clock)
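The two ideas on this slide, CPU utilization from the timing output and the CPU time equation, can be sketched in Python. The interpretation of 2:39 as minutes:seconds and the sample instruction count, CPI, and clock rate in the last call are assumptions for illustration only.

```python
def cpu_utilization(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed time spent on the CPU (user + system)."""
    return (user_s + system_s) / elapsed_s


def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x CPI x clock period."""
    clock_period_s = 1.0 / clock_rate_hz
    return instruction_count * cpi * clock_period_s


# Timing example from the slide: 90.7 s user, 12.9 s system, 2:39 elapsed.
elapsed = 2 * 60 + 39
print(f"{cpu_utilization(90.7, 12.9, elapsed):.0%}")  # 65%

# Hypothetical program: 1 million instructions, CPI of 2, 100 MHz clock.
print(cpu_time(1e6, 2.0, 100e6))
```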

Page 29:

Performance & Measuring

• RISC processor: CPI (small), number of instructions (large), period of each clock (small)

• CISC processor: CPI (large), number of instructions (small), period of each clock (large)

• MIPS = Number of Instructions / (Execution Time × 10^6)

• MIPS = Clock Rate / (CPI × 10^6)

Page 30:

Performance & Measuring

MIPS is not a good metric for performance. For example, the i860 at 50 MHz is rated at 100 MFLOPS and 150 MOPS, while the R3000 (a MIPS-family processor) is rated at 16 MFLOPS and 33 MOPS, yet the R3000 can execute the SPEC programs 15% faster than the i860.

EXAMPLE (showing that MIPS is not a good metric for performance evaluation): A computer has 3 types of instructions with different CPI values.

Instruction Type   CPI
A                  1
B                  2
C                  3

Page 31:

Performance & Measuring

The compiler designer has 2 choices for translating a high-level language function (instruction counts per type):

           A   B   C
Choice 1   2   1   2
Choice 2   4   1   1

What is the CPI of each choice?

CPI1 = (2×1 + 1×2 + 2×3) / 5 = 10 / 5 = 2

CPI2 = (4×1 + 1×2 + 1×3) / 6 = 9 / 6 = 1.5

Now the compiler designer has two choices for translating a program (instruction counts in millions):

           A    B   C
Choice 1   5    1   1
Choice 2   10   1   1

What is the MIPS rate and execution time of each sequence?

Page 32:

Performance & Measuring

Example: To show how adding a new instruction to a computer affects the performance of the system.

The instruction mix of a computer is as follows:

Operation   Probability   CPI
ALU         0.43          1
Load        0.21          2
Store       0.12          2
Branch      0.24          2

Assume that 25% of ALU operations use a loaded operand just once, meaning the operand is not used by any later instruction. Now we want to add a new REG/MEM instruction type, an ADD instruction that needs two cycles to execute. This change causes the branch instruction to take 3 cycles. Is the new machine faster than the old one?

Page 33:

Performance & Measuring

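$$\text{CPI}_{old} = 1 \times 0.43 + 2 \times 0.21 + 2 \times 0.12 + 2 \times 0.24 = 1.57$$

$$\text{CPI}_{new} = \frac{1 \times (0.43 - 0.25 \times 0.43) + 2 \times (0.21 - 0.25 \times 0.43) + 2 \times 0.12 + 3 \times 0.24 + 2 \times (0.25 \times 0.43)}{1 - 0.25 \times 0.43} = 1.908$$

Old CPU time = 1.57 × Old_Instruction_Count × Clock_Cycle_Time

New CPU time = (0.893 × Old_Instruction_Count) × 1.908 × Clock_Cycle_Time = 1.7 × Old_Instruction_Count × Clock_Cycle_Time

Therefore the old machine is faster.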
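The arithmetic of this example is easy to get wrong by hand, so here is a small Python sketch that recomputes it. It assumes, as the example does, that every ALU operation converted to the new REG/MEM ADD removes exactly one load, and that the clock cycle time is unchanged. Variable names are illustrative.

```python
# Instruction mix from the slide: name -> (frequency, cycles per instruction)
old_mix = {
    "ALU": (0.43, 1),
    "Load": (0.21, 2),
    "Store": (0.12, 2),
    "Branch": (0.24, 2),
}

# Old machine: cycles per original instruction.
cpi_old = sum(freq * cycles for freq, cycles in old_mix.values())

# 25% of ALU operations are fused with their load into a 2-cycle REG/MEM ADD.
fused = 0.25 * old_mix["ALU"][0]

# New machine, still measured per ORIGINAL instruction; branches now cost 3 cycles.
new_cycles_per_old_instr = (
    (old_mix["ALU"][0] - fused) * 1    # remaining plain ALU operations
    + (old_mix["Load"][0] - fused) * 2  # loads that were not absorbed
    + fused * 2                         # new REG/MEM ADD instructions
    + old_mix["Store"][0] * 2
    + old_mix["Branch"][0] * 3
)
new_instr_fraction = 1.0 - fused        # fewer instructions are executed
cpi_new = new_cycles_per_old_instr / new_instr_fraction

print(f"CPI old = {cpi_old:.2f}")
print(f"CPI new = {cpi_new:.3f} over {new_instr_fraction:.3f} of the instructions")
print(f"cycles per original instruction: old {cpi_old:.2f}, new {new_cycles_per_old_instr:.2f}")
```

Comparing cycles per original instruction (1.57 against about 1.70) gives the same verdict as the slide: the old machine is faster.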

Page 34:

Performance & Measuring

Plane      Passengers   Speed      Throughput (passengers × mph)
Boeing     470          610 mph    286,700
Concorde   132          1350 mph   178,200

Relative MIPS = (execution time of reference / execution time of X) × MIPS of reference

Weighted Megaflops (operation weights):
ADD, SUB, MUL   1
DIV, SQR        4
EXP, SIN        8

Benchmarks (Dhrystone), in K Dhrystones/s:
VAX 11/780   1.7
Intel i860   72.5
Sun 4        16.8
MC 68040     40
Cray X-MP    18.5
VAX 8600     6.4

Page 35:

Performance & Measuring

Benchmark (Whetstone):

A FORTRAN floating point program

DEC 11/780 1.15 KW/S

IBM 4321 2

TP1 (A Database Benchmark):

VAX9000 70 TPS

Sequent 140 TPS

Benchmarks for intelligent computers are measured in KLIPS (Kilo Logical Inferences Per Second)

400 KLIPS 40 MIPS

Page 36:

Performance & Measuring

                  Execution time    Normalized on X    Normalized on Y
                  X        Y        X        Y         X        Y
Program A         1        10       1        10        0.1      1
Program B         1000     100      1        0.1       10       1
Arithmetic mean   500.5    55       1        5.05      5.05     1
Geometric mean    31.6     31.6     1        1         1        1

Page 37:

Performance & Measuring

Example:

Two programs A & B are run on two machines X & Y. Which machine is faster?

                  Execution time    Normalized on X    Normalized on Y
                  X        Y        X        Y         X        Y
Program A         1        10       1        10        0.1      1
Program B         1000     100      1        0.1       10       1
Arithmetic mean   500.5    55       1        5.05      5.05     1
Geometric mean    31.6     31.6     1        1         1        1
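The table's point is that the arithmetic mean of normalized times depends on which machine is the reference, while the geometric mean does not. A short Python sketch reproduces the rows; the helper names are illustrative.

```python
import math

# Execution times from the table: [program A, program B]
times = {"X": [1.0, 1000.0], "Y": [10.0, 100.0]}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # exp of the mean log avoids overflow for long lists
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized(machine, reference):
    """Times of `machine` divided program by program by those of `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


for ref in ("X", "Y"):
    for machine in ("X", "Y"):
        norm = normalized(machine, ref)
        print(
            f"{machine} normalized on {ref}: "
            f"AM = {arithmetic_mean(norm):.2f}, GM = {geometric_mean(norm):.2f}"
        )
```

With X as the reference, Y looks about five times slower by arithmetic mean; with Y as the reference, X looks about five times slower. The geometric mean reports 1 in both cases.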

Page 38:

Performance & Measuring

Example:

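            A      B     C
Program 1   1      10    20
Program 2   1000   100   20
Total       1001   110   40

$$\text{Weight}_i = \frac{1}{T_i \times \sum_{j=1}^{n} \dfrac{1}{T_j}}$$

Weight_i is the weight of program i on machine A or B for n programs, where T_i is the execution time of program i on that machine; the weights normalize the execution times.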
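To see what these weights do, the sketch below computes them from the machine A column and applies them to all three machines. Choosing machine A as the reference is an assumption made here for illustration; the function names are also illustrative.

```python
# Execution times from the table: [program 1, program 2]
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}


def equal_time_weights(reference_times):
    """Weight_i = 1 / (T_i * sum_j 1/T_j); weights sum to 1."""
    inverse_sum = sum(1.0 / t for t in reference_times)
    return [1.0 / (t * inverse_sum) for t in reference_times]


def weighted_mean(machine_times, weights):
    return sum(w * t for w, t in zip(weights, machine_times))


weights = equal_time_weights(times["A"])
print("weights:", [round(w, 6) for w in weights])
for machine, machine_times in times.items():
    print(machine, round(weighted_mean(machine_times, weights), 2))
```

On the reference machine each program contributes the same weighted time, which is the sense in which the weights normalize the execution times.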

Page 39:

Performance & Measuring

Page 40:

Performance & Measuring

Page 41:

Performance & Measuring

Page 42:

Performance & Measuring

Page 43:

Performance & Measuring

Page 44:

Performance & Measuring

Page 45:

Performance & Measuring

Page 46:

Performance & Measuring

Page 47:

Performance & Measuring

MIPS = Clock Rate / (CPI × 10^6)

CPI1 = 1.43
CPI2 = 1.25
MIPS1 = 70
MIPS2 = 80

MIPS2 > MIPS1

CPU Time = Number of Instructions × CPI / Clock Rate

CPU Time1 = 0.1 s
CPU Time2 = 0.15 s

(These values correspond to a 100 MHz clock.)

The code produced by compiler choice 1 runs faster than the code produced by choice 2, yet it has the lower MIPS rating. So MIPS is not a good metric for performance evaluation.
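The compiler example can be recomputed end to end with a short Python sketch. The slides do not state the clock rate in the text; 100 MHz is assumed here because it reproduces the CPU times of 0.1 s and 0.15 s. The instruction counts (in millions) and per-type CPI values are the ones given earlier in the example.

```python
CPI_BY_TYPE = {"A": 1, "B": 2, "C": 3}
CLOCK_RATE_HZ = 100e6  # assumption: inferred from the CPU times on the slide


def evaluate(counts_in_millions):
    """Return (CPI, MIPS, CPU time in seconds) for one compiled program."""
    instructions = sum(counts_in_millions.values()) * 1e6
    cycles = sum(n * CPI_BY_TYPE[kind] for kind, n in counts_in_millions.items()) * 1e6
    cpi = cycles / instructions
    seconds = cycles / CLOCK_RATE_HZ
    mips = instructions / (seconds * 1e6)
    return cpi, mips, seconds


choice1 = evaluate({"A": 5, "B": 1, "C": 1})
choice2 = evaluate({"A": 10, "B": 1, "C": 1})

for name, (cpi, mips, seconds) in (("choice 1", choice1), ("choice 2", choice2)):
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, time = {seconds:.2f} s")
```

Choice 2 earns the higher MIPS rating only because it executes many more cheap instructions; it still takes longer to finish.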

Page 48:

Parallel Computer Models

Why Parallel Processing?

[Figure: processor performance (KIPS, MIPS, GIPS, TIPS) versus calendar year (1980 to 2010), with data points for the 80286, 68000, 80386, 80486, 68040, Pentium, Pentium II, and R10000; trend line labeled 1.6 / yr.]

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore’s Law, shown over the past two decades (extrapolated).

Page 49:

Parallel Computer Models

• The Semiconductor Technology Roadmap

Calendar year       2001   2004   2007   2010   2013   2016
Halfpitch (nm)      140    90     65     45     32     22
Clock freq. (GHz)   2      4      7      12     20     30
Wiring levels       7      8      9      10     10     10
Power supply (V)    1.1    1.0    0.8    0.7    0.6    0.5
Max. power (W)      130    160    190    220    250    290

From the 2001 edition of the roadmap [Alla02]

Factors contributing to the validity of Moore’s law
- Denser circuits
- Architectural improvements

Measures of processor performance
- Instructions/second (MIPS, GIPS, TIPS, PIPS)
- Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
- Running time on benchmark suites

[Figure: processor performance (KIPS, MIPS, GIPS, TIPS) versus calendar year (1980 to 2010), with data points for the 80286, 68000, 80386, 80486, 68040, Pentium, Pentium II, and R10000; trend line labeled 1.6 / yr.]

Page 50:

Parallel Computer Models

• Why High-Performance Computing?

Higher speed (solve problems faster)
Important when there are “hard” or “soft” deadlines; e.g., 24-hour weather forecast

Higher throughput (solve more problems)
Important when there are many similar tasks to perform; e.g., transaction processing

Higher computational power (solve larger problems)
e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy

Categories of supercomputers
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)

Page 51:

Parallel Computer Models

• The Speed-of-Light Argument

The speed of light is about 30 cm/ns.

Signals travel at a fraction of speed of light (say, 1/3).

If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.

This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.

How does parallel processing help? Wouldn’t multiple processors need to communicate via signals as well?
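The bound in the speed-of-light argument is a one-line calculation, sketched here in Python. The 30 cm/ns figure and the one-third signal fraction come from the slide; the function name and the idea of parameterizing the distance are illustrative.

```python
SPEED_OF_LIGHT_CM_PER_NS = 30.0


def max_instruction_rate_gips(distance_cm: float, signal_fraction: float = 1.0 / 3.0) -> float:
    """Upper bound on instructions per nanosecond (GIPS) when each
    instruction requires a signal to travel `distance_cm`."""
    signal_speed = SPEED_OF_LIGHT_CM_PER_NS * signal_fraction  # cm per ns
    time_per_instruction_ns = distance_cm / signal_speed
    return 1.0 / time_per_instruction_ns


# 1 cm per instruction at one third of the speed of light
print(round(max_instruction_rate_gips(1.0), 2))   # 10.0 GIPS
# Shrinking the distance tenfold raises the bound tenfold
print(round(max_instruction_rate_gips(0.1), 2))   # 100.0 GIPS
```

The second call illustrates the slide's remark that miniaturization eases the limit without removing it.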

Page 52:

Parallel Computer Models

• The Quest for Higher Performance: Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L
- LLNL, California
- Material science, nuclear stockpile sim
- 32,768 proc’s, 8 TB, 28 TB disk storage
- Linux + custom OS
- 71 TFLOPS, $100 M
- Dual-proc Power-PC chips (10-15 W power)
- Full system: 130k-proc, 360 TFLOPS (est)

2. SGI Columbia
- NASA Ames, California
- Aerospace/space sim, climate research
- 10,240 proc’s, 20 TB, 440 TB disk storage
- Linux
- 52 TFLOPS, $50 M
- 20x Altix (512 Itanium2) linked by Infiniband

3. NEC Earth Sim
- Earth Sim Ctr, Yokohama
- Atmospheric, oceanic, and earth sciences
- 5,120 proc’s, 10 TB, 700 TB disk storage
- Unix
- 36 TFLOPS*, $400 M?
- Built of custom vector microprocessors
- Volume = 50x IBM, Power = 14x IBM

Page 53:

Parallel Computer Models

• Supercomputer Performance Growth

[Figure: supercomputer performance (MFLOPS, GFLOPS, TFLOPS, PFLOPS) versus calendar year (1980 to 2010). Labels: vector supers (Cray X-MP, Y-MP), $30M MPPs and $240M MPPs (CM-2, CM-5), ASCI goals, and micros (80386, 80860, Alpha).]

The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).

Page 54:

One Reason for Sublinear Speedup: Communication Overhead

[Figure: two plots against the number of processors. One shows solution time split into computation and communication; the other shows actual speedup against ideal speedup.]

Trade-off between communication time and computation time in the data-parallel realization
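The shape of this trade-off can be illustrated with a toy model. The model below is purely an assumption, not taken from the slides: computation time shrinks as work/p while communication time grows linearly with the number of processors. The parameter values are arbitrary.

```python
def solution_time(p: int, compute_work: float = 1000.0, comm_per_proc: float = 1.0) -> float:
    """Toy model: perfectly divisible computation plus linear communication cost."""
    computation = compute_work / p
    communication = comm_per_proc * (p - 1)  # assumed: grows with processor count
    return computation + communication


def speedup(p: int) -> float:
    return solution_time(1) / solution_time(p)


for p in (1, 10, 32, 100):
    print(f"p = {p:3d}: time = {solution_time(p):7.2f}, speedup = {speedup(p):5.2f} (ideal {p})")
```

Under these assumptions the speedup peaks at a moderate processor count and then falls again, because the growing communication term outweighs the shrinking computation term.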

Page 55:

Another Reason for Sublinear Speedup: Input/Output Overhead

[Figure: two plots against the number of processors. One shows solution time split into computation and I/O time; the other shows actual speedup against ideal speedup.]

Effect of a constant I/O time on the data-parallel realization
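A constant I/O time acts like the sequential part in Amdahl's Law. The sketch below assumes the computation divides perfectly across processors while the I/O time stays fixed; the 100 and 10 time units are arbitrary illustrative values.

```python
def speedup_with_io(p: int, compute_time: float = 100.0, io_time: float = 10.0) -> float:
    """Speedup when computation scales with p but I/O time is constant."""
    serial_total = compute_time + io_time
    parallel_total = compute_time / p + io_time
    return serial_total / parallel_total


for p in (1, 10, 100, 1000):
    print(f"p = {p:4d}: speedup = {speedup_with_io(p):5.2f}")

# The speedup can never exceed (compute_time + io_time) / io_time = 11 here.
```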

Page 56:

Trends in High-Technology Development

[Figure: timeline from 1960 to 2000 for four fields (graphics, networking, RISC, parallelism), showing periods of government research (GovRes), industrial research (IndRes), and industrial development (IndDev), the transfer of ideas/people, and the emergence of $1B businesses.]

Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).

Page 57:

Trends in Hi-Tech Development (2003)

Page 58:

Status of Computing Power (circa 2000)

GFLOPS on desktop: Apple Macintosh, with G4 processor

TFLOPS in supercomputer center:
- 1152-processor IBM RS/6000 SP (switch-based network)
- Cray T3E, torus-connected

PFLOPS on drawing board: 1M-processor IBM Blue Gene (2005?)
- 32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers
- Processor: 8 threads, on-chip memory, no data cache
- Chip: defect-tolerant, row/column rings in a 6 × 6 array
- Board: 8 × 8 chip grid organized as 4 × 4 × 4 cube
- Tower: boards linked to 4 neighbors in adjacent towers
- System: 32 × 32 × 32 cube of chips, 1.5 MW (water-cooled)

Page 59:

Parallel Computer Models

• Parallel Processing on Single-Processor Computers

• 1- Using multiple functional units

• 2- Parallelism and pipelining inside the CPU

• 3- Overlapping I/O and CPU operations

• 4- Balancing the bandwidth of subsystems

• 4-1- Bandwidth of CPU (high)

• 4-2- Bandwidth of memory (lower)

• 4-3- Bandwidth of I/O (very low)

• 5- Memory hierarchy

• 5-1- Register memory 5-2- Cache memory

• 5-3- Main memory 5-4- Secondary memory

• 6- Multiprogramming and time sharing

Page 60:

Types of Parallelism: A Taxonomy

Flynn’s categories:

                          Single data stream    Multiple data streams
Single instr stream       SISD: Uniprocessors   SIMD: Array or vector processors
Multiple instr streams    MISD: Rarely used     MIMD: Multiproc’s or multicomputers

Johnson’s expansion (of MIMD):

                     Shared variables                      Message passing
Global memory        GMSV: Shared-memory multiprocessors   GMMP: Rarely used
Distributed memory   DMSV: Distributed shared memory       DMMP: Distrib-memory multicomputers

The Flynn-Johnson classification of computer systems.

Page 61:

Parallel Computer Models

Flynn’s classification of computer architectures.

Page 62:

Parallel Computer Models

• Flynn’s classification of computer architectures (Contd)

Page 63:

Parallel Computer Models

Control stream(s) versus data stream(s):

                           Single data stream    Multiple data streams
Single control stream      SISD “Uniprocessor”   SIMD “Array processor”
Multiple control streams   MISD (Rarely used)    MIMD

MIMD, by memory versus communication/synchronization:

                 Shared variables                      Message passing
Global memory    GMSV “Shared-memory multiprocessor”   GMMP
Distrib memory   DMSV “Distributed shared memory”      DMMP “Distrib-memory multicomputer”

Design choices highlighted: SIMD versus MIMD; global versus distributed memory

Page 64:

SIMD versus MIMD Architectures

Most early parallel machines had SIMD designs
- Attractive to have skeleton processors (PEs)
- Eventually, many processors per chip
- High development cost for custom chips, high cost
- MSIMD and SPMD variants

Most modern parallel machines have MIMD designs
- COTS components (CPU chips and switches)
- MPP: Massively or moderately parallel?
- Tightly coupled versus loosely coupled
- Explicit message passing versus shared memory

Network-based NOWs and COWs
- Networks/Clusters of workstations

Grid computing
- Vision: Plug into wall outlets for computing power

SIMD Timeline (1960 to 2010), machines shown: ILLIAC IV, TMC CM-2, Goodyear MPP, DAP, MasPar MP-1, Clearspeed array coproc

Page 65:

Global versus Distributed Memory

Fig. 4.3 A parallel processor with global memory.

[Figure: processors 0 to p-1 connected to memory modules 0 to m-1 through a processor-to-memory network, with a separate processor-to-processor network and parallel I/O. Network options: crossbar, bus(es), MIN. Annotations: bottleneck, complex, expensive.]

Page 66:

Removing the Processor-to-Memory Bottleneck

A parallel processor with global memory and processor caches.

[Figure: processors 0 to p-1, each with a cache, connected to memory modules 0 to m-1 through a processor-to-memory network, with a separate processor-to-processor network and parallel I/O.]

Challenge: Cache coherence

Page 67:

Distributed Shared Memory

[Figure: processors 0 to p-1, each with its own memory, connected by an interconnection network, with parallel I/O.]

Some Terminology:

NUMA: Nonuniform memory access (distributed shared memory)

UMA: Uniform memory access (global shared memory)

COMA: Cache-only memory arch

Page 68:

Parallel Computer Models

• Interconnection networks

• Fully connected

• Hypercube

• Mesh

• Ring

• Cube

• Star

• ...

Page 69:

Parallel Computer Models

• Multiprocessors and Multicomputers
• Shared-Memory Multiprocessor
• The UMA (Uniform Memory Access) Model
• In a UMA multiprocessor model, the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access.
• Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.


Parallel Computer Models

Symmetric multiprocessor

When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In such a case, all processors are equally capable of running the executive programs.

In an asymmetric multiprocessor, only one processor or a subset of the processors is executive-capable.

The remaining processors, which have no I/O capability, are called attached processors.


Parallel Computer Models

The NUMA (Nonuniform Memory Access) Model

A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.


Parallel Computer Models

Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: the fastest is local memory access, the next is global memory access, and the slowest is access to remote memory, as illustrated in the figure.
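The three access patterns can be sketched as a simple cost model. The latency values below are invented for illustration only, and the names `access_time` and `average_access_time` are hypothetical. The only property taken from the text is the ordering: local is faster than global, and global is faster than remote.

```python
# Hypothetical latencies in nanoseconds; only their ordering comes from the text.
LOCAL_NS = 100
GLOBAL_NS = 300
REMOTE_NS = 900


def access_time(requesting_node, location):
    """Latency of one access.

    location is ("global", None) for the globally shared memory, or
    ("local", owner) for a word held in node `owner`'s local memory.
    """
    kind, owner = location
    if kind == "global":
        return GLOBAL_NS
    if owner == requesting_node:
        return LOCAL_NS
    return REMOTE_NS   # another node's local memory


def average_access_time(f_local, f_global, f_remote):
    """Weighted average latency for a given mix of access fractions."""
    if abs(f_local + f_global + f_remote - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return f_local * LOCAL_NS + f_global * GLOBAL_NS + f_remote * REMOTE_NS
```

Under these assumed numbers, a program whose accesses are 80% local, 10% global, and 10% remote sees an average latency of about 200 ns, which shows why data placement matters on a NUMA machine.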


Parallel Computer Models

The COMA Model

A multiprocessor using cache-only memory assumes the COMA model. This model is depicted in the following figure.

The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node.

Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent nonuniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories.


Parallel Computer Models

Representative Multiprocessors

Several commercially available multiprocessors are summarized in the following table:


Parallel Computer Models

Distributed-Memory Multicomputer

A distributed-memory multicomputer system is modeled in the following figure. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.

The message-passing network provides point-to-point static connections among the nodes. All local memories are private and accessible only by local processors. For this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines.
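The NORMA idea can be sketched in a few lines: each node owns a private memory, and data moves between nodes only through explicit messages. This is a single-process simulation for illustration, not a real message-passing library. The class names `Node` and `Network` and the `send`/`receive` methods are hypothetical.

```python
from collections import deque


class Node:
    """An autonomous node with a private local memory and a message inbox."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_memory = {}   # private: never read directly by other nodes
        self.inbox = deque()     # messages delivered by the network


class Network:
    """Message-passing network connecting the nodes (simulated)."""
    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def send(self, src, dst, key):
        # The sender reads its own memory and ships a copy of the value.
        value = self.nodes[src].local_memory[key]
        self.nodes[dst].inbox.append((src, key, value))

    def receive(self, dst):
        # The receiver takes the oldest message and stores it locally.
        src, key, value = self.nodes[dst].inbox.popleft()
        self.nodes[dst].local_memory[key] = value
        return src, key, value
```

A short usage example: after `net = Network(2)` and `net.nodes[0].local_memory["x"] = 42`, node 1 cannot see `x` until `net.send(0, 1, "x")` is followed by `net.receive(1)`.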


Parallel Computer Models

Representative Multicomputers

Three message-passing multicomputers are summarized in the following table. With distributed processor/memory nodes, these machines are better at achieving scalable performance. However, message passing imposes a burden on programmers, who must distribute the computations and data sets over the nodes and establish communication among the nodes.


Parallel Computer Models

A Taxonomy of MIMD Computers

Parallel computers appear in either SIMD or MIMD configurations. SIMD machines appeal more to special-purpose applications. It is clear that SIMDs are not size-scalable, but it is unclear whether large SIMDs are generation-scalable. The fact that the CM-5 has an MIMD architecture, a departure from the SIMD architecture of the CM-2, may shed some light on the architectural trend. Furthermore, the boundary between multiprocessors and multicomputers has become blurred in recent years. Eventually, the distinction may vanish.


Parallel Computer Models

Multivector and SIMD Computers

Here we introduce supercomputers and parallel processors for vector processing and data parallelism. We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.

Vector Supercomputers

A vector computer is often built on top of a scalar processor. As shown in the following figure, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation, it is executed directly by the scalar processor using the scalar functional pipelines.


Parallel Computer Models

Representative Supercomputers

Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers.


Parallel Computer Models

SIMD Supercomputers

Recall the abstract model of SIMD computers, which has a single instruction stream operating over multiple data streams. An operational model of SIMD computers is presented in the following figure.


Parallel Computer Models

SIMD Machine Model

An operational model of an SIMD computer is specified by a 5-tuple:

$M = \{N, C, I, M, R\}$

where

(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs and the Connection Machine CM-2 uses 65,536 PEs.

(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.

(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.

(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.

(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.

One can describe a particular SIMD machine architecture by specifying the 5-tuple.
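The 5-tuple can be written down directly as a data structure. The sketch below is one possible encoding, not a description of any real machine. The class name `SIMDMachine`, the representation of masks and routing functions as Python callables, and the example instruction names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class SIMDMachine:
    N: int                                   # number of processing elements
    C: Set[str]                              # instructions run by the control unit
    I: Dict[str, Callable[[int], int]]       # instructions broadcast to the PEs
    M: Dict[str, Callable[[int], bool]]      # masks: PE index -> enabled?
    R: Dict[str, Callable[[int, int], int]]  # routing: (PE index, N) -> destination PE

    def broadcast(self, instr: str, mask: str, data: List[int]) -> List[int]:
        """Apply one broadcast instruction on every enabled PE."""
        if len(data) != self.N:
            raise ValueError("need exactly one data item per PE")
        op, enabled = self.I[instr], self.M[mask]
        # Disabled PEs keep their data unchanged.
        return [op(x) if enabled(i) else x for i, x in enumerate(data)]

    def route(self, func: str, data: List[int]) -> List[int]:
        """Move each PE's item to the PE chosen by a routing function.

        The routing function is assumed to be a permutation of PE indices.
        """
        if len(data) != self.N:
            raise ValueError("need exactly one data item per PE")
        dest = self.R[func]
        out = list(data)
        for i, x in enumerate(data):
            out[dest(i, self.N)] = x
        return out


# A toy 4-PE machine with hypothetical instruction, mask, and routing names.
toy = SIMDMachine(
    N=4,
    C={"branch", "scalar_add"},
    I={"double": lambda x: 2 * x},
    M={"all": lambda i: True, "even": lambda i: i % 2 == 0},
    R={"shift_right": lambda i, n: (i + 1) % n},
)
```

With this toy machine, `toy.broadcast("double", "even", [1, 2, 3, 4])` doubles only the items held by PEs 0 and 2, and `toy.route("shift_right", [0, 1, 2, 3])` rotates the items by one PE.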


Parallel Computer Models

Representative SIMD Computers

Three SIMD supercomputers are summarized in the following table. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and the DAP610 are fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs.


Parallel Computer Models

Architectural Development Tracks

The architectures of most existing computers follow certain development tracks. Understanding the features of the various tracks provides insights for new architectural development. We look into six tracks to be studied in later chapters. These tracks are distinguished by similarity in computational models and technological bases.

Multiple-Processor Tracks

Generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer.

Message-Passing Track

The Cosmic Cube pioneered the development of message-passing multicomputers.


Parallel Computer Models

Shared-Memory Track

The figure shows a track of multiprocessor development employing a single address space in the entire system.


Parallel Computer Models

Multivector Track

These are traditional vector supercomputers. The CDC7600 was the first vector dual-processor system.


Parallel Computer Models

SIMD Track

The Illiac IV pioneered the construction of SIMD computers, although the array processor concept can be traced back far earlier, to the 1960s.


Parallel Computer Models

Multithreaded and Dataflow Tracks

These are two research tracks that have been experimented with mainly in laboratories.

Multithreading Track

The multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which extended the concept of scoreboarding of multiple functional units in the CDC6400.


Parallel Computer Models

The Dataflow Track

The key idea is to use a dataflow mechanism, instead of a control-flow mechanism as in von Neumann machines, to direct the program flow. Fine-grain, instruction-level parallelism is exploited in dataflow computers.
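The difference from control flow can be illustrated with a toy interpreter in which an instruction fires as soon as all of its operands are available, regardless of the order in which the program lists it. This is a minimal sketch of the data-driven firing rule only. The function name `run_dataflow` and the dictionary encoding of the program are assumptions, not a model of any real dataflow machine.

```python
import operator


def run_dataflow(instructions, inputs):
    """Execute a program by data availability instead of program order.

    instructions: dict mapping a result name to (operation, [operand names]).
    inputs: dict of initially available values.
    Returns the final values and the groups of instructions fired per step.
    """
    values = dict(inputs)
    pending = dict(instructions)
    firing_order = []
    while pending:
        # An instruction is ready when every operand already has a value.
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in values for s in srcs)]
        if not ready:
            raise ValueError("some operands can never become available")
        # All ready instructions are independent and could fire in parallel.
        for name in ready:
            op, srcs = pending.pop(name)
            values[name] = op(*(values[s] for s in srcs))
        firing_order.append(sorted(ready))
    return values, firing_order


# The product is listed first, yet it fires last because its operands
# only become available after the sum and the difference are computed.
program = {
    "prod": (operator.mul, ["sum", "diff"]),
    "sum": (operator.add, ["x", "y"]),
    "diff": (operator.sub, ["x", "y"]),
}
result, order = run_dataflow(program, {"x": 3, "y": 4})
```

In this example the sum and the difference fire together in the first step, which is the instruction-level parallelism the text refers to, and the product fires in the second step.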

