Date post: | 11-Jan-2016 |
Category: |
Documents |
Upload: | augustus-tucker |
ADVANCED COMPUTER
ARCHITECTURE
Parallelism, Scalability, Programmability
Dr Mahmoud [email protected]
Iran University of Science and Technology, Computer Faculty
Advanced Computer Architecture Dr Fathy
2
What is Computer Architecture
Forces on Computer Architecture
A Take on Moore’s Law
• Moore’s Law (1965): the number of transistors per square inch doubles every year
• Reality: the number of transistors per square inch doubles every 18 months
• CPU speed increases 54% per year
• DRAM capacity increases 80% per year (quadrupled every 3 years)

Year   Technology           Relative Performance
1951   Vacuum tube          1
1965   Transistor           35
1975   Integrated circuit   900
1995   VLSI                 2,400,000
Processor Performance
Clever Architecture Design
Processor – Memory Performance Gap
Technology Trend vs. Power Dissipation
“Hot” Computer
Importance of Low Power Processor Design
Computer Food Chain
Computer Engineering Methodology
Measurement and Evaluation
Measurement and Evaluation (contd)
Three components of computer architecture evaluation:
- Simulators
- Benchmarks
- Evaluation Metrics (Performance, Cost, Power)
A Computer Architecture Simulator
A taxonomy of Simulator Tools
Functional vs. Performance Simulators
Execution-Driven vs. Trace-Driven Simulation
Computer Performance
• History of Computer Performance
• Execution time of a single instruction (such as addition)
• Instruction mix
• MIPS
• MFLOPS (with the introduction of supercomputers, 1970-1980)
• Real programs (difficult to run, and dependent on different operating systems)
• Toy programs (System Performance Evaluation Cooperative)
• 1988: SPEC was established by SUN, MIPS, DEC & Apollo
SPEC History
• SPEC History
• SPEC 89: CPU intensive (6 floating-point + 4 integer programs)
• SPEC 92 (SPECint, SPECfp), deleting programs such as Matrix 300 from SPEC 89
• SPEC 95
• SPEC 2000 (11 integer programs, CINT2000; 14 floating-point programs, CFP2000)
• SPEC viewperf (3D rendering)
• SPEC apc:
» Pro/Engineer
» SolidWorks (3D CAD)
» Graphic V15 (aircraft design)
Programs to evaluate Processor Performance
Benchmarks
Performance & Measuring
• The execution time of a program is the main measure of computer performance
• A machine X is n% faster than machine Y if:
$$\frac{\text{execution time of } Y}{\text{execution time of } X} = 1 + \frac{n}{100}$$
$$n = \frac{\text{performance}_X - \text{performance}_Y}{\text{performance}_Y} \times 100$$
Performance & Measuring
Example:
If machine X executes a program in 10 seconds and machine Y executes the same program in 15 seconds, then machine X is 50% faster than machine Y:
$$\frac{\text{execution time of } Y}{\text{execution time of } X} = \frac{15}{10} = 1.5 = 1 + \frac{n}{100} \quad\Rightarrow\quad n = 50$$
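The "n% faster" relation can be checked with a few lines of Python. This is only an illustrative sketch; the function name `percent_faster` is made up for this example and does not come from the slides.

```python
def percent_faster(time_x: float, time_y: float) -> float:
    """Return n such that machine X is n% faster than machine Y.

    Uses the relation time_y / time_x = 1 + n / 100.
    """
    return (time_y / time_x - 1.0) * 100.0


# Values from the example: X takes 10 s, Y takes 15 s.
n = percent_faster(time_x=10.0, time_y=15.0)
print(f"X is {n:.0f}% faster than Y")
```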
Performance & Measuring (Amdahl’s Law)
• Speedup
$$\text{Speedup} = \frac{\text{original execution time}}{\text{execution time after improvement}}$$
$$\text{Speedup} = \frac{T}{T_s + \dfrac{T - T_s}{P}}$$
T = the total (original) execution time of the program
T_s = the sequential time of the program
P = the degree of parallelism
Performance & Measuring (Amdahl’s Law)
$$\text{Maximum Speedup} = \frac{T}{T_s}$$
Example: Assume that the processing power of one part of a system has been increased 10 times, but this part accounts for just 40% of the total execution time. What is the speedup?
$$\text{Speedup} = \frac{1}{0.6 + \dfrac{0.4}{10}} = 1.56$$
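As a rough sketch, Amdahl's Law can be written as a small Python function. The names `amdahl_speedup` and `max_speedup` are illustrative, not from the slides; the enhanced fraction corresponds to (T - T_s)/T and the enhancement factor to P.

```python
def amdahl_speedup(fraction_enhanced: float, enhancement_factor: float) -> float:
    """Overall speedup when only a fraction of the execution time is sped up."""
    unaffected = 1.0 - fraction_enhanced
    return 1.0 / (unaffected + fraction_enhanced / enhancement_factor)


def max_speedup(fraction_enhanced: float) -> float:
    """Limit of the speedup as the enhancement factor grows without bound."""
    return 1.0 / (1.0 - fraction_enhanced)


# Example from the slide: 40% of the time is made 10 times faster.
speedup = amdahl_speedup(fraction_enhanced=0.4, enhancement_factor=10.0)
print(f"speedup = {speedup:.2f}")
print(f"upper bound = {max_speedup(0.4):.2f}")
```

The upper bound shows why the unimproved 60% dominates: no enhancement factor, however large, can push the speedup past 1/0.6.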
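Performance & Measuring (Amdahl’s Law)
Example: The processing power of a CPU has been increased 5 times, but the cost of the new CPU has also increased 5 times. CPU time is 50% of the program's execution time, and the CPU accounts for 1/3 of the cost of the whole computer. Is this upgrade reasonable from a cost/performance point of view?
$$\text{Speedup} = \frac{1}{0.5 + \dfrac{0.5}{5}} = 1.67$$
$$\text{Cost of new machine} = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33$$
The cost rises by a factor of 2.33 while performance improves by a factor of only 1.67, so the upgrade is not justified on cost/performance grounds.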
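The cost/performance comparison in this example can be reproduced with a short calculation. The sketch below simply restates the slide's numbers; the variable names are chosen for illustration.

```python
# Performance side: 50% of the time is CPU time, and the CPU is 5 times faster.
cpu_fraction_of_time = 0.5
cpu_speed_factor = 5.0
speedup = 1.0 / ((1.0 - cpu_fraction_of_time) + cpu_fraction_of_time / cpu_speed_factor)

# Cost side: the CPU is 1/3 of the machine cost and becomes 5 times as expensive.
cpu_fraction_of_cost = 1.0 / 3.0
cpu_cost_factor = 5.0
cost_ratio = (1.0 - cpu_fraction_of_cost) * 1.0 + cpu_fraction_of_cost * cpu_cost_factor

print(f"speedup    = {speedup:.2f}")
print(f"cost ratio = {cost_ratio:.2f}")
print("worth it" if speedup >= cost_ratio else "not worth it on cost/performance")
```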
Performance & Measuring
Time in Linux
Example:
Time   90.7              12.9                2:39               65%
       (user CPU time)   (system CPU time)   (execution time)   (CPU time / execution time)
CPU Time = (Cycles of the program) / (Clock rate)
CPU Time = (Cycles of the program) × (Period of each clock)
CPU Time = CPI (clocks per instruction) × (Number of instructions) × (Period of each clock)
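A small sketch of both calculations on this slide: the 65% figure from the timing output, and the CPU time equation. The function name `cpu_time` and the instruction count, CPI, and clock rate in the second part are hypothetical values picked only to exercise the formula.

```python
# Timing figures from the example, in seconds.
user_cpu_s = 90.7
system_cpu_s = 12.9
elapsed_s = 2 * 60 + 39  # 2:39 of wall-clock execution time

cpu_share = (user_cpu_s + system_cpu_s) / elapsed_s
print(f"CPU time / execution time = {cpu_share:.0%}")


def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x CPI x clock period (= 1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical program: 1e9 instructions, CPI of 2, 1 GHz clock.
print(f"CPU time = {cpu_time(1e9, 2.0, 1e9):.1f} s")
```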
Performance & Measuring
• RISC processor: CPI (small), number of instructions (large), period of each clock (small)
• CISC processor: CPI (large), number of instructions (small), period of each clock (large)
• MIPS = Number of instructions / (Execution time × 10^6)
• MIPS = Clock rate / (CPI × 10^6)
Performance & Measuring
MIPS is not a good metric for performance. For example, the i860 at 50 MHz is rated at 100 MFLOPS and 150 MOPS, while the R3000 (a MIPS-family processor) is rated at 16 MFLOPS and 33 MOPS, yet the R3000 executes the SPEC programs 15% faster than the i860.
EXAMPLE (showing that MIPS is not a good metric for performance evaluation): A computer has three types of instructions with different CPI values.
Instruction type   CPI
A                  1
B                  2
C                  3
Performance & Measuring
The compiler designer has two choices for translating a high-level language function:
           A   B   C
Choice 1   2   1   2
Choice 2   4   1   1
What is the CPI of each choice?
CPI1 = 10 / 5 = 2
CPI2 = 9 / 6 = 1.5
Now the compiler designer has two choices for translating a program:
           A    B   C
Choice 1   5    1   1   (million instructions)
Choice 2   10   1   1   (million instructions)
What are the MIPS rating and execution time of each sequence?
Performance & Measuring
Example: This example shows how adding a new instruction to a computer affects the performance of the system.
The instruction mix of a computer is as follows:
Operation   Probability   CPI
ALU         0.43          1
Load        0.21          2
Store       0.12          2
Branch      0.24          2
Assume that 25% of the ALU operations use a loaded operand only once; that is, the operand is not used again by later instructions. Now we add a new register/memory (REG/MEM) ADD instruction that needs two cycles to execute. This change causes branch instructions to take 3 cycles. Is the new machine faster than the old one?
Performance & Measuring
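$$CPI_{old} = \sum_i f_i \times \text{clock}_i = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57$$
$$CPI_{new} = \frac{(0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 + (0.25 \times 0.43) \times 2}{1 - 0.25 \times 0.43} = 1.908$$
Old CPU time = 1.57 × Old_Instruction_Count × Clock_Period
New CPU time = (0.893 × Old_Instruction_Count) × 1.908 × Clock_Period = 1.7 × Old_Instruction_Count × Clock_Period
Therefore the old machine is faster.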
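The CPI calculation for this example can be sketched in Python as follows. The dictionary layout and variable names are illustrative; the frequencies and cycle counts are the ones given in the instruction-mix table, and times are expressed in clock periods per old instruction.

```python
# Instruction mix from the example: operation -> (frequency, cycles).
mix = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}

cpi_old = sum(freq * cycles for freq, cycles in mix.values())

# 25% of ALU operations are fused with their load into one REG/MEM instruction.
fused = 0.25 * mix["ALU"][0]

cycles_new = (
    (mix["ALU"][0] - fused) * 1     # remaining plain ALU operations
    + (mix["Load"][0] - fused) * 2  # loads that are still needed
    + mix["Store"][0] * 2           # stores are unchanged
    + mix["Branch"][0] * 3          # branches now take 3 cycles
    + fused * 2                     # new REG/MEM instructions, 2 cycles each
)
instruction_ratio = 1.0 - fused     # each fusion removes one load instruction
cpi_new = cycles_new / instruction_ratio

time_old = cpi_old                      # clock periods per old instruction
time_new = instruction_ratio * cpi_new  # same unit, for the new machine

print(f"CPI old = {cpi_old:.2f}, CPI new = {cpi_new:.3f}")
print(f"time old = {time_old:.2f}, time new = {time_new:.2f}")
```

The new machine executes fewer instructions, but the slower branches outweigh that saving.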
Performance & Measuring
Aircraft   Passengers   Speed      Passengers × mph
Boeing     470          610 mph    286,700
Concorde   132          1350 mph   178,200
Relative MIPS = (execution time of reference / execution time of X) × MIPS of reference
Weighted megaflops (operation weights):
ADD, SUB, MUL   1
DIV, SQR        4
EXP, SIN        8
Benchmarks (Dhrystone):
VAX 11/780   1.7 KD/s
Intel i860   72.5
SUN 4        16.8
MC 68040     40
CRAY X-MP    18.5
VAX 8600     6.4
Performance & Measuring
Benchmark (Whetstone):
A FORTRAN floating point program
DEC 11/780 1.15 KW/S
IBM 4321 2
TP1 (A Database Benchmark):
VAX9000 70 TPS
Sequent 140 TPS
Benchmarks for intelligent computers are measured in KLIPS (kilo logical inferences per second)
400 KLIPS 40 MIPS
Performance & Measuring
Example:
Two programs A & B are run on two machines X & Y. Which machine is faster?
                  Execution time    Normalized to X   Normalized to Y
                  on X     on Y     X       Y         X       Y
Program A         1        10       1       10        0.1     1
Program B         1000     100      1       0.1       10      1
Arithmetic mean   500.5    55       1       5.05      5.05    1
Geometric mean    31.6     31.6     1       1         1       1
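The rows of this table can be recomputed with a short script, which also shows why the arithmetic mean of normalized times depends on the reference machine while the geometric mean does not. The helper names are illustrative only.

```python
import math

# Execution times of programs A and B on each machine, from the table.
times = {"X": [1.0, 1000.0], "Y": [10.0, 100.0]}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def normalized(values, reference):
    """Divide each program's time by its time on the reference machine."""
    return [v / r for v, r in zip(values, reference)]


for ref in ("X", "Y"):
    for machine in ("X", "Y"):
        norm = normalized(times[machine], times[ref])
        print(
            f"{machine} normalized to {ref}: "
            f"AM = {arithmetic_mean(norm):.2f}, GM = {geometric_mean(norm):.2f}"
        )
```

With X as the reference, Y looks about five times slower by arithmetic mean; with Y as the reference, X does. The geometric mean reports the two machines as equal in both cases.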
Performance & Measuring
Example:
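Execution times on machines A, B, and C:
            A      B     C
Program 1   1      10    20
Program 2   1000   100   20
Total       1001   110   40
$$\text{Weight}_i = \frac{1}{T_i \times \sum_{j=1}^{n} \dfrac{1}{T_j}}$$
Weight_i is the weight for program i (out of n programs) on machine A or B; it normalizes the execution times.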
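As a sketch of how the weight formula behaves, the snippet below applies it to the execution times on machine A. The function name `equal_time_weights` is an illustrative choice, not a term from the slides.

```python
def equal_time_weights(times):
    """Weight_i = 1 / (T_i * sum_j(1 / T_j)) for a list of execution times."""
    inverse_sum = sum(1.0 / t for t in times)
    return [1.0 / (t * inverse_sum) for t in times]


# Execution times of programs 1 and 2 on machine A.
times_a = [1.0, 1000.0]
weights = equal_time_weights(times_a)
weighted_time = sum(w * t for w, t in zip(weights, times_a))

print("weights:", [round(w, 6) for w in weights])
print(f"weighted execution time on A: {weighted_time:.3f}")
```

The weights sum to 1, and each product Weight_i × T_i comes out the same, so the long-running program no longer dominates the weighted total.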
Performance & Measuring
MIPS = Clock Rate / (CPI × 10^6)
CPI1 = 1.43, CPI2 = 1.25
MIPS1 = 69.9, MIPS2 = 80
MIPS2 > MIPS1
CPU Time = No. of Instructions × CPI / Clock Rate
CPU Time1 = 0.1 s, CPU Time2 = 0.15 s
The program produced by compiler choice 1 is faster than the one produced by choice 2, yet it has the lower MIPS rating. So MIPS is not a good metric for performance evaluation.
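The numbers on this slide can be reproduced with the sketch below. The slides do not state the clock rate; 100 MHz is an assumption inferred from the quoted CPU times (10 million cycles in 0.1 s). The instruction counts and per-type CPI values are the ones from the compiler example, and the function name `evaluate` is illustrative.

```python
CLOCK_HZ = 100e6  # assumption: inferred from the CPU times, not stated in the slides

cpi_by_type = {"A": 1, "B": 2, "C": 3}
instruction_counts = {
    "choice 1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "choice 2": {"A": 10e6, "B": 1e6, "C": 1e6},
}


def evaluate(counts):
    """Return (CPI, MIPS rating, CPU time in seconds) for one instruction mix."""
    instructions = sum(counts.values())
    cycles = sum(counts[kind] * cpi_by_type[kind] for kind in counts)
    cpi = cycles / instructions
    mips = CLOCK_HZ / (cpi * 1e6)
    cpu_time_s = cycles / CLOCK_HZ
    return cpi, mips, cpu_time_s


results = {name: evaluate(counts) for name, counts in instruction_counts.items()}
for name, (cpi, mips, cpu_time_s) in results.items():
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {cpu_time_s:.2f} s")
```

With the unrounded CPI, the MIPS rating for choice 1 comes out at about 70; the conclusion is unchanged, since choice 2 has the higher MIPS rating but the longer CPU time.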
Parallel Computer Models
Why Parallel Processing?
[Figure: processor performance (KIPS, MIPS, GIPS, TIPS) versus calendar year (1980 to 2010), with data points for the 80286, 68000, 80386, 80486, 68040, Pentium, Pentium II, and R10000; the trend line grows by a factor of about 1.6 per year.]
Fig. 1.1 The exponential growth of microprocessor performance, known as Moore’s Law, shown over the past two decades (extrapolated).
Parallel Computer Models
• The Semiconductor Technology Roadmap
Calendar year       2001   2004   2007   2010   2013   2016
Halfpitch (nm)      140    90     65     45     32     22
Clock freq. (GHz)   2      4      7      12     20     30
Wiring levels       7      8      9      10     10     10
Power supply (V)    1.1    1.0    0.8    0.7    0.6    0.5
Max. power (W)      130    160    190    220    250    290
From the 2001 edition of the roadmap [Alla02]
Factors contributing to the validity of Moore’s law:
• Denser circuits
• Architectural improvements
Measures of processor performance:
• Instructions/second (MIPS, GIPS, TIPS, PIPS)
• Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
• Running time on benchmark suites
Parallel Computer Models
• Why High-Performance Computing?
Higher speed (solve problems faster)
• Important when there are “hard” or “soft” deadlines; e.g., 24-hour weather forecast
Higher throughput (solve more problems)
• Important when there are many similar tasks to perform; e.g., transaction processing
Higher computational power (solve larger problems)
• e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy
Categories of supercomputers
• Uniprocessor; aka vector machine
• Multiprocessor; centralized or distributed shared memory
• Multicomputer; communicating via message passing
• Massively parallel processor (MPP; 1K or more processors)
Parallel Computer Models
• The Speed-of-Light Argument
The speed of light is about 30 cm/ns.
Signals travel at a fraction of speed of light (say, 1/3).
If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.
This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.
How does parallel processing help? Wouldn’t multiple processors need to communicate via signals as well?
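The 10 GIPS bound in the speed-of-light argument follows from simple arithmetic, sketched here in Python. The variable names are illustrative; the 1/3 fraction and the 1 cm distance are the values assumed in the slide.

```python
LIGHT_SPEED_CM_PER_NS = 30.0   # speed of light, about 30 cm/ns
SIGNAL_FRACTION = 1.0 / 3.0    # signals travel at about 1/3 of that
DISTANCE_CM = 1.0              # distance a signal must cover per instruction

signal_speed_cm_per_ns = LIGHT_SPEED_CM_PER_NS * SIGNAL_FRACTION
min_instruction_time_ns = DISTANCE_CM / signal_speed_cm_per_ns

# One instruction per min_instruction_time_ns nanoseconds;
# 1 instruction/ns equals 1 GIPS.
max_gips = 1.0 / min_instruction_time_ns

print(f"minimum instruction time = {min_instruction_time_ns:.1f} ns")
print(f"performance limit = {max_gips:.0f} GIPS")
```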
Parallel Computer Models
• The Quest for Higher Performance
Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L (LLNL, California)
• Material science, nuclear stockpile sim
• 32,768 proc’s, 8 TB, 28 TB disk storage
• Linux + custom OS
• 71 TFLOPS, $100 M
• Dual-proc Power-PC chips (10-15 W power)
• Full system: 130k-proc, 360 TFLOPS (est)

2. SGI Columbia (NASA Ames, California)
• Aerospace/space sim, climate research
• 10,240 proc’s, 20 TB, 440 TB disk storage
• Linux
• 52 TFLOPS, $50 M
• 20x Altix (512 Itanium2) linked by Infiniband

3. NEC Earth Sim (Earth Sim Ctr, Yokohama)
• Atmospheric, oceanic, and earth sciences
• 5,120 proc’s, 10 TB, 700 TB disk storage
• Unix
• 36 TFLOPS*, $400 M?
• Built of custom vector microprocessors
• Volume = 50x IBM, Power = 14x IBM
Parallel Computer Models
• Supercomputer Performance Growth
[Figure: supercomputer performance (MFLOPS, GFLOPS, TFLOPS, PFLOPS) versus calendar year (1980 to 2010), with curves for vector supers (Cray X-MP, Y-MP), $30M and $240M MPPs (CM-2, CM-5), ASCI goals, and micros (80386, 80860, Alpha).]
The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
One Reason for Sublinear Speedup: Communication Overhead
[Figure: solution time, split into computation and communication, versus number of processors; and actual speedup compared with ideal speedup versus number of processors.]
Trade-off between communication time and computation time in the data-parallel realization
Another Reason for Sublinear Speedup: Input/Output Overhead
[Figure: solution time, split into computation and I/O time, versus number of processors; and actual speedup compared with ideal speedup versus number of processors.]
Effect of a constant I/O time on the data-parallel realization
Trends in High-Technology Development
[Figure: timeline from 1960 to 2000 for four fields (graphics, networking, RISC, parallelism), showing phases of government research (GovRes), industrial research (IndRes), industrial development (IndDev), and $1B business, with transfer of ideas/people between them.]
Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).
Trends in Hi-Tech Development (2003)
Status of Computing Power (circa 2000)
GFLOPS on desktop:
• Apple Macintosh, with G4 processor
TFLOPS in supercomputer center:
• 1152-processor IBM RS/6000 SP (switch-based network)
• Cray T3E, torus-connected
PFLOPS on drawing board:
• 1M-processor IBM Blue Gene (2005?)
• 32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers
• Processor: 8 threads, on-chip memory, no data cache
• Chip: defect-tolerant, row/column rings in a 6 × 6 array
• Board: 8 × 8 chip grid organized as 4 × 4 × 4 cube
• Tower: boards linked to 4 neighbors in adjacent towers
• System: 32 × 32 × 32 cube of chips, 1.5 MW (water-cooled)
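The Blue Gene figures quoted above are mutually consistent, as a quick calculation shows. This is only a sanity check on the slide's numbers; the variable names are illustrative.

```python
procs_per_chip = 32
chips_per_board = 64
boards_per_tower = 8
towers = 64
threads_per_proc = 8

chips = chips_per_board * boards_per_tower * towers
processors = procs_per_chip * chips
threads = threads_per_proc * processors

print(f"chips      = {chips:,}")       # matches the 32 x 32 x 32 cube of chips
print(f"processors = {processors:,}")  # the "1M-processor" figure
print(f"threads    = {threads:,}")
```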
Parallel Computer Models
• Parallel Processing on Single-Processor Computers
• 1- Using multiple functional units
• 2- Parallelism and pipelining inside the CPU
• 3- Overlapping I/O and CPU operations
• 4- Balancing the bandwidth of subsystems
• 4-1- Bandwidth of CPU (high)
• 4-2- Bandwidth of memory (lower)
• 4-3- Bandwidth of I/O (very low)
• 5- Memory hierarchy
• 5-1- Register memory 5-2- Cache memory
• 5-3- Main memory 5-4- Secondary memory
• 6- Using multiprogramming and time sharing
Types of Parallelism: A Taxonomy
Flynn’s categories (single or multiple instruction streams versus single or multiple data streams):
SISD   Uniprocessors
SIMD   Array or vector processors
MISD   Rarely used
MIMD   Multiprocessors or multicomputers
Johnson’s expansion of MIMD (global or distributed memory versus shared variables or message passing):
GMSV   Shared-memory multiprocessors
GMMP   Rarely used
DMSV   Distributed shared memory
DMMP   Distributed-memory multicomputers
The Flynn-Johnson classification of computer systems.
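The Flynn-Johnson classification is essentially a pair of two-by-two lookup tables, which the following sketch encodes in Python. The function names and string keys are illustrative choices, not part of the classification itself.

```python
# Flynn: (instruction streams, data streams) -> category
FLYNN = {
    ("single", "single"): "SISD",
    ("single", "multiple"): "SIMD",
    ("multiple", "single"): "MISD",
    ("multiple", "multiple"): "MIMD",
}

# Johnson's expansion of MIMD: (memory, communication) -> category
JOHNSON = {
    ("global", "shared variables"): "GMSV",
    ("global", "message passing"): "GMMP",
    ("distributed", "shared variables"): "DMSV",
    ("distributed", "message passing"): "DMMP",
}


def classify(instruction_streams, data_streams, memory=None, communication=None):
    """Return the Flynn category, refined by Johnson's scheme for MIMD machines."""
    category = FLYNN[(instruction_streams, data_streams)]
    if category == "MIMD" and memory is not None and communication is not None:
        return JOHNSON[(memory, communication)]
    return category


print(classify("single", "multiple"))                                      # array processor
print(classify("multiple", "multiple", "global", "shared variables"))      # shared-memory multiprocessor
print(classify("multiple", "multiple", "distributed", "message passing"))  # multicomputer
```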
Parallel Computer Models
Flynn’s classification of computer architectures.
Parallel Computer Models
• Flynn’s classification of computer architectures (Contd)
Parallel Computer Models
[Figure: the Flynn-Johnson classification drawn as a grid of control stream(s) (single, multiple) against data stream(s) (single, multiple), with MIMD subdivided by memory (global, distributed) and by communication/synchronization (shared variables, message passing). The figure highlights two distinctions: SIMD versus MIMD, and global versus distributed memory.]
SIMD versus MIMD Architectures
Most early parallel machines had SIMD designs
• Attractive to have skeleton processors (PEs)
• Eventually, many processors per chip
• High development cost for custom chips, high cost
• MSIMD and SPMD variants
Most modern parallel machines have MIMD designs
• COTS components (CPU chips and switches)
• MPP: Massively or moderately parallel?
• Tightly coupled versus loosely coupled
• Explicit message passing versus shared memory
Network-based NOWs and COWs
• Networks/Clusters of workstations
Grid computing
• Vision: Plug into wall outlets for computing power
SIMD timeline (1960 to 2010): ILLIAC IV, TMC CM-2, Goodyear MPP, DAP, MasPar MP-1, Clearspeed array coproc
Global versus Distributed Memory
Fig. 4.3 A parallel processor with global memory.
[Figure: processors 0 to p-1 connected through a processor-to-memory network to memory modules 0 to m-1, with a processor-to-processor network and parallel I/O.]
Options for the processor-to-memory network: crossbar, bus(es), MIN
Drawbacks noted: bottleneck, complex, expensive
Removing the Processor-to-Memory Bottleneck
A parallel processor with global memory and processor caches.
[Figure: processors 0 to p-1, each with its own cache, connected through a processor-to-memory network to memory modules 0 to m-1, with a processor-to-processor network and parallel I/O.]
Challenge: cache coherence
Distributed Shared Memory
[Figure: processors 0 to p-1, each with its own memory, connected by an interconnection network, with parallel I/O.]
Some Terminology:
NUMA   Nonuniform memory access (distributed shared memory)
UMA    Uniform memory access (global shared memory)
COMA   Cache-only memory arch
Parallel Computer Models
• Interconnection networks
• Fully connected
• Hypercube
• Mesh
• Ring
• Cube
• Star
• ...
Parallel Computer Models
• Multiprocessors and Multicomputers
• Shared-Memory Multiprocessor
• The UMA (Uniform Memory Access) Model
• In a UMA multiprocessor model, the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access.
• Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.
Parallel Computer Models
Symmetric multiprocessor
When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In such a case, all processors are equally capable of running the executive programs.
In an asymmetric multiprocessor, only one processor or a subset of the processors is executive-capable.
The remaining processors, which have no I/O capability, are called attached processors.
Parallel Computer Models
The NUMA (Non-Uniform Memory Access) Model
A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
Parallel Computer Models
Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: the fastest is local memory access, the next is global memory access, and the slowest is access to remote memory, as illustrated in the figure.
Parallel Computer Models
The COMA Model
A multiprocessor using cache-only memory assumes the COMA model. This model is depicted in the following figure.
The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node.
Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories.
Parallel Computer Models
Representative Multiprocessors
Several commercially available multiprocessors are summarized in the following table:
Parallel Computer Models
Distributed-Memory Multicomputers
A distributed-memory multicomputer system is modeled in the following figure. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.
The message-passing network provides point-to-point static connections among the nodes. All local memories are private and accessible only by local processors. For this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines.
Parallel Computer Models
Representative Multicomputers
Three message-passing multicomputers are summarized in the following table. With distributed processor/memory nodes, these machines are better at achieving scalable performance. However, message passing imposes a hardship on programmers, who must distribute the computations and data sets over the nodes and establish communication among nodes.
Parallel Computer Models
A Taxonomy of MIMD Computers
Parallel computers appear as either SIMD or MIMD configurations. The SIMDs appeal more to special-purpose applications. It is clear that SIMDs are not size-scalable, but it is unclear whether large SIMDs are generation-scalable. The fact that the CM-5 has an MIMD architecture, moving away from the SIMD architecture of the CM-2, may shed some light on the architectural trend. Furthermore, the boundary between multiprocessors and multicomputers has become blurred in recent years; eventually, the distinction may vanish.
Parallel Computer Models
Multivector and SIMD Computers
Here we introduce supercomputers and parallel processors for vector processing and data parallelism. We classify supercomputers either as pipelined vector machines using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.
Vector Supercomputers
A vector computer is often built on top of a scalar processor. As shown in the following figure, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation, it is executed directly by the scalar processor using the scalar functional pipelines.
Parallel Computer Models
Representative Supercomputers
Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers.
Parallel Computer Models
SIMD Supercomputers
Recall the abstract model of SIMD computers, which have a single instruction stream operating over multiple data streams. An operational model of SIMD computers is presented in the following figure.
Parallel Computer Models
SIMD Machine Model
An operational model of an SIMD computer is specified by a 5-tuple:
$$M = \{N, C, I, M, R\}$$
where
(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs and the Connection Machine CM-2 uses 65,536 PEs.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
One can describe a particular SIMD machine architecture by specifying the 5-tuple.
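To make the 5-tuple concrete, here is a toy Python sketch of a four-PE machine. Everything in it (the class name `SIMDMachine`, the `inc` instruction, the `even` mask, the `shift_right` routing function) is hypothetical and chosen only to illustrate how N, C, I, M, and R fit together; it does not model any real machine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class SIMDMachine:
    n: int                                           # N: number of PEs
    cu_instructions: Set[str]                        # C: executed by the control unit
    pe_instructions: Dict[str, Callable[[int], int]] # I: broadcast to all PEs
    masks: Dict[str, List[bool]]                     # M: enabled/disabled PE subsets
    routes: Dict[str, Callable[[int], int]]          # R: source PE -> destination PE

    def broadcast(self, data: List[int], op: str, mask: str) -> List[int]:
        """Apply one broadcast instruction on every enabled PE."""
        f = self.pe_instructions[op]
        enabled = self.masks[mask]
        return [f(x) if on else x for x, on in zip(data, enabled)]

    def route(self, data: List[int], name: str) -> List[int]:
        """Move each PE's value to the PE chosen by the routing function."""
        dest = self.routes[name]
        out = list(data)
        for src, value in enumerate(data):
            out[dest(src) % self.n] = value
        return out


machine = SIMDMachine(
    n=4,
    cu_instructions={"branch", "scalar_add"},
    pe_instructions={"inc": lambda x: x + 1},
    masks={"even": [True, False, True, False]},
    routes={"shift_right": lambda i: i + 1},
)

data = [10, 20, 30, 40]
after_inc = machine.broadcast(data, "inc", "even")    # only PEs 0 and 2 execute
after_shift = machine.route(after_inc, "shift_right") # cyclic shift by one PE
print(after_inc, after_shift)
```

The mask decides which PEs execute the broadcast instruction, and the routing function fixes the communication pattern; these are exactly the roles the M and R components play in the model.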
Parallel Computer Models
Representative SIMD Computers
Three SIMD supercomputers are summarized in the following table. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and the DAP610 are fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs.
Parallel Computer Models
Architectural Development Tracks
The architectures of most existing computers follow certain development tracks. Understanding the features of the various tracks provides insights for new architectural development. We look into six tracks to be studied in later chapters. These tracks are distinguished by similarity in computational models and technological bases.
Multiple-Processor Tracks
Generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer.
Message-Passing Track
The Cosmic Cube pioneered the development of message-passing multicomputers.
Parallel Computer Models
Shared-Memory Track
The figure shows a track of multiprocessor development employing a single address space in the entire system.
Parallel Computer Models
Multivector Track
These are traditional vector supercomputers. The CDC 7600 was the first vector dual-processor system.
Parallel Computer Models
SIMD Track
The Illiac IV pioneered the construction of SIMD computers, even though the array processor concept can be traced back far earlier, to the 1960s.
Parallel Computer Models
Multithreaded and Dataflow Tracks
These are two research tracks that have mainly been experimented with in laboratories.
Multithreading Track
The multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which extended the concept of scoreboarding of multiple functional units in the CDC 6400.
Parallel Computer Models
The Dataflow TrackThe key idea is to use a dataflow mechanism, instead of a
control-flow mechanism as in von Neumann machines, to direct the program flow. Fine-grain, instruction-level parallelism is exploited in dataflow computers.