Why Parallel Computing?

It is easy to pose huge computational problems:
• physical simulation in 3D: a 100 x 100 x 100 grid = 10^6 cells
• oceanography example: 48 M cells, several variables per cell; one time step = 30 Gflop (30,000,000,000 floating-point operations)
Why Parallel Computing?

Numerical prototyping: real phenomena are too complicated to model, and real experiments are too hard, too expensive, or too dangerous for a laboratory.
Examples: simulating aging effects on nuclear weapons (ASCI Project), oil reservoir simulation, large wind tunnels, galactic evolution, whole-factory or product life-cycle design and optimization, DNA matching (bioinformatics).
An Example -- Climate Prediction

What is climate? Climate(longitude, latitude, height, time) returns a vector of 6 values:
• temperature, pressure, humidity, and wind velocity (3 components)
Discretize: evaluate only at grid points: Climate(i, j, k, n), where t = n*dt, dt is a fixed time step, n is an integer, and i, j, k are integers indexing the grid cells.
An Example -- Climate Prediction

• Area: 3000 x 3000 miles; height: 11 miles -- a 3000 x 3000 x 11 cubic-mile domain
• Segment size: 0.1 x 0.1 x 0.1 cubic miles -- about 10^11 segments
• Two-day period with dt = 0.5 hours -- 2 x 24 x 2 = 96 time steps
• 100 instructions per segment; the computation of the parameters inside a segment uses the initial values and the values from neighboring segments (see the sketch below)
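To make the neighbor dependence concrete, here is a minimal Python/NumPy sketch of one such update on a toy grid; the function name climate_step, the grid size, and the simple 7-point averaging rule are illustrative assumptions, not the actual climate model:

```python
# Toy version of the segment update: each interior cell's new values depend
# on its old values and those of its six face neighbours (a 7-point stencil).
import numpy as np

NX, NY, NZ, NVARS = 8, 8, 4, 6      # toy grid; the real domain has ~10^11 segments

def climate_step(u):
    v = u.copy()
    v[1:-1, 1:-1, 1:-1] = (u[1:-1, 1:-1, 1:-1]
                           + u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
                           + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
                           + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]) / 7.0
    return v

state = np.random.rand(NX, NY, NZ, NVARS)   # 6 values per cell, as on the slide
for n in range(96):                          # 96 steps = two days at dt = 0.5 h
    state = climate_step(state)
```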
An Example -- Climate Prediction

A single update of the parameters over the entire domain requires 10^11 x 100 = 10^13 instructions (10 trillion instructions). Updating 96 times takes about 10^15 instructions.
On a single-CPU supercomputer with a 1000 MHz RISC CPU, the execution time is about 280 hours.
??? Taking 280 hours to predict the weather for the next 48 hours.
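The slide's arithmetic is easy to check; a quick back-of-the-envelope script using only the figures quoted above:

```python
# Back-of-the-envelope check of the slide's instruction count and runtime.
segments = (3000 * 3000 * 11) / (0.1 ** 3)   # ~1e11 segments
steps = 2 * 24 * 2                            # two days at dt = 0.5 h -> 96 steps
instructions = segments * 100 * steps         # ~1e15 instructions in total
seconds = instructions / 1e9                  # 1000 MHz CPU -> 1e9 instructions/s
print(f"{segments:.1e} segments, {instructions:.1e} instructions, "
      f"{seconds / 3600:.0f} hours")
# ~264 hours with exact figures; the slide rounds to 10^11 segments, ~280 hours.
```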
Issues in Parallel Computing

• Design of parallel computers
• Design of efficient algorithms
• Methods for evaluating parallel algorithms
• Parallel programming languages
• Parallel programming tools
• Portable parallel programs
• Automatic programming of parallel computers
Design of Parallel Computers

Parallel computing is information processing that emphasizes the concurrent manipulation of data elements belonging to one or more processes solving a single problem [Quinn:1994].
Parallel computer: a multiple-processor computer capable of parallel computing.
Efficient Algorithms

• Throughput: the number of results per second
• Speedup: S = T1 / Tp, where T1 is the execution time on one processor and Tp the execution time on P processors
• Efficiency: E = S / P (P = number of processors)
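A tiny worked example of these definitions (the timings below are hypothetical):

```python
def speedup(t1, tp):
    return t1 / tp                       # S = T1 / Tp

def efficiency(t1, tp, p):
    return speedup(t1, tp) / p           # E = S / P

# Hypothetical run: 100 s on one processor, 16 s on 8 processors.
print(speedup(100.0, 16.0))              # S = 6.25
print(efficiency(100.0, 16.0, 8))        # E ~ 0.78
```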
Scalability

• Algorithmic scalability: an algorithm is scalable if the available parallelism increases at least linearly with problem size.
• Architectural scalability: an architecture is scalable if it continues to yield the same performance per processor as the number of processors and the problem size are increased. Solve larger problems in the same amount of time by buying a parallel computer with more processors. ($$$$$ ??)
Parallel Architectures

• SMP: Symmetric Multiprocessor (SGI Power Challenge, SUN Enterprise 6000)
• MPP: Massively Parallel Processors -- Intel ASCI Red: 9152 processors (1997); SGI/Cray T3E 1200 LC1080-512: 1080 nodes (1998)
• Cluster: true distributed systems -- tightly-coupled software on loosely-coupled (LAN-based) hardware. NOW (Network of Workstations), COW (Cluster of Workstations), Pile-of-PCs (PoPC)
Levels of Abstraction

• Applications (sequential? parallel?)
• Programming models (shared memory? message passing?)
• Addressing space (shared memory? distributed memory?)
• Hardware architecture
A Simple Example

Take paper and pen. Algorithm:
Step 1: Write a number on your pad.
Step 2: Compute the sum of your neighbors' values.
Step 3: Write the sum on the paper.
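A minimal simulation of this exercise in Python, assuming (the slide does not say) that the participants sit in a ring and each has a left and a right neighbor:

```python
values = [3, 1, 4, 1, 5, 9]       # Step 1: everyone writes a number

def step(vals):
    # Steps 2-3: everyone simultaneously sums the left and right neighbours.
    n = len(vals)
    return [vals[(i - 1) % n] + vals[(i + 1) % n] for i in range(n)]

print(step(values))   # [10, 7, 2, 9, 10, 8] -- all six sums found "in parallel"
```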
1. Based on Control Mechanism

Flynn's classification, by instruction and data streams:
• SISD: single instruction stream, single data stream
• SIMD: single instruction stream, multiple data streams
• MIMD: multiple instruction streams, multiple data streams
• MISD: multiple instruction streams, single data stream
SIMD

Examples: Thinking Machines CM-1 and CM-2, MasPar MP-1 and MP-2
• Simple processors: e.g., 1- or 4-bit CPUs
• Fast global synchronization (global clock)
• Fast neighborhood communication
• Applications: image/signal processing, numerical analysis, data compression, ... (see the analogy below)
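As a loose analogy only (NumPy is a software library, not a SIMD machine): a single operation applied to whole data streams at once, with no explicit per-element loop:

```python
# SIMD flavour: one "instruction" (the addition) applied element-wise to
# multiple data elements at once, instead of one element per step.
import numpy as np

a = np.arange(8)      # data stream 1: [0 1 2 3 4 5 6 7]
b = np.ones(8)        # data stream 2
print(a + b)          # single operation, multiple data: [1. 2. ... 8.]
```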
2. Based on Address-Space Organization

Bell's classification of MIMD architectures:
• Message-passing architecture: local or private memory; multicomputer = MIMD message-passing computer (or distributed-memory computer)
• Shared-address-space architecture: hardware support for one-sided communication (read/write); multiprocessor = MIMD shared-address-space computer
Address Space

• A region of a computer's total memory within which addresses are continuous and may refer to one another directly in hardware.
• A shared-memory computer has only one user-visible address space.
• A disjoint-memory computer can have several. Disjoint memory is more commonly called distributed memory, but the memory of many shared-memory computers (multiprocessors) is physically distributed.
Multiprocessors vs. Multicomputers

• Shared-memory multiprocessor models: UMA (uniform memory access: all SMP servers); NUMA (nonuniform memory access: DASH, T3E); COMA (cache-only memory architecture: KSR)
• Distributed-memory multicomputer model: a message-passing network; the NORMA model (no remote memory access): IBM SP2, Intel Paragon, TMC CM-5, Intel ASCI Red, clusters
Parallel Computers at HKU

• Symmetric Multiprocessor (SMP): SGI PowerChallenge (CYC 807)
• Cluster: IBM PowerPC cluster (CYC LG 102)
• Distributed-memory machine: IBM SP2 (Computer Center)
Symmetric Multiprocessors (SMPs)

• Processors are connected to a shared memory module through a shared bus.
• Each processor has an equal right to access the shared memory and all I/O devices.
• A single copy of the OS runs the whole machine.
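The programming consequence is a single shared address space. A minimal Python sketch, with threads standing in for processors (an illustrative assumption):

```python
# All threads see the same memory; a lock serializes access to the shared counter.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)        # 40000: all four threads updated the same shared variable
```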
[Block diagram: a quad Pentium Pro (P6) SMP, CYC414 SRG Lab. Four P6 processors share the Pentium Pro processor bus (32-bit address, 64-bit data, 533 MB/s). A memory controller with memory interface controllers (MICs) drives DRAM over an interleaved 288-bit data path (72-bit memory data). A PCI bridge connects the 132 MB/s PCI bus (32-bit address, 32-bit data), the PCI devices, and a NIC leading to the network.]
SMP Machine: SGI POWER CHALLENGE

• POWER CHALLENGE XL: 2-36 CPUs; 16 GB memory (for 36 CPUs); bus performance up to 1.2 GB/s
• Runs a 64-bit OS (IRIX 6.2)
• Common memory is shared, which is suitable for single-address-space programming
Distributed Memory Machine

• Consists of multiple computers (nodes)
• Nodes communicate by message passing
• Each node is an autonomous computer: processor(s) (possibly an SMP), local memory, disks, network adapter, and other I/O peripherals
• No remote memory access (NORMA)
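By contrast with the shared-memory sketch earlier, here is a minimal message-passing sketch in Python, with two processes standing in for two nodes (a real machine such as the SP2 would use a message-passing library like MPI; this is only an illustration):

```python
# Two processes with disjoint address spaces exchange data only via messages.
from multiprocessing import Process, Pipe

def node(conn, rank):
    if rank == 0:
        conn.send([1, 2, 3])              # explicit send to the other node
        print("node 0 got", conn.recv())  # prints: node 0 got 6
    else:
        data = conn.recv()                # no shared memory: data arrives by message
        conn.send(sum(data))

if __name__ == "__main__":
    a, b = Pipe()
    p0 = Process(target=node, args=(a, 0))
    p1 = Process(target=node, args=(b, 1))
    p0.start(); p1.start()
    p0.join(); p1.join()
```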
Distributed Memory Machine: IBM SP2

• SP2 = Scalable POWERparallel System
• Developed from the RISC System/6000 workstation
• POWER2 processor, 66.6 MHz, 266 MFLOPS
SP2 - High Performance Switch

• 8x8 switches route traffic among the nodes simultaneously and quickly
• Maximum 40 MB/s point-to-point bandwidth
SP2 - Nodes (POWER2 processor)

Two types of nodes:
• Thin node (smaller capacity, used to process individual jobs): 4 Micro Channel slots, 96 KB cache, 64-512 MB memory, 1-4 GB disk
• Wide node (larger capacity, used as servers of the system): 8 Micro Channel slots, 288 KB cache, 64-2048 MB memory, 1-8 GB disk
SP2

The largest SP machine (P2SC, 120 MHz): Pacific Northwest National Lab., U.S., 512 processors (TOP 26, 1998).
What's a Cluster?

A cluster is a group of whole computers that work cooperatively as a single system to provide fast and efficient computing services.
Clusters

Advantages:
• Cheaper
• Easy to scale
• Coarse-grain parallelism
Disadvantages:
• Poor communication performance (typically the latency) compared with other parallel systems
TOP 500 (1997)

• TOP 1: Intel ASCI Red, Sandia Nat'l Lab., USA, June 1997
• TOP 2: Hitachi/Tsukuba CP-PACS (2048 processors), 0.368 Tflops, Univ. of Tsukuba, Japan, 1996
• TOP 3: SGI/Cray T3E 900 LC696-128 (696 processors), 0.264 Tflops, Meteorological Office, UK, 1997
TOP 500 (June 1998)

• TOP 1: Intel ASCI Red (9152 Pentium Pro processors, 200 MHz), 1.3 Tflops, Sandia Nat'l Lab., U.S., since June 1997
• TOP 2: SGI/Cray T3E 1200 LC1080-512, 1080 processors, 0.891 Tflops, U.S. government, installed 1998
• TOP 3: SGI/Cray T3E 900 LC1248-128, 1248 processors, 0.634 Tflops, U.S. government
Intel ASCI Red

• Compute nodes: 4536 (each a dual Pentium Pro 200 MHz sharing a 533 MB/s bus)
• Peak speed: 1.8 Tflops (trillion: 10^12)
• 1,600 square feet, 85 cabinets
Intel ASCI Red (Network)

• Split 2-D mesh interconnect
• Node-to-node bidirectional bandwidth: 800 MB/s
• About 10 times faster than the SP2 switch (400 MB/s one-way vs. the SP2's 40 MB/s)
Cray T3E 1200

• Processor performance: 600 MHz, 1200 Mflops
• Overall system peak performance: 7.2 Gflops to 2.5 Tflops, scaling to thousands of processors
• Interconnect: a three-dimensional bidirectional torus (peak interconnect speed of 650 MB/s)
• Cray UNICOS/mk distributed OS
• Scalable GigaRing I/O system
TOP 500 (Asia)

1996:
• Japan: (1) SR2201/1024 (1996)
• Taiwan: (76) SP2/80
• Korea: (97) Cray Y-MP/16
• China: (231) SP2/32
• Hong Kong: (232) SP2/32 (HKU/CC)
1997:
• Japan: (2) CP-PACS/2048 (1996); (5) SR2201/1024 (1996)
• Korea: (34) T3E 900 LC128-128; (154) Ultra HPC 1000
• Taiwan: (167) SP2/80
• Hong Kong: (426) SGI Origin 2000 (CU)
• (500) SP2/38 (UCLA)
TOP500 Asia (1998)

• Japan: TOP 6: CP-PACS/2048 (1997, TOP 2); TOP 12: NEC SX-4/128M4; TOP 13: NEC SX-4/128H4; TOP 14: Hitachi SR2201/1024; ...more
• Korea: SGI/Cray T3E 900 LC128-128 (TOP 52), ...
• Taiwan: IBM SP2/110, 1998 (TOP 241)