Lecture 1 Parallel Processing for Scientific Applications

Transcript

1

Lecture 1 Parallel Processing for Scientific Applications

2

Parallel Computing

Multiple processes cooperating to solve a single problem

3

Why Parallel Computing ?

Easy to get huge computational problems:

physical simulation in 3D: 100 x 100 x 100 = 10^6

oceanography example: 48 M cells, several variables per cell, one time step = 30 Gflop (30,000,000,000 floating point operations)
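For scale (derived from the slide's own numbers): 30 Gflop per time step over 48 M cells is roughly 600 floating point operations per cell per step.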

4

Why Parallel Computing ?

Numerical Prototyping:
real phenomena are too complicated to model;
real experiments are too hard, too expensive, or too dangerous for a laboratory.

Examples: simulating aging effects on nuclear weapons (ASCI Project), oil reservoir simulation, large wind tunnels, galactic evolution, whole factory or product life-cycle design and optimization, DNA matching (bioinformatics)

5

An Example -- Climate Prediction

Grid point

6

An Example -- Climate Prediction

What is Climate? Climate(longitude, latitude, height, time) returns a vector of 6 values:

• temperature, pressure, humidity, and wind velocity (3 words)

Discretize: only evaluate on grid points: Climate(i, j, k, n), where t = n*dt, dt is a fixed time step, n is an integer, and i, j, k are integers indexing the grid cells.
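As a data structure, the discretized climate is just a multi-dimensional array of six-value records, one per grid cell, indexed exactly like Climate(i, j, k, n). A minimal sketch in C; the grid dimensions NI, NJ, NK and the sample temperature are illustrative placeholders, not the lecture's values:

#include <stdio.h>

/* One grid cell holds the six state variables of Climate(i, j, k, n). */
typedef struct {
    double temperature;
    double pressure;
    double humidity;
    double wind[3];                 /* wind velocity: 3 components (3 words) */
} CellState;

/* Illustrative grid dimensions -- placeholders, not the lecture's sizes. */
#define NI 100
#define NJ 100
#define NK 10

static CellState climate[NI][NJ][NK];   /* state at the current time step n */

int main(void)
{
    climate[0][0][0].temperature = 288.0;   /* assumed sample value, in kelvin */
    printf("T(0,0,0) = %.1f\n", climate[0][0][0].temperature);
    return 0;
}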

7

An Example -- Climate Prediction

Area: 3000 x 3000 miles, Height: 11 miles --- a 3000 x 3000 x 11 cubic-mile domain

Segment size: 0.1 x 0.1 x 0.1 cubic miles -- 10^11 different segments

Two-day period, dt = 0.5 hours (2 x 24 x 2 = 96 time steps); 100 instructions per segment

The computation of parameters inside a segment uses the initial values and the values from neighboring segments
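Checking the slide's numbers: the domain is 3000 x 3000 x 11, roughly 10^8 cubic miles; each segment is 0.1 x 0.1 x 0.1 = 10^-3 cubic miles, so the domain contains about 10^8 / 10^-3 = 10^11 segments. A two-day run at dt = 0.5 hours gives 2 x 24 x 2 = 96 time steps.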

8

An Example -- Climate Prediction

A single update of the parameters in the entire domain requires 10^11 x 100, or 10^13 instructions (10 trillion instructions). Updating 96 times -- about 10^15 instructions.

Single-CPU supercomputer: 1000 MHz RISC CPU
Execution time: about 280 hours.

??? Taking 280 hours to predict the weather for the next 48 hours.
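Checking the arithmetic: 10^11 segments x 100 instructions = 10^13 instructions per update, and 96 updates give roughly 10^15 instructions. A 1000 MHz CPU executing about 10^9 instructions per second therefore needs on the order of 10^15 / 10^9 = 10^6 seconds, i.e. roughly 280 hours.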

9

Issues in Parallel Computing

Design of Parallel Computers
Design of Efficient Algorithms
Methods for Evaluating Parallel Algorithms
Parallel Programming Languages
Parallel Programming Tools
Portable Parallel Programs
Automatic Programming of Parallel Computers

10

Some Basic Studies

11

Design of Parallel Computers

Parallel computing is information processing that emphasizes the concurrent manipulation of data elements belonging to one or more processes solving a single problem [Quinn:1994]

Parallel computer: a multiple-processor computer capable of parallel computing.

12

Efficient Algorithms

Throughput: the number of results per second

Speedup: S = T1 / Tp (T1 = execution time on one processor, Tp = execution time on P processors)

Efficiency: E = S / P (P = number of processors)
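For example (illustrative numbers, not from the lecture): if a program takes T1 = 100 s on one processor and Tp = 30 s on P = 4 processors, the speedup is S = 100 / 30 ≈ 3.3 and the efficiency is E = 3.3 / 4 ≈ 0.83, i.e. each processor does useful work about 83% of the time.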

13

Scalability

Algorithmic scalability: an algorithm is scalable if the available parallelism increases at least linearly with problem size.

Architectural scalability: an architecture is scalable if it continues to yield the same performance per processor as the number of processors and the problem size are increased.

Solve larger problems in the same amount of time by buying a parallel computer with more processors. ($$$$$ ??)

14

Parallel Architectures

SMP: Symmetric Multiprocessor (SGI Power Challenge, SUN Enterprise 6000)

MPP: Massively Parallel Processors
INTEL ASCI Red: 9152 processors (1997)
SGI/Cray T3E 1200 LC1080-512: 1080 nodes (1998)

Cluster: true distributed systems -- tightly-coupled software on loosely-coupled (LAN-based) hardware
NOW (Network of Workstations), COW (Cluster of Workstations), Pile-of-PCs (PoPC)

15

Levels of Abstraction

Applications (Sequential? Parallel?)

Programming Models (Shared Memory? Message Passing?)

Address Space (Shared Memory? Distributed Memory?)

Hardware Architecture

16

Is Parallel Computing Simple ?

17

A Simple Example

Take a paper and pen. Algorithm:

Step 1: Write a number on your pad
Step 2: Compute the sum of your neighbors' values
Step 3: Write the sum on the paper

18

** Questions 1

How do you get values from your neighbors?

19

Shared Memory Model

(Diagram: the numbers 5, 0, 4 are all written on one shared pad that every participant can read directly.)
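In code, the shared memory version of the game lets every participant read the pad directly. Below is a minimal sketch in C with POSIX threads; the thread count, the chosen numbers, and the ring of neighbors are illustrative assumptions, not the lecture's code:

#include <pthread.h>
#include <stdio.h>

#define N 4                      /* number of participants -- illustrative */

static int value[N];             /* the shared pad: everyone's number      */
static int sum[N];               /* each participant's computed sum        */
static pthread_barrier_t bar;    /* separates Step 1 from Step 2           */

static void *worker(void *arg)
{
    int me = *(int *)arg;

    value[me] = me + 3;                     /* Step 1: write a number (arbitrary) */
    pthread_barrier_wait(&bar);             /* wait until everyone has written    */

    int left  = (me - 1 + N) % N;           /* Step 2: read both neighbors        */
    int right = (me + 1) % N;               /* directly from the shared array     */
    sum[me] = value[left] + value[me] + value[right];
    return NULL;                            /* Step 3: sum[me] is "written down"  */
}

int main(void)
{
    pthread_t t[N];
    int id[N];

    pthread_barrier_init(&bar, NULL, N);
    for (int i = 0; i < N; i++) { id[i] = i; pthread_create(&t[i], NULL, worker, &id[i]); }
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++) printf("participant %d: sum = %d\n", i, sum[i]);
    pthread_barrier_destroy(&bar);
    return 0;
}

Compile with cc -pthread; the barrier keeps any thread from starting Step 2 before every thread has finished Step 1.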

20

Message Passing Model

(Diagram: each number is on a private pad; a process must ask its neighbor -- "Hey!! What's your number?" -- and wait for the reply.)
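In the message passing version there is no shared pad: each process must ask its neighbors for their numbers and wait for the replies. A minimal sketch of the same game in C with MPI, again assuming a ring of neighbors and illustrative values:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, my, from_left, from_right;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    my = rank + 3;                            /* Step 1: write a number (arbitrary) */

    int left  = (rank - 1 + size) % size;     /* neighbors on a ring                */
    int right = (rank + 1) % size;

    /* Step 2: "Hey!! What's your number?" -- exchange values with both neighbors. */
    MPI_Sendrecv(&my, 1, MPI_INT, right, 0,
                 &from_left, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&my, 1, MPI_INT, left, 1,
                 &from_right, 1, MPI_INT, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 3: write down the sum. */
    printf("rank %d: sum = %d\n", rank, from_left + my + from_right);

    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 4 ./a.out. Each MPI_Sendrecv pairs a send with the matching receive, so every process both answers its neighbors' requests and obtains their numbers without deadlock.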

21

** Questions 2

Are you sure the sum is correct ?

22

Some processor starts earlier

5+0+4 = 9

23

Synchronization Problem !!

(Diagram: in Step 3 the early processor has already overwritten its number with the sum 9; a neighbor still in Step 2 reads that 9 instead of the original value and computes 9 + 5 + 0 = 14 -- the wrong sum.)
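(The usual fix is to separate the phases: no one starts Step 2 before everyone has finished Step 1, and no one overwrites its number before everyone has read it. A barrier between the phases -- pthread_barrier_wait in the shared memory sketch above, or MPI_Barrier in MPI -- or writing the sum to a separate location instead of over the original number achieves this.)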

24

** Questions 3

How do you decide when you are done? (throw away the paper)

25

Some processor finished earlier

5+0+4 = 9

26

Some processor finished earlier

9

27

Some processor finished earlier

Sorry !! We closed !!

28

Some processor finished earlier

Sorry !! We closed !!

(Diagram: a neighbor still in Step 2 asks for the discarded value and cannot finish its sum: ? + 5 + 0 = ?)

29

Classification of Parallel Architectures

30

1. Based on Control Mechanism

Flynn's Classification (based on data and instruction streams):
SISD: single instruction stream, single data stream
SIMD: single instruction stream, multiple data streams
MIMD: multiple instruction streams, multiple data streams
MISD: multiple instruction streams, single data stream

31

SIMD

Examples: Thinking Machines CM-1, CM-2; MasPar MP-1 and MP-2

Simple processor: e.g., 1- or 4-bit CPU
Fast global synchronization (global clock)
Fast neighborhood communication
Applications: image/signal processing, numerical analysis, data compression, ...

32

2. Based on Address-space organization

Bell's Classification of MIMD architectures

Message-passing architecture
• local or private memory
• multicomputer = MIMD message-passing computer (or distributed-memory computer)

Shared-address-space architecture
• hardware support for one-sided communication (read/write)
• multiprocessor = MIMD shared-address-space computer

33

Address Space

A region of a computer's total memory within which addresses are contiguous and may refer to one another directly by hardware.

A shared memory computer has only one user-visible address space.

A disjoint memory computer can have several. Disjoint memory is more commonly called distributed memory, but the memory of many shared memory computers (multiprocessors) is physically distributed.

34

Multiprocessors vs. Multicomputers

Shared-Memory Multiprocessor Models:
UMA: uniform memory access (all SMP servers)
NUMA: nonuniform memory access (DASH, T3E)
COMA: cache-only memory architecture (KSR)

Distributed-Memory Multicomputer Model:
message-passing network
NORMA model (no remote memory access): IBM SP2, Intel Paragon, TMC CM-5, INTEL ASCI Red, clusters

Parallel Computers at HKU

Symmetric Multiprocessors (SMPs): SGI PowerChallenge
Cluster: IBM PowerPC Clusters
Distributed Memory Machine: IBM SP2
(Locations: CYC 807, CYC LG 102, Computer Center)

36

Symmetric Multiprocessors (SMPs)

Processors are connected to a shared memory module through a shared bus

Each processor has equal right to access:
• the shared memory
• all I/O devices

A single copy of OS

37

(Diagram: a four-processor Pentium Pro (P6) SMP in the CYC414 SRG Lab. The four P6 CPUs share the Pentium Pro processor bus (32-bit address, 64-bit data, 533 MB/s). A DRAM controller and data path with memory interface controllers (MICs) connect the bus to interleaved memory (interleave data: 288 bits; memory data: 72 bits), and a PCI bridge connects it to the PCI bus (32-bit address, 32-bit data, 132 MB/s), which carries PCI devices and a NIC to the network.)

38

SMP Machine: SGI POWER CHALLENGE

POWER CHALLENGE XL
– 2-36 CPUs
– 16 GB memory (for 36 CPUs)
– Bus performance: up to 1.2 GB/sec

Runs a 64-bit OS (IRIX 6.2)
Common memory is shared, which is suitable for single-address-space programming

Distributed Memory Machine
Consists of multiple computers (nodes)
Nodes communicate by message passing
Each node is an autonomous computer:
• Processor(s) (may be an SMP)
• Local memory
• Disks, network adapter, and other I/O peripherals

No-remote-memory-access (NORMA)

40

Distributed Memory Machine: IBM SP2

SP2 => Scalable POWERparallel System
Developed based on the RISC System/6000 workstation
POWER2 processor, 66.6 MHz, 266 MFLOPS

41

SP2 - Message Passing

42

SP2 - High Performance Switch

8x8 Switch
Switches among the nodes simultaneously and quickly
Maximum 40 MB/s point-to-point bandwidth

43

SP2 - Nodes (POWER2 processor)

Two types of nodes:
– Thin node (smaller capacity, used to process individual jobs): 4 Micro Channel slots, 96 KB cache, 64-512 MB memory, 1-4 GB disk
– Wide node (larger capacity, used as servers of the system): 8 Micro Channel slots, 288 KB cache, 64-2048 MB memory, 1-8 GB disk

44

SP2

– The largest SP (P2SC, 120 MHz) machine: Pacific Northwest National Lab. U.S., 512 processors, TOP 26, 1998.

45

What’s a Cluster ?

A cluster is a group of whole computers that works cooperatively as a single system to provide fast and efficient computing service.

(Diagram: Node 1 to Node 4 connected by switched Ethernet; one node asks "I need variable A from Node 2!", Node 2 answers "OK!", and the requester replies "Thank You!")

47

Clusters

Advantages:
Cheaper
Easy to scale
Coarse-grain parallelism

Disadvantages:
Poor communication performance (typically the latency) as compared with other parallel systems

48

TOP 500 (1997)

TOP 1 INTEL: ASCI Red at Sandia Nat’l Lab. USA, June 1997

TOP 2 Hitachi/Tsukuba: CP-PACS (2048 processors), 0.368 Tflops at Univ. Tsukuba Japan, 1996

TOP 3 SGI/Cray: T3E 900 LC696-128 (696 processors), 0.264 Tflops at the UK Meteorological Office, UK, 1997

49

TOP 500 (June, 1998)

TOP 1 INTEL: ASCI Red (9152 Pentium Pro processors, 200 MHz), 1.3 Teraflops at Sandia Nat’l Lab. U.S., since June 1997

TOP 2 SGI/Cray: T3E 1200 LC1080-512, 1080 processors, 0.891 Tflops, U.S. government, installed 1998

TOP 3 SGI/Cray: T3E900 LC1248-128, 1248 processors, 0.634 Tflops, U.S. government

50

INTEL ASCI Red

Compute nodes: 4536 (dual Pentium Pro 200 MHz sharing a 533 MB/s bus)

• Peak speed: 1.8 Teraflops (trillion: 10^12)
• 1,600 square feet
• 85 cabinets

51

INTEL ASCI Red (Network)

Split 2-D Mesh Interconnect

Node-to-node bidirectional bandwidth: 800 Mbytes/sec

10 times faster than SP2 (one-way: 40 MB/s)

52

Cray T3E 1200

Processor performance: 600 MHz, 1200 Mflops
Overall system peak performance: 7.2 gigaflops to 2.5 teraflops, scaling to thousands of processors
Interconnect: a three-dimensional bidirectional torus (peak interconnect speed of 650 MB/sec)
Cray UNICOS/mk distributed OS
Scalable GigaRing I/O system

53

Cray T3E Interconnect 3-D Torus

54

CP-PACS/2048, Japan

Peak Perf. 0.614 TFLOPS

CPU: PA-RISC 1.1, 150 MHz

55

CP-PACS Interconnect

Comm. Bandwidth: 300 MB/s per link

56

TOP 500 (Asia)

1996:
Japan: • (1) SR2201/1024 (1996)
Taiwan: • (76) SP2/80
Korea: • (97) Cray Y-MP/16
China: • (231) SP2/32
Hong Kong: • (232) SP2/32 (HKU/CC)

1997:
Japan: • (2) CP-PACS/2048 (1996) • (5) SR2201/1024 (1996)
Korea: • (34) T3E 900 LC128-128 • (154) Ultra HPC 1000
Taiwan: • (167) SP2/80
Hong Kong: • (426) SGI Origin 2000 (CU)

(500): SP2/38 (UCLA)

57

TOP500 Asia 1998

Japan: TOP 6: CP-PACS/2048 (1997, TOP 2); TOP 12: NEC SX-4/128M4; TOP 13: NEC SX-4/128H4; TOP 14: Hitachi SR2201/1024; ...more

Korea: SGI/Cray T3E900 LC128-128 (TOP 52), ...
Taiwan: IBM SP2/110, 1998 (TOP 241)

58

More Information

TOP500: http://www.top500.org/
ASCI Red: http://www.sandia.gov/ASCI/Red.htm
Cray T3E 1200: http://www.cray.com/products/systems/crayt3e/1200/

Chapter: 1.2-1.4, 2.1-2.4.1

