Concepts of Parallel Computing
Intro. to Parallel Computing, Spring 2007

Alf Wachsmann
Stanford Linear Accelerator Center (SLAC)
alfw@slac.stanford.edu

Why do it in parallel?
Why is parallel computing a good idea?
  One worker needs 3 days to dig a ditch. How long do 3 workers need?
Parallel computing is (in the most general sense) the simultaneous use of multiple compute resources to solve a computational problem.
  But beware: one tree takes 30 years to grow big. How long do 3 trees need?

Parallel Addition
A diagram in space and time, abstracting from communication (the hard part!):

[Diagram: adding the numbers 1 through 16 on 8 processors in four steps of wall-clock time:
  step 1: 1+2, 3+4, ..., 15+16 on processors 1-8 gives 3, 7, 11, 15, 19, 23, 27, 31
  step 2: pairwise sums give 10, 26, 42, 58
  step 3: 36 and 100
  step 4: the total, 136]

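As a concrete illustration (not from the original slides), here is a minimal C/OpenMP sketch of the same reduction; it assumes a shared-memory machine and lets the runtime build the combining tree:

    #include <stdio.h>

    int main(void)
    {
        int sum = 0;

        /* Each thread sums part of 1..16; OpenMP combines the
           partial sums, conceptually like the tree in the diagram. */
        #pragma omp parallel for reduction(+:sum) num_threads(8)
        for (int i = 1; i <= 16; i++)
            sum += i;

        printf("sum = %d\n", sum);   /* prints sum = 136 */
        return 0;
    }
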
Why do it in parallel?
Algorithmic reasons:
  Save time (wall-clock time); note that parallelism does NOT save work!
  Solve larger problems (more memory)
Systemic reasons:
  Transmission speed (speed of light)
  Limits to miniaturization
  Economic limits

Maximum Gain
The gain from doing it in parallel is

    speedup = (running time of best serial algorithm) / (running time of parallel algorithm)

Ideally: use P processors and get P-fold speedup.
Linear speedup in P is the best we can hope for!
There are cases of super-linear speedup (e.g., when partitioning makes each task's data fit into cache).

Sequential Computer
Architecture of serial computers:
[Diagram: a CPU with fetch and execute stages attached to memory]

Von Neumann architecture:
  Memory stores both program and data
  The CPU gets instructions and/or data from memory
  It decodes the instructions
  It executes them sequentially

Parallel Computers
A widely used classification for parallel computers is Flynn's Taxonomy (1966):
  SISD: Single Instruction, Single Data
  SIMD: Single Instruction, Multiple Data
  MISD: Multiple Instruction, Single Data
  MIMD: Multiple Instruction, Multiple Data

Memory Architectures
Another important classification scheme is according to the parallel computer's memory architecture:
  Shared memory
    Uniform memory access (UMA)
    Non-uniform memory access (NUMA)
  Distributed memory
  Hybrid distributed-shared memory solutions

Shared Memory
Multiple processors can operate independently but share the same memory resources
Changes in a memory location effected by one processor are visible to all other processors (global address space)

[Diagram: four CPUs connected to one shared memory]

Uniform Memory Access
Most commonly represented today by Symmetric Multiprocessor (SMP) machines:
  Identical processors
  Equal access and access times to memory
Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Non-Uniform Memory Access
Often made by physically linking two or more SMPs
One SMP can directly access the memory of another SMP
Not all processors have equal access time to all memories
Memory access across the link is slower
If cache coherency is maintained, it may also be called CC-NUMA (Cache Coherent NUMA)

Distributed Memory
Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors
Distributed memory systems require a communication network to connect inter-processor memory
The network "fabric" used for data transfer varies widely; it can be as simple as Ethernet

[Diagram: four nodes, each with its own CPU and memory, connected by a network]

Comparison
Shared Memory
  Advantages:
    Global address space
    Data sharing between tasks is both fast and uniform
  Disadvantages:
    Lack of scalability between memory and CPUs
    Programmer is responsible for synchronization
    Expensive

Distributed Memory
  Advantages:
    Memory is scalable with the number of processors
    Each processor can rapidly access its own memory
  Disadvantages:
    NUMA access times
    Programmer is responsible for many details
    Difficult to map existing data structures

Constellations
Hybrid distributed-shared memory is used in most of today's parallel computers:
  Cache-coherent SMP nodes
  The distributed-memory part is the network connecting the SMP nodes

[Diagram: four SMP nodes, each with four CPUs sharing one memory, connected by a network]

Example Machines
Comparison of shared and distributed memory architectures:

CC-UMA
  Examples: SMPs; Sun Fire Exxx/Vxxx; DEC/Compaq; SGI Challenge; IBM POWER3
  Communications: MPI, Threads, OpenMP, shmem
  Scalability: to 10s of processors
  Drawbacks: limited memory bandwidth
  Software availability: declining

CC-NUMA
  Examples: SGI Origin/Altix; Sequent; HP Exemplar; DEC/Compaq; IBM POWER4
  Communications: MPI, Threads, OpenMP, shmem
  Scalability: to 100s of processors
  Drawbacks: new architecture
  Software availability: stable

Distributed
  Examples: Cray T3E; Maspar; IBM SP; IBM Blue Gene/L; Beowulf clusters
  Communications: MPI
  Scalability: to 1000s of processors
  Drawbacks: point-to-point communication; system administration; programming is hard to develop and maintain
  Software availability: still rising

Parallel Programming Models
An abstraction above the hardware and memory architecture
Several programming models are in use:
  Shared memory (parallel computing)
  Threads
  Message passing (distributed computing)
  Data parallel
  Hybrid approaches
All models exist for all hardware/memory architectures

Shared Memory Model
Tasks share a common address space, which they read and write asynchronously
Access control to shared memory is via locks or semaphores
No notion of ownership of data: no need to explicitly communicate data between tasks
Implementations:
  On shared memory machines: the compiler
  On distributed memory machines: simulations of shared memory

Threads Model
A single process has multiple, concurrent execution paths
Most commonly used on shared memory machines and in operating systems

[Diagram: a program prg.exe starts as one thread of execution; threads T1..T4 are spawned over time around the parallelizable loop:]

    call sub1
    call sub2
    do i = 1, n
        A(i) = fnct(i^3)
        B(i) = A(i) * p
    end do
    call sub3
    call sub4

Threads Model
Implementations:
  POSIX Threads library:
    C language only
    Offered for most hardware
    Very explicit parallelism
    Requires significant programmer attention to detail
  OpenMP:
    Based on compiler directives; can use sequential code
    Fortran, C, C++
    Portable / multi-platform
    Can be very easy and simple to use (see the sketch below)

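As an illustration (assumptions: C instead of the slide's Fortran, and fnct() as a hypothetical per-element function), the loop from the previous slide could be parallelized with a single OpenMP directive:

    #define N 1000
    double A[N], B[N], p = 3.14;

    double fnct(double x);   /* hypothetical per-element function */

    void compute(void)
    {
        /* The directive is the only change to the sequential code:
           iterations are distributed across the threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            A[i] = fnct((double)i * i * i);   /* i^3, as in the slide */
            B[i] = A[i] * p;
        }
    }

Compile with the compiler's OpenMP flag (e.g., gcc -fopenmp); without it, the pragma is ignored and the code runs sequentially.
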
Message Passing Model
Tasks exchange data through communications, by sending and receiving messages
This usually requires cooperative operations to be performed by each process: a send operation must have a matching receive operation

[Diagram: task 0 on machine A calls send(data); task 1 on machine B calls receive(data); the data travels over the network]

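A minimal sketch of the pattern in the diagram, using MPI in C (run with two tasks, e.g. mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* task 0: send */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {         /* task 1: matching receive */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("task 1 received %d\n", data);
        }

        MPI_Finalize();
        return 0;
    }
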
Message Passing Model
Implementations:
  Parallel Virtual Machine (PVM): not much in use any more
  Message Passing Interface (MPI):
    Part 1 released in 1994
    Part 2 (MPI-2) released in 1997
    http://www-unix.mcs.anl.gov/mpi/
    Now the de-facto standard
    Fortran, C, C++
    Available on virtually all machines: OpenMPI, MPICH, LAM/MPI, and many vendor-specific versions
    On shared memory machines, MPI implementations usually don't use a network for task communications

Data Parallel Model
A set of tasks works collectively on the same data structure
Each task works on a different partition of that same data structure

Data Parallel Model
Implementations:
  Fortran 90:
    ISO/ANSI extension of Fortran 77
    Additions to program structure and commands
    Variable additions: methods and arguments
  High Performance Fortran (HPF):
    Contains everything in F90
    Adds directives that tell the compiler how to distribute data
    Adds data parallel constructs (now part of F95)
    On distributed memory machines: translated into MPI code

Hybrid Programming Models
Two or more of the previous models are used in the same program
Common examples:
  POSIX Threads and Message Passing (MPI)
  OpenMP and MPI; also ClusterOpenMP (Intel)
  These combinations work well on a network of SMP machines
Also used:
  Data Parallel and MPI

Designing Parallel Programs
There are no real parallelizing compilers:
  The compiler knows how to parallelize certain constructs (e.g., loops)
  The compiler uses directives from the programmer
It is not simply a matter of taking a sequential algorithm and making it parallel; sometimes a completely different algorithmic approach is necessary
It is a very time-consuming and labor-intensive task

Parallelization Techniques
Domain decomposition:
  The data is partitioned; each task works on a different part of the data
  There are three different ways to partition the data (see the sketch below)

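The slide's partitioning figure is not reproduced here; as an assumption, the three standard schemes are block, cyclic, and block-cyclic. A sketch of the first two in C (t is this task's id, ntasks the number of tasks, work() a hypothetical per-element routine):

    void work(double x);   /* hypothetical per-element routine */

    /* Block: task t gets one contiguous chunk of data[0..n-1]. */
    void block_partition(double *data, int n, int t, int ntasks)
    {
        int chunk = (n + ntasks - 1) / ntasks;   /* ceiling division */
        int lo = t * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            work(data[i]);
    }

    /* Cyclic: task t gets every ntasks-th element. */
    void cyclic_partition(double *data, int n, int t, int ntasks)
    {
        for (int i = t; i < n; i += ntasks)
            work(data[i]);
    }

Block-cyclic deals out fixed-size blocks round-robin, combining the locality of block with the load balance of cyclic.
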
Parallelization Techniques
Functional decomposition:
  The problem is partitioned into a set of independent tasks
Both types of decomposition can be, and often are, combined

A little Theory
Some problems can be parallelized very well. In complexity theory, the class NC ("Nick's Class") is the set of decision problems decidable in poly-logarithmic time on a parallel computer with a polynomial number of processors. In other words, a problem is in NC if there are constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors.

Source: http://en2.wikipedia.org/wiki/Class_NC

A little Theory
Some problems can't be parallelized at all! Example: calculating the Fibonacci sequence (1, 1, 2, 3, 5, 8, 13, 21, ...) by using the formula

    F(1) = 1
    F(2) = 1
    F(k+2) = F(k+1) + F(k)

The calculation entails dependent calculations: the value at k+2 uses those at both k+1 and k. These three terms cannot be calculated independently and therefore cannot be parallelized.

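To see why, a small C sketch (assuming n >= 2); the loop-carried dependence forces the iterations to run in order:

    /* Iteration k+2 reads the results of iterations k+1 and k,
       a loop-carried dependence: the iterations cannot run in parallel. */
    long fib(int n)
    {
        long F[n + 1];                 /* F[1..n], 1-based as in the slide */
        F[1] = 1;
        F[2] = 1;
        for (int k = 1; k + 2 <= n; k++)
            F[k + 2] = F[k + 1] + F[k];
        return F[n];
    }
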
Communication
Decomposed problems typically need to communicate:
  Partial results need to be combined
  Changes to neighboring data have effects on a task's data
Some problems don't need communication at all: embarrassingly parallel problems

Cost of Communication
Communicating data takes time:
  Inter-task communication has overhead
  Often synchronization is necessary
Communication is much more expensive than computation:
  Communicating data needs to save a lot of computation before it pays off
  Infiniband needs < 10 µs to set up a communication
  A 2.4 GHz AMD Opteron CPU needs ~0.4 ns to perform one floating point operation (flop)
  That is about 25,000 floating point operations per communication setup!

Latency - Bandwidth
Latency: the amount of time for the first bit of data to arrive at the other end
Bandwidth: how much data fits through per time unit

Cost of Communication
Formula for the time needed to transmit data:

    cost = L + N/B

where L = latency [s], N = number of bytes [byte], B = bandwidth [byte/s], and cost is in seconds.

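A worked example under assumed Infiniband-like numbers from the previous slides (L = 10 µs, B = 900 MB/s):

    /* cost = L + N/B, all in consistent units (seconds, bytes). */
    double transfer_time(double L, double N, double B)
    {
        return L + N / B;
    }

    /* transfer_time(10e-6, 1e6, 900e6) ≈ 0.00112 s ≈ 1.12 ms:
       a 1 MB message is bandwidth-dominated.
       transfer_time(10e-6, 8, 900e6)   ≈ 10.01e-6 s:
       an 8-byte message is almost pure latency. */
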
Visibility of Communication
With MPI, communication is explicit and very visible
Latency hiding: communicate and, at the same time, do some other computation
  Implemented via parallel threads or non-blocking MPI communication functions (see the sketch below)
  Makes programs faster but more complex

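A minimal latency-hiding sketch with non-blocking MPI (names like halo and neighbor are hypothetical):

    #include <mpi.h>

    void exchange_and_compute(double *halo, int n, int neighbor,
                              double *local, int m)
    {
        MPI_Request req;

        /* Start a non-blocking receive of the neighbor's data. */
        MPI_Irecv(halo, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

        /* Meanwhile, compute on data that does not depend on it. */
        for (int i = 0; i < m; i++)
            local[i] *= 2.0;          /* stand-in for real work */

        /* Block only when the incoming data is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
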
Scope of Communication
Knowing which tasks must communicate with each other is critical during the design stage of a parallel program:
  Point-to-point: involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer
  Collective: involves data sharing between more than two tasks, which are often specified as being members of a common group, or collective

Communication Hardware
Interconnect architectures (approximate figures):

  Myrinet (http://www.myricom.com/): proprietary but commodity; sustained one-way bandwidth for large messages ~1.2 GB/s; latency for short messages ~3 µs
  Infiniband (http://www.infinibandta.org/): vendor-independent standard; ~900 MB/s (4x HCAs); ~10 µs
  Quadrics QsNet (http://www.quadrics.com/): expensive, proprietary; ~900 MB/s; ~2 µs
  Gigabit Ethernet: commodity; ~100 MB/s; ~60 µs
  Custom interconnects: SGI, IBM, Cray, Sun, Compaq, ...

Communication Hardware
Measured point-to-point performance:

  Mellanox MHGA28 (InfiniBand): latency 2.25 µs; peak bandwidth 1502 MB/s; N/2 = 512 bytes (750 MB/s); CPU overhead ~5%
  QLogic InfiniPath HT (InfiniBand): latency 1.3 µs; peak bandwidth 954 MB/s; N/2 = 385 bytes (470 MB/s); CPU overhead ~40%
  Myrinet F (proprietary): latency 2.6 µs; peak bandwidth 493 MB/s; N/2 = 2000 bytes (250 MB/s); CPU overhead ~10%
  Myrinet 10G (proprietary): latency 2.0 µs; peak bandwidth 1200 MB/s; N/2 = 2000 bytes (600 MB/s); CPU overhead ~10%
  Quadrics QM500 (proprietary): latency 1.6 µs; peak bandwidth 910 MB/s; N/2 = 1000 bytes (450 MB/s); CPU overhead ~50%
  Gigabit Ethernet: latency 30-100 µs; peak bandwidth 125 MB/s; N/2 = 8000 bytes (60 MB/s); CPU overhead >50%
  Chelsio T210-CX (10 GigE): latency 9.6 µs; peak bandwidth 860 MB/s; N/2 = 100,000 bytes (430 MB/s); CPU overhead ~50%

N/2 is the message size needed to achieve half the peak bandwidth.
Sources: Mellanox Technology testing; Ohio State University; PathScale, Myricom, Quadrics, and Chelsio websites; http://www.mellanox.com/applications/performance_benchmarks.php

Synchronization
Synchronization is handshaking between tasks that are sharing data. Types of synchronization:
Barrier
  Usually implies that all tasks are involved
  Each task performs its work until it reaches the barrier; it then stops, or "blocks"
  When the last task reaches the barrier, all tasks are synchronized
  Used in MPI (see the sketch below)

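A barrier sketch in MPI/C (do_local_work() and use_all_results() are hypothetical):

    #include <mpi.h>

    void do_local_work(void);
    void use_all_results(void);

    void phase_boundary(void)
    {
        do_local_work();              /* each task computes its part    */
        MPI_Barrier(MPI_COMM_WORLD);  /* block until every task arrives */
        use_all_results();            /* safe: all tasks have finished  */
    }
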
Synchronization
More types: Lock/Semaphore
  Can involve any number of tasks
  Typically used to serialize (protect) access to global data or a section of code; only one task at a time may use (own) the lock/semaphore/flag
  The first task to acquire the lock "sets" it; this task can then safely (serially) access the protected data or code
  Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it
  Can be blocking or non-blocking
  Used with threads and shared memory (see the sketch below)

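A lock sketch with POSIX threads in C; pthread_mutex_lock blocks, pthread_mutex_trylock is the non-blocking variant:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;          /* shared (global) data */

    void *worker(void *arg)
    {
        /* Only the task owning the lock may touch the counter,
           so the update is serialized. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
        return NULL;
    }
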
Synchronization
More types: Synchronous communication operations
  Involve only those tasks executing a communication operation
  When a task performs a communication operation, some form of coordination is required with the other task(s) participating in the communication. For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send.

Granularity
A qualitative measure of the computation/communication ratio
Typically, periods of computation are separated from periods of communication by synchronization events
Fine-grain parallelism: a small amount of computation between communications
Coarse-grain parallelism: a large amount of computation between communications

Granularity
Fine-grain:
  Low computation-to-communication ratio
  Facilitates load balancing
  High communication overhead; less opportunity for performance enhancement
Coarse-grain:
  High computation-to-communication ratio
  More opportunity for performance increase
  Harder to load balance efficiently

Data In- and Output
Parallel computers with thousands of nodes can handle huge amounts of data
It is hard to get this data in and out of the nodes:
  Parallel I/O systems are still fairly new and not available for all platforms
  I/O over the network (like NFS) causes severe bottlenecks
Help can be found in parallel file systems: Lustre, PVFS2, GPFS (IBM)
MPI-2 provides support for parallel file systems (see the sketch below)
Rule #1: Reduce overall I/O as much as possible!

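A sketch of MPI-2 parallel I/O in C; each task writes its own block of one shared file at a rank-dependent offset (the file name is hypothetical):

    #include <mpi.h>

    void write_results(double *buf, int count)
    {
        int rank;
        MPI_File fh;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, "results.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * count * sizeof(double),
                          buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }
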
Efficiency
Speedup:

    S_p = T_s / T_p

where T_s is the running time of the serial algorithm and T_p the running time on p processors.

Efficiency:

    Efficiency = S_p / p

  A value between zero and one
  Estimates how well-utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization
  Linear speedup, and algorithms running on a single processor, have an efficiency of 1
  Many difficult-to-parallelize algorithms have an efficiency such as 1/log p, which approaches zero as the number of processors increases

Limits and Costs
Besides theoretical limits and hardware limits, there are practical limits to parallel computing.
Amdahl's Law states that the potential program speedup is determined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 - P)

If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is infinite (in theory). If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast.

Limits and Costs
Introducing the number of processors performingthe parallel fraction of work, Amdahl's Law can bereformulated as
speedup=1
P
NS
N = number of processors,
P = parallel fraction andS=1-P = serial fraction
Speedup
http://upload.w
ikimedia.org/wikipedia/en
/7/7a/Amdahl-law.jpg
N P=0.50 P=0.90 P=0.99 P=1.0
10 1.82 5.26 9.17 10100 1.98 9.17 50.25 100
1000 1.99 9.91 90.99 1000
10000 1.99 9.99 99.02 10000
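The table can be reproduced with a few lines of C (a sketch, not from the slides):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (P/N + S) with S = 1 - P. */
    static double amdahl(double P, double N)
    {
        return 1.0 / (P / N + (1.0 - P));
    }

    int main(void)
    {
        const double Ps[] = { 0.50, 0.90, 0.99, 1.0 };
        for (int n = 10; n <= 10000; n *= 10) {
            printf("%6d", n);
            for (int i = 0; i < 4; i++)
                printf(" %9.2f", amdahl(Ps[i], n));
            printf("\n");
        }
        return 0;
    }
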
Typical Parallel Applications
Applications that are well suited for parallel computers include:
  Weather and ocean patterns
  Finite Element Method (FEM; e.g., crash tests for cars)
  Fluid dynamics, aerodynamics
  Simulation of electromagnetic problems

Summary
Overview of parallel computing concepts:
  Hardware
  Software
  Programming
Problems of parallel computing:
  Communication is expensive (latency)
  I/O is expensive
Techniques to work around these problems:
  Problem decomposition (communicate larger chunks of data, less often)
  Parallel file systems plus supporting hardware
  $$$$ (a faster communication fabric)

Acknowledgment/References
Most of this talk is taken from http://www.llnl.gov/computing/tutorials/parallel_comp/
Theory book: Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes by F. Thomson Leighton
Hardware book: Computer Architecture: A Quantitative Approach (3rd edition) by John L. Hennessy, David A. Patterson, and David Goldberg
http://www.top500.org/