Concepts of Parallel Computing
Intro. to Parallel Computing, Spring 2007

Alf Wachsmann
Stanford Linear Accelerator Center (SLAC)
alfw@slac.stanford.edu

Why do it in parallel?
Why is parallel computing a good idea?
  One worker needs 3 days to dig a ditch. How long do 3 workers need?
Parallel computing is (in the most general sense) the simultaneous use of multiple compute resources to solve a computational problem.
  But beware: one tree takes 30 years to grow big. How long do 3 trees need?

Parallel Addition
A diagram in space and time, abstracting from communication (the hard part!):

[Diagram: adding the numbers 1 through 16 on 8 processors in four steps of wall-clock time:
  step 1: 1+2, 3+4, ..., 15+16 on processors 1-8 gives 3, 7, 11, 15, 19, 23, 27, 31
  step 2: pairwise sums give 10, 26, 42, 58
  step 3: 36 and 100
  step 4: the total, 136]

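As a concrete illustration (not from the original slides), here is a minimal C/OpenMP sketch of the same reduction; it assumes a shared-memory machine and lets the runtime build the combining tree:

    #include <stdio.h>

    int main(void)
    {
        int sum = 0;

        /* Each thread sums part of 1..16; OpenMP combines the
           partial sums, conceptually like the tree in the diagram. */
        #pragma omp parallel for reduction(+:sum) num_threads(8)
        for (int i = 1; i <= 16; i++)
            sum += i;

        printf("sum = %d\n", sum);   /* prints sum = 136 */
        return 0;
    }
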
Why do it in parallel?
Algorithmic reasons:
  Save time (wall-clock time); note that parallelism does NOT save work!
  Solve larger problems (more memory)
Systemic reasons:
  Transmission speed (speed of light)
  Limits to miniaturization
  Economic limits

Maximum Gain
The gain from doing it in parallel is

    speedup = (running time of best serial algorithm) / (running time of parallel algorithm)

Ideally: use P processors and get P-fold speedup.
Linear speedup in P is the best we can hope for!
There are cases of super-linear speedup (e.g., when partitioning makes each task's data fit into cache).

Sequential Computer
Architecture of serial computers:
[Diagram: a CPU with fetch and execute stages attached to memory]

Von Neumann architecture:
  Memory stores both program and data
  The CPU gets instructions and/or data from memory
  It decodes the instructions
  It executes them sequentially

Parallel Computers
A widely used classification for parallel computers is Flynn's Taxonomy (1966):
  SISD: Single Instruction, Single Data
  SIMD: Single Instruction, Multiple Data
  MISD: Multiple Instruction, Single Data
  MIMD: Multiple Instruction, Multiple Data

Memory Architectures
Another important classification scheme is according to the parallel computer's memory architecture:
  Shared memory
    Uniform memory access (UMA)
    Non-uniform memory access (NUMA)
  Distributed memory
  Hybrid distributed-shared memory solutions

Shared Memory
Multiple processors can operate independently but share the same memory resources
Changes in a memory location effected by one processor are visible to all other processors (global address space)

[Diagram: four CPUs connected to one shared memory]

Uniform Memory Access
Most commonly represented today by Symmetric Multiprocessor (SMP) machines:
  Identical processors
  Equal access and access times to memory
Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Non-Uniform Memory Access
Often made by physically linking two or more SMPs
One SMP can directly access the memory of another SMP
Not all processors have equal access time to all memories
Memory access across the link is slower
If cache coherency is maintained, it may also be called CC-NUMA (Cache Coherent NUMA)

Distributed Memory
Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors
Distributed memory systems require a communication network to connect inter-processor memory
The network "fabric" used for data transfer varies widely; it can be as simple as Ethernet

[Diagram: four nodes, each with its own CPU and memory, connected by a network]

Comparison
Shared Memory
  Advantages:
    Global address space
    Data sharing between tasks is both fast and uniform
  Disadvantages:
    Lack of scalability between memory and CPUs
    Programmer is responsible for synchronization
    Expensive

Distributed Memory
  Advantages:
    Memory is scalable with the number of processors
    Each processor can rapidly access its own memory
  Disadvantages:
    NUMA access times
    Programmer is responsible for many details
    Difficult to map existing data structures

Constellations
Hybrid distributed-shared memory is used in most of today's parallel computers:
  Cache-coherent SMP nodes
  The distributed-memory part is the network connecting the SMP nodes

[Diagram: four SMP nodes, each with four CPUs sharing one memory, connected by a network]

Example Machines
Comparison of shared and distributed memory architectures:

CC-UMA
  Examples: SMPs; Sun Fire Exxx/Vxxx; DEC/Compaq; SGI Challenge; IBM POWER3
  Communications: MPI, Threads, OpenMP, shmem
  Scalability: to 10s of processors
  Drawbacks: limited memory bandwidth
  Software availability: declining

CC-NUMA
  Examples: SGI Origin/Altix; Sequent; HP Exemplar; DEC/Compaq; IBM POWER4
  Communications: MPI, Threads, OpenMP, shmem
  Scalability: to 100s of processors
  Drawbacks: new architecture
  Software availability: stable

Distributed
  Examples: Cray T3E; Maspar; IBM SP; IBM Blue Gene/L; Beowulf clusters
  Communications: MPI
  Scalability: to 1000s of processors
  Drawbacks: point-to-point communication; system administration; programming is hard to develop and maintain
  Software availability: still rising

Parallel Programming Models
An abstraction above the hardware and memory architecture
Several programming models are in use:
  Shared memory (parallel computing)
  Threads
  Message passing (distributed computing)
  Data parallel
  Hybrid approaches
All models exist for all hardware/memory architectures

Shared Memory Model
Tasks share a common address space, which they read and write asynchronously
Access control to shared memory is via locks or semaphores
No notion of ownership of data: no need to explicitly communicate data between tasks
Implementations:
  On shared memory machines: the compiler
  On distributed memory machines: simulations of shared memory

Threads Model
A single process has multiple, concurrent execution paths
Most commonly used on shared memory machines and in operating systems

[Diagram: a program prg.exe starts as one thread of execution; threads T1..T4 are spawned over time around the parallelizable loop:]

    call sub1
    call sub2
    do i = 1, n
        A(i) = fnct(i^3)
        B(i) = A(i) * p
    end do
    call sub3
    call sub4

Threads Model
Implementations:
  POSIX Threads library:
    C language only
    Offered for most hardware
    Very explicit parallelism
    Requires significant programmer attention to detail
  OpenMP:
    Based on compiler directives; can use sequential code
    Fortran, C, C++
    Portable / multi-platform
    Can be very easy and simple to use (see the sketch below)

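As an illustration (assumptions: C instead of the slide's Fortran, and fnct() as a hypothetical per-element function), the loop from the previous slide could be parallelized with a single OpenMP directive:

    #define N 1000
    double A[N], B[N], p = 3.14;

    double fnct(double x);   /* hypothetical per-element function */

    void compute(void)
    {
        /* The directive is the only change to the sequential code:
           iterations are distributed across the threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            A[i] = fnct((double)i * i * i);   /* i^3, as in the slide */
            B[i] = A[i] * p;
        }
    }

Compile with the compiler's OpenMP flag (e.g., gcc -fopenmp); without it, the pragma is ignored and the code runs sequentially.
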
Message Passing Model
Tasks exchange data through communications, by sending and receiving messages
This usually requires cooperative operations to be performed by each process: a send operation must have a matching receive operation

[Diagram: task 0 on machine A calls send(data); task 1 on machine B calls receive(data); the data travels over the network]

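A minimal sketch of the pattern in the diagram, using MPI in C (run with two tasks, e.g. mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* task 0: send */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {         /* task 1: matching receive */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("task 1 received %d\n", data);
        }

        MPI_Finalize();
        return 0;
    }
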
Message Passing Model
Implementations:
  Parallel Virtual Machine (PVM): not much in use any more
  Message Passing Interface (MPI):
    Part 1 released in 1994
    Part 2 (MPI-2) released in 1997
    http://www-unix.mcs.anl.gov/mpi/
    Now the de-facto standard
    Fortran, C, C++
    Available on virtually all machines: OpenMPI, MPICH, LAM/MPI, and many vendor-specific versions
    On shared memory machines, MPI implementations usually don't use a network for task communications

Data Parallel Model
A set of tasks works collectively on the same data structure
Each task works on a different partition of that same data structure

Data Parallel Model
Implementations:
  Fortran 90:
    ISO/ANSI extension of Fortran 77
    Additions to program structure and commands
    Variable additions: methods and arguments
  High Performance Fortran (HPF):
    Contains everything in F90
    Adds directives that tell the compiler how to distribute data
    Adds data parallel constructs (now part of F95)
    On distributed memory machines: translated into MPI code

Hybrid Programming Models
Two or more of the previous models are used in the same program
Common examples:
  POSIX Threads and Message Passing (MPI)
  OpenMP and MPI; also ClusterOpenMP (Intel)
  These combinations work well on a network of SMP machines
Also used:
  Data Parallel and MPI

Designing Parallel Programs
There are no real parallelizing compilers:
  The compiler knows how to parallelize certain constructs (e.g., loops)
  The compiler uses directives from the programmer
It is not simply a matter of taking a sequential algorithm and making it parallel; sometimes a completely different algorithmic approach is necessary
It is a very time-consuming and labor-intensive task

Parallelization Techniques
Domain decomposition:
  The data is partitioned; each task works on a different part of the data
  There are three different ways to partition the data (see the sketch below)

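The slide's partitioning figure is not reproduced here; as an assumption, the three standard schemes are block, cyclic, and block-cyclic. A sketch of the first two in C (t is this task's id, ntasks the number of tasks, work() a hypothetical per-element routine):

    void work(double x);   /* hypothetical per-element routine */

    /* Block: task t gets one contiguous chunk of data[0..n-1]. */
    void block_partition(double *data, int n, int t, int ntasks)
    {
        int chunk = (n + ntasks - 1) / ntasks;   /* ceiling division */
        int lo = t * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            work(data[i]);
    }

    /* Cyclic: task t gets every ntasks-th element. */
    void cyclic_partition(double *data, int n, int t, int ntasks)
    {
        for (int i = t; i < n; i += ntasks)
            work(data[i]);
    }

Block-cyclic deals out fixed-size blocks round-robin, combining the locality of block with the load balance of cyclic.
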
Parallelization Techniques
Functional decomposition:
  The problem is partitioned into a set of independent tasks
Both types of decomposition can be, and often are, combined

A little Theory
Some problems can be parallelized very well. In complexity theory, the class NC ("Nick's Class") is the set of decision problems decidable in poly-logarithmic time on a parallel computer with a polynomial number of processors. In other words, a problem is in NC if there are constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors.

Source: http://en2.wikipedia.org/wiki/Class_NC

A little Theory
Some problems can't be parallelized at all! Example: calculating the Fibonacci sequence (1, 1, 2, 3, 5, 8, 13, 21, ...) by using the formula

    F(1) = 1
    F(2) = 1
    F(k+2) = F(k+1) + F(k)

The calculation entails dependent calculations: the value at k+2 uses those at both k+1 and k. These three terms cannot be calculated independently and therefore cannot be parallelized.

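To see why, a small C sketch (assuming n >= 2); the loop-carried dependence forces the iterations to run in order:

    /* Iteration k+2 reads the results of iterations k+1 and k,
       a loop-carried dependence: the iterations cannot run in parallel. */
    long fib(int n)
    {
        long F[n + 1];                 /* F[1..n], 1-based as in the slide */
        F[1] = 1;
        F[2] = 1;
        for (int k = 1; k + 2 <= n; k++)
            F[k + 2] = F[k + 1] + F[k];
        return F[n];
    }
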
Communication
Decomposed problems typically need to communicate:
  Partial results need to be combined
  Changes to neighboring data have effects on a task's data
Some problems don't need communication at all: embarrassingly parallel problems

Cost of Communication
Communicating data takes time:
  Inter-task communication has overhead
  Often synchronization is necessary
Communication is much more expensive than computation:
  Communicating data needs to save a lot of computation before it pays off
  Infiniband needs < 10 µs to set up a communication
  A 2.4 GHz AMD Opteron CPU needs ~0.4 ns to perform one floating point operation (flop)
  That is about 25,000 floating point operations per communication setup!

Latency - Bandwidth
Latency: the amount of time for the first bit of data to arrive at the other end
Bandwidth: how much data fits through per time unit

Cost of Communication
Formula for the time needed to transmit data:

    cost = L + N/B

where L = latency [s], N = number of bytes [byte], B = bandwidth [byte/s], and cost is in seconds.

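A worked example under assumed Infiniband-like numbers from the previous slides (L = 10 µs, B = 900 MB/s):

    /* cost = L + N/B, all in consistent units (seconds, bytes). */
    double transfer_time(double L, double N, double B)
    {
        return L + N / B;
    }

    /* transfer_time(10e-6, 1e6, 900e6) ≈ 0.00112 s ≈ 1.12 ms:
       a 1 MB message is bandwidth-dominated.
       transfer_time(10e-6, 8, 900e6)   ≈ 10.01e-6 s:
       an 8-byte message is almost pure latency. */
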
Visibility of Communication
With MPI, communication is explicit and very visible
Latency hiding: communicate and, at the same time, do some other computation
  Implemented via parallel threads or non-blocking MPI communication functions (see the sketch below)
  Makes programs faster but more complex

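A minimal latency-hiding sketch with non-blocking MPI (names like halo and neighbor are hypothetical):

    #include <mpi.h>

    void exchange_and_compute(double *halo, int n, int neighbor,
                              double *local, int m)
    {
        MPI_Request req;

        /* Start a non-blocking receive of the neighbor's data. */
        MPI_Irecv(halo, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

        /* Meanwhile, compute on data that does not depend on it. */
        for (int i = 0; i < m; i++)
            local[i] *= 2.0;          /* stand-in for real work */

        /* Block only when the incoming data is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
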
Scope of Communication
Knowing which tasks must communicate with each other is critical during the design stage of a parallel program:
  Point-to-point: involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer
  Collective: involves data sharing between more than two tasks, which are often specified as being members of a common group, or collective

Communication Hardware
Interconnect architectures (approximate figures):

  Myrinet (http://www.myricom.com/): proprietary but commodity; sustained one-way bandwidth for large messages ~1.2 GB/s; latency for short messages ~3 µs
  Infiniband (http://www.infinibandta.org/): vendor-independent standard; ~900 MB/s (4x HCAs); ~10 µs
  Quadrics QsNet (http://www.quadrics.com/): expensive, proprietary; ~900 MB/s; ~2 µs
  Gigabit Ethernet: commodity; ~100 MB/s; ~60 µs
  Custom interconnects: SGI, IBM, Cray, Sun, Compaq, ...

Communication Hardware
Measured point-to-point performance:

  Mellanox MHGA28 (InfiniBand): latency 2.25 µs; peak bandwidth 1502 MB/s; N/2 = 512 bytes (750 MB/s); CPU overhead ~5%
  QLogic InfiniPath HT (InfiniBand): latency 1.3 µs; peak bandwidth 954 MB/s; N/2 = 385 bytes (470 MB/s); CPU overhead ~40%
  Myrinet F (proprietary): latency 2.6 µs; peak bandwidth 493 MB/s; N/2 = 2000 bytes (250 MB/s); CPU overhead ~10%
  Myrinet 10G (proprietary): latency 2.0 µs; peak bandwidth 1200 MB/s; N/2 = 2000 bytes (600 MB/s); CPU overhead ~10%
  Quadrics QM500 (proprietary): latency 1.6 µs; peak bandwidth 910 MB/s; N/2 = 1000 bytes (450 MB/s); CPU overhead ~50%
  Gigabit Ethernet: latency 30-100 µs; peak bandwidth 125 MB/s; N/2 = 8000 bytes (60 MB/s); CPU overhead >50%
  Chelsio T210-CX (10 GigE): latency 9.6 µs; peak bandwidth 860 MB/s; N/2 = 100,000 bytes (430 MB/s); CPU overhead ~50%

N/2 is the message size needed to achieve half the peak bandwidth.
Sources: Mellanox Technology testing; Ohio State University; PathScale, Myricom, Quadrics, and Chelsio websites; http://www.mellanox.com/applications/performance_benchmarks.php

Synchronization
Synchronization is handshaking between tasks that are sharing data. Types of synchronization:
Barrier
  Usually implies that all tasks are involved
  Each task performs its work until it reaches the barrier; it then stops, or "blocks"
  When the last task reaches the barrier, all tasks are synchronized
  Used in MPI (see the sketch below)

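A barrier sketch in MPI/C (do_local_work() and use_all_results() are hypothetical):

    #include <mpi.h>

    void do_local_work(void);
    void use_all_results(void);

    void phase_boundary(void)
    {
        do_local_work();              /* each task computes its part    */
        MPI_Barrier(MPI_COMM_WORLD);  /* block until every task arrives */
        use_all_results();            /* safe: all tasks have finished  */
    }
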
Synchronization
More types: Lock/Semaphore
  Can involve any number of tasks
  Typically used to serialize (protect) access to global data or a section of code; only one task at a time may use (own) the lock/semaphore/flag
  The first task to acquire the lock "sets" it; this task can then safely (serially) access the protected data or code
  Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it
  Can be blocking or non-blocking
  Used with threads and shared memory (see the sketch below)

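A lock sketch with POSIX threads in C; pthread_mutex_lock blocks, pthread_mutex_trylock is the non-blocking variant:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;          /* shared (global) data */

    void *worker(void *arg)
    {
        /* Only the task owning the lock may touch the counter,
           so the update is serialized. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
        return NULL;
    }
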
Synchronization
More types: Synchronous communication operations
  Involve only those tasks executing a communication operation
  When a task performs a communication operation, some form of coordination is required with the other task(s) participating in the communication. For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send.

Granularity
A qualitative measure of the computation/communication ratio
Typically, periods of computation are separated from periods of communication by synchronization events
Fine-grain parallelism: a small amount of computation between communications
Coarse-grain parallelism: a large amount of computation between communications

Granularity
Fine-grain:
  Low computation-to-communication ratio
  Facilitates load balancing
  High communication overhead; less opportunity for performance enhancement
Coarse-grain:
  High computation-to-communication ratio
  More opportunity for performance increase
  Harder to load balance efficiently

Data In- and Output
Parallel computers with thousands of nodes can handle huge amounts of data
It is hard to get this data in and out of the nodes:
  Parallel I/O systems are still fairly new and not available for all platforms
  I/O over the network (like NFS) causes severe bottlenecks
Help can be found in parallel file systems: Lustre, PVFS2, GPFS (IBM)
MPI-2 provides support for parallel file systems (see the sketch below)
Rule #1: Reduce overall I/O as much as possible!

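A sketch of MPI-2 parallel I/O in C; each task writes its own block of one shared file at a rank-dependent offset (the file name is hypothetical):

    #include <mpi.h>

    void write_results(double *buf, int count)
    {
        int rank;
        MPI_File fh;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, "results.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * count * sizeof(double),
                          buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }
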
Efficiency
Speedup:

    S_p = T_s / T_p

where T_s is the running time of the serial algorithm and T_p the running time on p processors.

Efficiency:

    Efficiency = S_p / p

  A value between zero and one
  Estimates how well-utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization
  Linear speedup, and algorithms running on a single processor, have an efficiency of 1
  Many difficult-to-parallelize algorithms have an efficiency such as 1/log p, which approaches zero as the number of processors increases

Limits and Costs
Besides theoretical limits and hardware limits, there are practical limits to parallel computing.
Amdahl's Law states that the potential program speedup is determined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 - P)

If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is infinite (in theory). If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast.

Limits and Costs
Introducing the number of processors performingthe parallel fraction of work, Amdahl's Law can bereformulated as
speedup=1
P
NS
N = number of processors,
P = parallel fraction andS=1-P = serial fraction
Speedup
http://upload.w
ikimedia.org/wikipedia/en
/7/7a/Amdahl-law.jpg
N P=0.50 P=0.90 P=0.99 P=1.0
10 1.82 5.26 9.17 10100 1.98 9.17 50.25 100
1000 1.99 9.91 90.99 1000
10000 1.99 9.99 99.02 10000
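The table can be reproduced with a few lines of C (a sketch, not from the slides):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (P/N + S) with S = 1 - P. */
    static double amdahl(double P, double N)
    {
        return 1.0 / (P / N + (1.0 - P));
    }

    int main(void)
    {
        const double Ps[] = { 0.50, 0.90, 0.99, 1.0 };
        for (int n = 10; n <= 10000; n *= 10) {
            printf("%6d", n);
            for (int i = 0; i < 4; i++)
                printf(" %9.2f", amdahl(Ps[i], n));
            printf("\n");
        }
        return 0;
    }
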
Typical Parallel Applications
Applications that are well suited for parallel computers include:
  Weather and ocean patterns
  Finite Element Method (FEM; e.g., crash tests for cars)
  Fluid dynamics, aerodynamics
  Simulation of electromagnetic problems

Summary
Overview of parallel computing concepts:
  Hardware
  Software
  Programming
Problems of parallel computing:
  Communication is expensive (latency)
  I/O is expensive
Techniques to work around these problems:
  Problem decomposition (communicate larger chunks of data, less often)
  Parallel file systems plus supporting hardware
  $$$$ (a faster communication fabric)

Acknowledgment/References
Most of this talk is taken from http://www.llnl.gov/computing/tutorials/parallel_comp/
Theory book: Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes by F. Thomson Leighton
Hardware book: Computer Architecture: A Quantitative Approach (3rd edition) by John L. Hennessy, David A. Patterson, and David Goldberg
http://www.top500.org/