Parallel Algorithms
January 2012
Kenichi Miura
Classification of Computational Models (Miura 1980)

• Continuum Model - Fluid Model (Eulerian View) - Discretization of PDEs
• Particle Model - Many-body Problems (Lagrangian View) - Discretization of ODEs (e.g., Newton's Equations)
• Structural Model - Discrete Model - Sparse Matrix Formulation
• Mathematical Transform - Fourier Transform - Linear Algebra
Phillip Colella's "Seven Dwarfs" (2004)

High-end simulation in the physical sciences consists of seven algorithms:
1. Structured Grids (including locally structured grids, e.g., AMR)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

Well-defined targets from an algorithmic and software standpoint.
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004
The "Thirteen Dwarfs" (2006)

The Berkeley "View" report (2006) extended Colella's seven dwarfs to thirteen algorithm classes:
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods (Fast Fourier Transform)
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce (including Monte Carlo Methods)
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
Steps in Conducting Simulations

• Physical phenomena
• Modeling and mathematical formulation
• Algorithm selection/development
• Programming
• Run on a hardware platform
• Verification of the results
Flynn's Taxonomy (Mike Flynn, 1967)

• SISD (Sequential Processing)
• SIMD (Lock-step, Data Parallel)
• MIMD (Control Parallel)
• SPMD (a variation of MIMD; the Data Parallel model on a MIMD machine)
Parallelism from the Viewpoint of Computational Models and Parallelism Description
Parallel Programming - A Necessary Evil? -

• Ideally one would obtain performance (i.e., shorter wall-clock time) without changing the code at all ---- an automatic parallelizing compiler.
• Why doesn't this work so easily in many cases?
  - The computational algorithm is inherently serial (examples: recursive formulations, many branches)
  - The algorithm may be parallelizable, but the actual implementation of the code is NOT
  - The data structure employed is not suitable for parallel processing (examples: stack vs. FIFO, array vs. linked list)
Bottlenecks in Parallel Processing
• Overhead in creating and finalizing tasks
• Overhead in synchronization
• A significant non-parallel fraction of the code (Amdahl's Law)
• Overhead in data transfer (latency, bandwidth) for distributed-memory architectures
• Memory contention for shared-memory architectures
• Lack of load balancing
Parallel Processing and Amdahl’s Law
Can a program be run faster in proportion to the number of processors?
• Synchronous parallel processing model: Barrier Model, Amdahl (1967)
• Asynchronous parallel processing model: Critical Section Model, Miura (1991)
Synchronous Parallel Processing Model
Barrier Model Gene Amdahl (1967)
Sp(n) = 1/((1 − α) + α/n)

where α is the fraction of the work that can run in parallel and 1 − α is the serial fraction.

[Photo: Dr. G. M. Amdahl (2008)]
Amdahl’s Law
Sp(n) = 1/(1−α + α/n)
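The formula can be evaluated directly; a minimal Python sketch (the function name and sample values are illustrative, not from the slides):

```python
def amdahl_speedup(alpha, n):
    """Amdahl's Law: Sp(n) = 1 / ((1 - alpha) + alpha/n), where alpha is
    the parallel fraction of the work and n is the number of processors."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

# Even with 95% of the work parallel, the serial 5% caps the speed-up
# below 20 no matter how many processors are added:
s = amdahl_speedup(0.95, 1024)
```

Note how quickly the serial fraction dominates: the limit as n grows is 1/(1 − α).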
Asynchronous Parallel Processing Model
Critical Section Model Miura (1991)
[Diagram: multiple processors P feeding a queue into a single shared critical section C]
Queuing Model (M/M/1)
Asynchronous Case (Miura)
Load Imbalance Model
[Diagram: tasks T1, T2, ……, Tn of unequal length running on n processors]

T = Σ Ti (i = 1, ......, n)
Sp(n) = T / max(Ti) ≤ n
The speed-up factor equals n only if T1 = T2 = ・・・・ = Tn.
Synchronization Overhead Model

Load balance is assumed.
Sp(n) = T(1)/T(n) = 1/(1/n + εn) = n/(1 + εn^2)           (linear overhead)
Sp(n) = T(1)/T(n) = 1/(1/n + ε log2 n) = n/(1 + εn log2 n)  (logarithmic overhead)
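The two overhead models can be tabulated directly; a short Python sketch (ε is a free parameter, the value 0.001 below mirrors the plots that follow):

```python
import math

def speedup_linear(n, eps):
    # Sp(n) = 1/(1/n + eps*n) = n/(1 + eps*n**2); peaks near n = 1/sqrt(eps)
    return n / (1.0 + eps * n * n)

def speedup_log(n, eps):
    # Sp(n) = 1/(1/n + eps*log2(n)) = n/(1 + eps*n*log2(n)); degrades slowly
    return n / (1.0 + eps * n * math.log2(n))
```

With ε = 0.001 the linear-overhead speed-up peaks around n ≈ 32 and then falls, while the logarithmic model keeps growing sublinearly.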
[Plots: speed-up vs. n for ε = 0.001; the linear-overhead curve peaks and then falls as n grows, while the logarithmic-overhead curve continues to grow sublinearly]
Hockney’s Performance Model (n1/2 Model)
T(n) = T_OH + n·τ    (total time = overhead + execution time)

or, equivalently, in terms of performance P(n) = n/T(n):

P(n) = R∞ · n / (n + n_1/2)

where R∞ is the peak performance (1/τ) and n_1/2 is the value of n that gives half of peak performance (n_1/2 = T_OH/τ).

Performance Model for Vector and Parallel Supercomputers

[Plots: time T(n) vs. n with intercept T_OH; performance P(n) vs. n approaching R∞, reaching R∞/2 at n = n_1/2]

Note: here n refers to the problem size, not the number of processors.
SIMD Computational Model ( Data Parallel)
• Simple parallelism (vector, matrix): ci = ai + bi, C = A + B
• Reduction: s = a1 + a2 + a3 + a4 + …
• Broadcast: ai = s
• Shift/Rotate: ai = bi−k
• Recurrence: ai = ai−1 + bi ← Problem!

Note: Vector processing is also included in this category.
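The operation classes above can be illustrated with ordinary Python lists standing in for vector registers (a sketch only; on real SIMD hardware each comprehension would execute in lock-step across lanes):

```python
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# Simple parallelism: every c_i = a_i + b_i is independent
c = [ai + bi for ai, bi in zip(a, b)]

# Reduction: s = a_1 + ... + a_n (tree-summable in log2 n steps)
s = sum(a)

# Broadcast: a_i = s for all i
a_bcast = [s] * len(a)

# Shift: a_i = b_{i-k} (here k = 1, with zero fill at the boundary)
k = 1
a_shift = [0.0] * k + b[:-k]

# Recurrence: a_i = a_{i-1} + b_i -- each step depends on the previous
# one, so it cannot be parallelized naively (the "Problem!" above)
acc = 0.0
prefix = []
for bi in b:
    acc += bi
    prefix.append(acc)
```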
SIMD Computational Model and Vector/Parallel Processing

- Both vectorization and parallelization detect identical but independently executable arithmetic operations
- Vectorization: search from the innermost loop outward
- Parallelization: search from the outermost loop inward
- The two coincide when partitioning data in simple loops or innermost loops
Examples where modification of algorithm is necessary (1)
- Simple Recurrence -
Suitable for serial computing (data locality, better utilization of memory, etc.):

    Ai = ki Ai−1   ⇒   Ai = ki ki−1 …… k3 k2 k1 A0

The product k1 k2 k3 k4 k5 k6 k7 k8 can be formed in log2 n parallel steps by recursive doubling (combine adjacent pairs, then pairs of pairs, and so on).
Examples where modification of algorithm is necessary (2) - Linear Recurrence -

The linear recurrence ai = ki ai−1 + bi can be written in matrix form:

    ( ai )   ( ki  bi ) ( ai−1 )
    (    ) = (        ) (      )
    ( 1  )   ( 0   1  ) ( 1    )

so that, with Mi denoting the 2x2 matrix above,

    ( an )                     ( a0 )
    (    ) = Mn Mn−1 ….. M2 M1 (    )
    ( 1  )                     ( 1  )

and the product M1 M2 M3 M4 M5 M6 M7 M8 … can again be formed by recursive doubling.
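The matrix formulation can be checked numerically; a Python sketch (function names and sample coefficients are illustrative). The product is accumulated left to right here, but since matrix multiplication is associative the Mi could equally be combined pairwise by recursive doubling:

```python
def matmul2(A, B):
    # 2x2 matrix product
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def recurrence_direct(ks, bs, a0):
    # Serial evaluation of a_i = k_i * a_{i-1} + b_i
    a = a0
    for k, b in zip(ks, bs):
        a = k * a + b
    return a

def recurrence_matrix(ks, bs, a0):
    # (a_i, 1)^T = M_i (a_{i-1}, 1)^T with M_i = [[k_i, b_i], [0, 1]];
    # a_n is read off the accumulated product M_n ... M_1
    P = [[1, 0], [0, 1]]
    for k, b in zip(ks, bs):
        P = matmul2([[k, b], [0, 1]], P)
    return P[0][0] * a0 + P[0][1]

ks = [2, 3, 1, 4]
bs = [1, 1, 5, 2]
```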
Examples of Recurrence Formulas for Iterative Methods

One-dimensional case:

    Ki−1 Ci−1 + Ki Ci + Ki+1 Ci+1 = di

Two-dimensional case:

    Ki−1,j−1 Ci−1,j−1 + Ki−1,j+1 Ci−1,j+1 + Ki,j Ci,j + Ki+1,j−1 Ci+1,j−1 + Ki+1,j+1 Ci+1,j+1 = di,j
Cyclic Reduction for Tridiagonal Equations
    | b1 c1          |   | x1 |   | k1 |
    | a2 b2 c2       |   | x2 |   | k2 |
    |    a3 b3 c3    | * | x3 | = | k3 |
    |       .  .  .  |   | .  |   | .  |

(tridiagonal system; entries outside the three diagonals are 0)
Cyclic Reduction (Serial)

Cyclic Reduction (Parallel)
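A serial Python sketch of cyclic reduction for a tridiagonal system with n = 2^k − 1 unknowns (variable names are mine; in a parallel code every update within one level is independent and would run concurrently):

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic
    reduction; requires n = 2**k - 1, a[0] == 0, c[-1] == 0."""
    n = len(b)
    k = (n + 1).bit_length() - 1
    assert n == 2**k - 1, "cyclic reduction sketch needs n = 2**k - 1"
    a, b, c, d = list(a), list(b), list(c), list(d)
    # Forward reduction: each level eliminates every other unknown;
    # the updates at one level are mutually independent (parallelizable)
    for level in range(k - 1):
        half = 2**level
        step = 2 * half
        for i in range(step - 1, n, step):
            alpha = -a[i] / b[i - half]
            beta = -c[i] / b[i + half]
            b[i] += alpha * c[i - half] + beta * a[i + half]
            d[i] += alpha * d[i - half] + beta * d[i + half]
            a[i] = alpha * a[i - half]
            c[i] = beta * c[i + half]
    # Back substitution, again with independent updates at each level
    x = [0.0] * n
    mid = (n - 1) // 2
    x[mid] = d[mid] / b[mid]
    for level in range(k - 2, -1, -1):
        half = 2**level
        step = 2 * half
        for i in range(half - 1, n, step):
            left = x[i - half] if i - half >= 0 else 0.0
            right = x[i + half] if i + half < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
    return x
```

For the discrete Laplacian (b = 2, a = c = −1) with right-hand side chosen so the exact answer is x = 1, 2, …, 7, the routine recovers it to rounding error.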
Random Number Generation Algorithms

(1) Linear Congruential Method: Xn = (a Xn−1 + c) mod M
    Multiplicative (c = 0):  period = 2^(j−2)
    Mixed (c ≠ 0):           period = 2^j
    where M = 2^j (usually the machine word size)
(2) Binary M-sequence with a primitive trinomial:
    Xn = (Xn−m xor Xn−k) mod 2  →  period = 2^k − 1  (m < k)
(3) Generalized Fibonacci method with a primitive trinomial:
    Xn = (Xn−m op Xn−k) mod M  →  period = (2^k − 1)·2^(j−1) ~ 2^(k+j−1)
    where op is {+, −, *}
(4) Generalized recurrence method with a large prime modulus (Multiple Recursive Generator, or MRG):
    Xn = (a1 Xn−1 + a2 Xn−2 + a3 Xn−3 + ….. + ak Xn−k) mod p
    →  period = p^k − 1 ~ 2^(j·k), where p ~ 2^j is a prime (e.g., 2^31 − 1)
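The periods quoted in (1) can be verified by brute force for a tiny word size; an illustrative Python sketch (the parameters a = 5, c = 3, M = 16 are mine, chosen to satisfy the full-period conditions):

```python
def lcg_period(a, c, M, x0=1):
    """Count steps until the state of X_n = (a*X_{n-1} + c) mod M
    returns to the seed x0 (valid when the orbit is a single cycle)."""
    x, steps = (a * x0 + c) % M, 1
    while x != x0:
        x = (a * x + c) % M
        steps += 1
    return steps

# Mixed LCG (c != 0), M = 2**j: full period 2**j when c is odd
# and a % 4 == 1 (the Hull-Dobell conditions for M a power of two);
# multiplicative LCG (c = 0) with odd seed reaches only 2**(j-2).
```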
An Example of an MRG with an 8th-order Full Primitive Polynomial

Consider p = 2^31 − 1 and k = 8. The characteristic polynomial is

    f(x) = (x^8 − a1 x^7 − a2 x^6 − a3 x^5 − a4 x^4 − a5 x^3 − a6 x^2 − a7 x − a8) mod p

- Good lattice structure with full coefficients
- Simple and fast implementation with 64-bit arithmetic, when the modulus is p = 2^31 − 1
- Long period: (2^31 − 1)^8 − 1 ~ 4.5 × 10^74
- Fraction of primitive polynomials: φ(p^k − 1)/k/(p^k − 1) = 2.2%
- Easy to extend for vector/parallel processing
Vectorization and Parallelization of MRG - Decimating the Sequence -

Transfer matrix A (the k = 8 companion matrix) and state vector x:

        | a1 a2 a3 a4 a5 a6 a7 a8 |          | xn−1 |
        | 1  0  0  0  0  0  0  0  |          | xn−2 |
        | 0  1  0  0  0  0  0  0  |          | xn−3 |
    A = | 0  0  1  0  0  0  0  0  | ,    x = | xn−4 |
        | 0  0  0  1  0  0  0  0  |          | xn−5 |
        | 0  0  0  0  1  0  0  0  |          | xn−6 |
        | 0  0  0  0  0  1  0  0  |          | xn−7 |
        | 0  0  0  0  0  0  1  0  |          | xn−8 |

Then the advanced state x' = (xn, xn−1, ….., xn−7)^T satisfies x' = A x mod p.

Compute A^2, A^4, A^8, ………. mod p once; the new (decimated) polynomial is obtained by multiplying the matrices.
Vectorization and Parallelization of MRG (Continued)

To compute xn = A^n x0 mod p:
(1) Store Aj = A^(2^j) mod p  (j = 0, 1, 2, 3, 4, ……).
(2) Represent n in binary form, e.g., (bm−1, ….., b2, b1, b0).
(3) Starting from I, the k-th order identity matrix, multiply by Aj mod p whenever bj = 1 (j = 0, …., m−1).

Note: The same strategy works with polynomial arithmetic (Knuth).
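Steps (1)-(3) amount to binary exponentiation of the transfer matrix; a Python sketch for generic k (the k = 2 Fibonacci-style recurrence used as the check is my example, not the lecture's 8th-order MRG):

```python
def mat_mult(A, B, p):
    # k x k matrix product mod p
    k = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) % p for j in range(k)]
            for i in range(k)]

def mat_pow(A, n, p):
    """A**n mod p by repeated squaring: scan the binary digits of n,
    multiplying in the stored powers A, A^2, A^4, ... when the bit is 1."""
    k = len(A)
    R = [[1 if i == j else 0 for j in range(k)] for i in range(k)]  # identity
    while n:
        if n & 1:
            R = mat_mult(R, A, p)
        A = mat_mult(A, A, p)
        n >>= 1
    return R

# k = 2 companion matrix of x_n = (x_{n-1} + x_{n-2}) mod p: its powers
# contain Fibonacci numbers, A^n = [[F_{n+1}, F_n], [F_n, F_{n-1}]]
A = [[1, 1], [1, 0]]
R = mat_pow(A, 10, 1000)
```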
SPMD Computational Model - Application of Data Parallel to MIMD Architecture -

• A "SINGLE" program runs on all processors
• Multiple instruction streams at execution time
• Each processor takes care of a portion of the large-sized data; the behavior is similar to SIMD, but locally independent operations (say, conditional branches) are allowed
• Necessitates various synchronization mechanisms and their description
MIMD Computational Model - Control Parallel or Task Parallel -

• Master-slave type operations
• Fork/Join constructs
• Synchronization mechanisms and their description (barrier, semaphore, lock/unlock)
• Data transfer in the case of distributed memory (send/receive protocol)

Examples: event-parallel transport Monte Carlo simulation, ray tracing
Parallel Prefix
Source: L. Snyder, "Parallel Programming"
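A data-parallel (Hillis-Steele style) formulation of parallel prefix, sketched serially in Python; this is one standard scheme for the operation, not necessarily the exact one in Snyder's text:

```python
def parallel_prefix(xs, op):
    """Inclusive prefix (scan) in ceil(log2 n) data-parallel steps:
    at step with offset d, element i combines with element i - d.
    All combines within one step are independent of each other."""
    n = len(xs)
    ys = list(xs)
    d = 1
    while d < n:
        ys = [op(ys[i - d], ys[i]) if i >= d else ys[i] for i in range(n)]
        d *= 2
    return ys
```

The operator only needs to be associative, so the same routine computes running sums, running maxima, etc.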
Three Models of Dense Matrix Multiplications
• Inner Product (i-j-k)
    Do i = 1, n
      Do j = 1, n
        Do k = 1, n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        enddo
      enddo
    enddo

    Cij = Σk Aik * Bkj

• Middle Product (j-k-i)
    Do j = 1, n
      Do k = 1, n
        Do i = 1, n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        enddo
      enddo
    enddo

• Outer Product (k-j-i)
    Do k = 1, n
      Do j = 1, n
        Do i = 1, n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        enddo
      enddo
    enddo
Three Models of Dense Matrix Multiplications (1)
• Inner Product
    Do i = 1, n
      Do j = 1, n
        sum = 0
        Do k = 1, n
          sum = sum + A(i,k)*B(k,j)
        enddo
        C(i,j) = sum
      enddo
    enddo

[Diagram: each Cij is the dot product of row i of A and column j of B]

    Cij = Σk Aik * Bkj
Three Models of Dense Matrix Multiplications (2)
• Middle Product
    Do j = 1, n
      Do k = 1, n
        Do i = 1, n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        enddo
      enddo
    enddo

[Diagram: each column of C is accumulated from scaled columns of A]

    Cij = Σk Aik * Bkj
Three Models of Dense Matrix Multiplications(3)
• Outer Product
    Do k = 1, n
      Do j = 1, n
        Do i = 1, n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        enddo
      enddo
    enddo

[Diagram: C is accumulated as a sum of rank-1 (outer-product) updates]

    Cij = Σk Aik * Bkj
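All three orderings compute the same product and differ only in memory-access pattern; a Python transcription of the slides' loops (list-of-lists matrices, for illustration only):

```python
def mm_inner(A, B):
    # i-j-k: C[i][j] is the dot product of row i of A and column j of B
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def mm_middle(A, B):
    # j-k-i: each column of C is built from scaled columns of A
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for i in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def mm_outer(A, B):
    # k-j-i: C accumulates one rank-1 (outer-product) update per k
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            for i in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
```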
Strassen’s Algorithm for Matrix Multiply
• Multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing.
• The conventional standard algorithm for n x n matrices requires O(n^3) operations.
• Strassen's algorithm, introduced in 1969, requires O(n^log2(7)) ≈ O(n^2.807) operations.
• It is another divide-and-conquer approach.

References:
1. V. Strassen, "Gaussian Elimination is Not Optimal," Numerische Mathematik, 13:354-356, 1969.
2. S. Huss-Lederman et al., "Implementation of Strassen's Algorithm for Matrix Multiplication," SC96 Technical Paper.
Conventional Matrix Multiply
A, B, C are n by n matrices, C = A * B.

No. of scalar multiplications = n^3
No. of scalar additions = n^3 - n^2
Total arithmetic operations = 2n^3 - n^2
Conventional Matrix Multiply (with submatrices)
    | C11 C12 |   | A11 A12 |   | B11 B12 |
    |         | = |         | * |         |
    | C21 C22 |   | A21 A22 |   | B21 B22 |
C11 = A11 B11 + A12 B21 C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21 C22 = A21 B12 + A22 B22
No. of (n/2) x (n/2) matrix multiplications = 8
No. of (n/2) x (n/2) matrix additions = 4
Total arithmetic operations = 8(2(n/2)^3 - (n/2)^2) + 4(n/2)^2 = 2n^3 - n^2
Strassen’s Algorithm - 1/3
• Strassen’s method has fewer multiply and offsets by more additions and subtractions.
• For each pair of sub-matrices, there are 7 multiplications and 18 additions/subtractions.
• Among these operations, 7 multiplications and 10 additions/subtractions are in steps of calculating P’s.
= * C11 C21
C12 C22
A11 A21
A12 A22
B11 B21
B12 B22
C11 = P1 + P4 - P5 + P7 C12 = P3 + P5 C21 = P2 + P4 C22 = P1 + P3 - P2 + P6
P1 = ( A11 + A22 )( B11 + B22 ) P5 = (A11 + A12 ) B22 P2 = ( A21 + A22 ) B11 P6 = (A21 - A11 )( B11 + B12 ) P3 = A11 ( B12 - B22 ) P7 = (A12 - A22 )( B21 + B22 ) P4 = A22 ( B21 - B11 )
Strassen’s Algorithm - 2/3
• For 2 x 2 block matrices (i.e., multiplying 2n x 2n matrices partitioned into n x n blocks), the operation counts are:

                   Mult   Add/Sub   Total operations
    Conventional    8        4       16n^3 - 4n^2
    Strassen        7       18       14n^3 + 11n^2

• One matrix multiplication is replaced by 14 matrix additions.
Strassen’s Algorithm - 3/3 • In one level of Strassen’s algorithm applied to 2 x 2
matrices with elements of (n/2 )x (n/2) blocks and the conventional algorithm is used for the seven block matrix multiplications, the total number of operation count is
7(2(n/2)3 - (n/2)2) + 18(n/2)2 = (7/4)n3 + (11/4)n2
R = Strassen Operation Count
Conventional Operation Count = 7n3 + 11n2
8n3 - 4n2
lim R = 7/8 n→∞ 12.5% improvement!
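One level of the recursion can be checked against the definition; a Python sketch for an n x n matrix with n even (the helper names are mine, and the seven block products use the conventional triple loop, exactly as in the operation count above):

```python
def strassen_one_level(A, B):
    """One level of Strassen on an n x n matrix split into 2 x 2 blocks."""
    n = len(A)
    h = n // 2
    def blk(M, r, c):  # extract an h x h block
        return [row[c*h:(c+1)*h] for row in M[r*h:(r+1)*h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def mul(X, Y):  # conventional block multiply
        m = len(X)
        Z = [[0.0] * m for _ in range(m)]
        for i in range(m):
            for k in range(m):
                for j in range(m):
                    Z[i][j] += X[i][k] * Y[k][j]
        return Z
    A11, A12, A21, A22 = blk(A,0,0), blk(A,0,1), blk(A,1,0), blk(A,1,1)
    B11, B12, B21, B22 = blk(B,0,0), blk(B,0,1), blk(B,1,0), blk(B,1,1)
    P1 = mul(add(A11, A22), add(B11, B22))
    P2 = mul(add(A21, A22), B11)
    P3 = mul(A11, sub(B12, B22))
    P4 = mul(A22, sub(B21, B11))
    P5 = mul(add(A11, A12), B22)
    P6 = mul(sub(A21, A11), add(B11, B12))
    P7 = mul(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(P1, P4), P5), P7)
    C12 = add(P3, P5)
    C21 = add(P2, P4)
    C22 = add(sub(add(P1, P3), P2), P6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])

# Quick check: multiplying by the identity must return A itself
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = strassen_one_level(A, I4)
```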
Note: code can be accessed from: http://www-unix.mcs.anl.gov/prism/lib/software.html
Winograd’s Variant of Strassen’s Algorithm
• Based on Strassen’s algorithm, Winograd reduced 3 of additions/subtractions by rearranging the order of calculation into 4 stages.
S1 = A21 + A22 T1 = B12 - B11 P1 = A11 B11 U1 = P1 + P2 S2 = S1 - A11 T2 = B22 - T1 P2 = A12 B21 U2 = P1 + P4 S3 = A11 - A21 T3 = B22 - B12 P3= S1 T1 U3 = U2 + P5 S4 = A12 - S2 T4 = B21 - T2 P4 = S2 T2 U4 = U3 + P7 P5 = S3 T3 U5 = U3 + P3 P6 = S4 B22 U6 = U2 + P3 P7 = A22 T4 U7 = U6 + P6
Stage 1 Stage 2 Stage 3 Stage 4
C11 = U1 , C12 = U7 , C21 = U4 , C22 = U5
7 multipy, 15 add./sub.
( Takes 7k multiplications and 5(7k – 4k) additions to Multiply 2k x 2k matirices.)
Usage of Strassen’s Algorithm
• Usage of Strassen’s algorithm is limited by – Additional memory is required to store matrices P’s. – More memory traffic are necessary, memory bandwidth plays a key role.
• Loss of significance in Strassen’s algorithm: – Caused by adding relatively large and very small numbers.
Performance Example (over Cray Library MXM on Cray2 ) n=64 => x 1.35 n=2048 => x 2.01 (1988 by David Bailey)
Fourier Transform and FFT
• Discrete Fourier Transform (DFT)

    Zi = (1/n) Σk ω^(ik) Xk,   where ω = exp(-2πj/n)

• The FFT reduces the operation count from O(n^2) to O(n log n).
Fourier Transform and FFT
• Butterfly operation (DIT: Decimation in Time)

    Z = X + αn^k * Y
    W = X - αn^k * Y

    where αn = exp(2πj/n)

[Diagram: inputs X, Y; Y is multiplied by the twiddle factor αn^k, then added to X to give Z and subtracted from X to give W]
Fourier Transform and FFT
• Butterfly operation (DIF: Decimation in Frequency)

    Z = X + Y
    W = (X - Y) * αn^k

    where αn = exp(2πj/n)

[Diagram: Z is the sum of X and Y; the difference X - Y is multiplied by the twiddle factor αn^k to give W]
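The butterfly is the building block of the whole transform; a recursive radix-2 decimation-in-time FFT in Python, checked against the direct DFT (a sketch with my function names; the 1/n normalization from the DFT slide is omitted, and the sign of the exponent follows the usual forward-transform convention):

```python
import cmath

def dft(x):
    """Direct O(N^2) transform: Z_i = sum_k exp(-2*pi*j*i*k/N) * x_k."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / N) for k in range(N))
            for i in range(N)]

def fft(x):
    """Radix-2 decimation-in-time FFT, O(N log N); N must be a power of 2.
    Each stage applies the butterfly  Z = X + w*Y,  W = X - w*Y."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * N
    for i in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * i / N)   # twiddle factor
        out[i] = even[i] + w * odd[i]           # Z = X + w*Y
        out[i + N // 2] = even[i] - w * odd[i]  # W = X - w*Y
    return out
```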
Number of Operations for Complex Fourier Transform ( N:power of 2)
FFT: per butterfly, add/sub = 6, mult = 4; for N points, total ops = 5N log2 N
DFT (complex matrix-vector multiplication): per point, add/sub = 4N - 2, mult = 4N; for N points, total ops = 8N^2 - 2N = 2N(4N - 1)

If the efficiency of the FFT is 1%, a 512-point DFT (running at full efficiency) is faster than the FFT.
If the efficiency of the FFT is 3%, a 128-point DFT is faster than the FFT.
If the efficiency of the FFT is 5%, a 64-point DFT is faster than the FFT.
Fast Fourier Transform
FFT (Isogeometric)
FFT (Self-sorting variant due to Stockham)
Applications
Concept of Numerical Weather Prediction (Richardson, 1922)
Simulation of Precipitation
Simulation of typhoon
Source: Earth Simulator Center
Simulation of Tsunami (March 11, 2011)
Source: Prof. Imamura, Tohoku Univ.
Seismic Data Processing for Oil Exploration
Application Areas for Petaflops

• Computational testing and simulation as a replacement for weapons testing (stockpile stewardship)
• Simulation of plasma fusion devices and basic physics for controlled fusion (to optimize design of future reactors)
• Design of new chemical compounds and synthesis pathways (environmental safety and cost improvements)
• Comprehensive modeling of groundwater and oil reservoirs (contamination and management)
• Modeling of complex transportation, communication and economic systems
• Time-dependent simulations of complex biomolecules (membranes, synthesis machinery and DNA)
• Multidisciplinary optimization problems combining structures, fluids and geometry
• Modeling of integrated earth systems (ocean, atmosphere, bio-geosphere)
• Improved 4d/6d data assimilation capability applied to remote sensing and environmental models
• Computational cosmology (integration of particle models, astrophysical fluids and radiation transport)
• Materials simulations that bridge the gap between microscale and macroscale (bulk materials)
• Coupled electro-mechanical simulations of nano-scale structures (dynamics and mechanics of micromachines)
• Full plant optimization for complex processes (chemical, manufacturing and assembly problems)
• High-resolution reacting flow problems (combustion, chemical mixing and multiphase flow)
• High-realism immersive virtual reality based on real-time radiosity modeling and complex scenes