Parallel Algorithms and Applications (2012)


Parallel Algorithms

January 2012

Kenichi Miura

Classification of Computational Models (Miura 1980)

• Continuum Model - Fluid Model (Eulerian view) - Discretization of PDEs
• Particle Model - Many-body Problems (Lagrangian view) - Discretization of ODEs (e.g., Newton's equations)
• Structural Model - Discrete Model - Sparse Matrix Formulation
• Mathematical Transform - Fourier Transform - Linear Algebra

High-end simulation in the physical sciences consists of seven algorithms:

1. Structured Grids (including locally structured grids, e.g. AMR)

2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

These are well-defined targets from an algorithmic and software standpoint.
Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004

Phillip Colella’s “Seven Dwarfs” (2004)

High-end simulation in the physical sciences consists of thirteen algorithms:
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods (Fast Fourier Transform)
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce (including Monte Carlo Methods)
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines

The “Thirteen Dwarfs” (2006): the Berkeley extension of Colella’s seven (Asanović et al., “The Landscape of Parallel Computing Research: A View from Berkeley”)

Steps in Conducting Simulations

• Physical Phenomena
• Modeling and mathematical formulation
• Algorithm selection/development
• Programming
• Run on hardware platform
• Verification of the results

• SISD (Sequential Processing)
• SIMD (Lock-step, Data Parallel)
• MIMD (Control Parallel)
• SPMD (A Variation of MIMD: the Data Parallel Model on an MIMD Machine)

Mike Flynn (1967)

Parallelism from the Viewpoint of Computational Models and Parallelism Description

Parallel Programming -A Necessary Evil?-

• The ideal is to obtain better performance (i.e., shorter wall-clock time) without doing anything to the code at all: the automatic parallelizing compiler.

• Why doesn’t it work so easily in many cases? (See the sketch below.)
 - The computational algorithm is inherently serial. Examples: recursive formulations, many branches.
 - The algorithm may be parallelizable, but the actual implementation of the code is NOT.
 - The data structure employed is not suitable for parallel processing. Examples: stack vs. FIFO, array vs. linked list.
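A minimal Fortran sketch of the first point (illustrative only): the first loop below has independent iterations that a compiler can vectorize or parallelize, while the second carries a first-order recurrence and cannot be parallelized as written.

program serial_vs_parallel
  implicit none
  integer, parameter :: n = 100000
  real :: a(n), b(n)
  integer :: i

  call random_number(b)

  ! Independent iterations: each a(i) depends only on b(i),
  ! so the iterations may run in any order (parallelizable).
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do

  ! First-order recurrence: a(i) depends on a(i-1), so the loop
  ! carries a dependence and runs serially as written.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, a(n)
end program serial_vs_parallel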

Bottlenecks in Parallel Processing

• Overhead in creating and finalizing tasks
• Overhead in synchronization
• A significant fraction of non-parallel code (Amdahl’s Law)
• Overhead in data transfer (latency, bandwidth) for the Distributed Memory architecture
• Memory contention for the Shared Memory architecture
• Lack of load balancing

Parallel Processing and Amdahl’s Law

Can a program be run faster in proportion to the number of processors?

• Synchronous Parallel Processing Model: the Barrier Model, Amdahl (1967)
• Asynchronous Parallel Processing Model: the Critical Section Model, Miura (1991)

Synchronous Parallel Processing Model

Barrier Model Gene Amdahl (1967)

Sp(n) = 1/((1-α) + α/n), where α is the parallel fraction of the code and 1-α the serial fraction

[Figure: a program split into its serial and parallel portions; photo of Dr. G. M. Amdahl (2008)]

Amdahl’s Law

Sp(n) = 1/(1−α + α/n)
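A worked example of the formula: with a parallel fraction α = 0.95 and n = 100 processors, Sp = 1/(0.05 + 0.95/100) ≈ 16.8; even as n → ∞, the speed-up is bounded by 1/(1−α) = 20.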

Asynchronous Parallel Processing Model

Critical Section Model Miura (1991)

[Figure: multiple processors P feeding a queue into a critical section C, modeled as an M/M/1 queuing system]

Asynchronous Case (Miura)

Load Imbalance Model

[Figure: n tasks with unequal execution times T1, T2, T3, ..., Tn]

T = Σ Ti (i = 1, ..., n)
Sp(n) = T / max(Ti) ≤ n
The speed-up factor reaches n only if T1 = T2 = ... = Tn.
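A small worked example: four tasks with times 1, 1, 1, and 5 give T = 8 and Sp(4) = 8/5 = 1.6, far below the ideal speed-up of 4.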

Synchronization Overhead Model (load balance is assumed):
Linear overhead: Sp(n) = T(1)/T(n) = 1/(1/n + εn) = n/(1 + εn^2)
Logarithmic overhead: Sp(n) = T(1)/T(n) = 1/(1/n + ε log2 n) = n/(1 + εn log2 n)
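A consequence worth noting for the linear-overhead model: Sp(n) = n/(1 + εn^2) is maximized at n = 1/√ε, so with ε = 0.001 the speed-up peaks at n ≈ 32 with Sp ≈ 15.8 and then declines as more processors are added, which is what the plots below show.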

[Figure: two plots of speed-up vs. n for ε = 0.001; linear overhead (n up to 50, speed-up peaking near 15) and logarithmic overhead (n up to 2000, speed-up approaching 100)]

Hockney’s Performance Model (n1/2 Model)

T(n) = TOH + n·τ (total time = overhead + execution time)
P(n) = n/T(n) = R∞ · n/(n + n1/2)
R∞: peak performance (here R∞ = 1/τ)
n1/2: the value of n which gives half of the peak performance (here n1/2 = TOH/τ)

Performance Model for Vector and Parallel Supercomputers

[Figure: time T(n) vs. n, a straight line with intercept TOH; performance P(n) vs. n, rising toward the asymptote R∞ and reaching R∞/2 at n = n1/2]

Note: here n refers to the problem size, not the number of processors.
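A worked example under these definitions (illustrative numbers): with TOH = 100 ns of startup overhead and τ = 1 ns per result, R∞ = 1 Gresult/s and n1/2 = 100; at n = 100 the machine delivers exactly half of peak. Machines with large startup overhead thus have large n1/2 and need long vectors to approach peak.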

SIMD Computational Model ( Data Parallel)

• Simple Parallelism (Vector, Matrix): ci = ai + bi, C = A + B
• Reduction: s = a1 + a2 + a3 + a4 + ...
• Broadcast: ai = s
• Shift/Rotate: ai = bi-k
• Recurrence: ai = ai-1 + bi ← Problem!

Note: Vector processing is also included in this category.

SIMD Computational Model and Vector/Parallel Processing

- Both vectorization and parallelization look for identical but independently executable arithmetic operations
- Vectorization: search from the innermost loop outward
- Parallelization: search from the outermost loop inward
- The two coincide when partitioning data in simple loops or innermost loops

Examples where modification of algorithm is necessary (1)

- Simple Recurrence -

Ai = ki Ai-1, i.e., Ai = ki ki-1 ...... k3 k2 k1 A0

The recurrence is well suited to serial computing (data locality, better utilization of memory, etc.); recursive doubling reorganizes it for parallel machines.

[Figure: Recursive Doubling - the factors k1 k2 k3 k4 k5 k6 k7 k8 are combined pairwise in log2 n stages; a sketch follows]
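A minimal Fortran sketch of recursive doubling for the simple recurrence (A0 = 1 assumed; in this serial program each parallel step is simulated by an inner loop, whereas on a SIMD machine each step is one vector operation over all i):

program recursive_doubling
  implicit none
  integer, parameter :: n = 8
  real :: k(n), a(n), tmp(n)
  integer :: i, d

  k = [2.0, 1.0, 3.0, 1.0, 2.0, 1.0, 2.0, 1.0]
  a = k                       ! a(i) starts as k(i)

  d = 1
  do while (d < n)
     tmp = a                  ! snapshot: each step reads the previous values
     do i = d + 1, n          ! one parallel step: double the span of each product
        a(i) = tmp(i) * tmp(i-d)
     end do
     d = 2 * d
  end do

  print *, a                  ! a(i) = k(i)*k(i-1)*...*k(1) after log2(n) steps
end program recursive_doubling

The same doubling pattern applied to the 2 x 2 matrices Mi handles the linear recurrence of the next slide, and it is also the parallel-prefix idea that reappears later.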

Examples where modification of algorithm is necessary (2) - Linear Recurrence -

The linear recurrence ai = ki ai-1 + bi can be written in matrix form:

( ai )   ( ki  bi ) ( ai-1 )
( 1  ) = ( 0   1  ) ( 1    )

so that (ai; 1) = Mi Mi-1 ..... M2 M1 (a0; 1), and the matrix products can again be formed by recursive doubling (see the sketch above).

[Figure: recursive doubling applied to the matrix products M1 M2 M3 M4 M5 M6 M7 M8]

Examples of Recurrence Formulas for Iterative Methods

One-Dimensional Case:
Ki-1 Ci-1 + Ki Ci + Ki+1 Ci+1 = di

Two-Dimensional Case:
Ki-1,j-1 Ci-1,j-1 + Ki-1,j+1 Ci-1,j+1 + Ki,j Ci,j + Ki+1,j-1 Ci+1,j-1 + Ki+1,j+1 Ci+1,j+1 = di,j

Cyclic Reduction for Tridiagonal Equations

( b1 c1          ) ( x1 )   ( k1 )
( a2 b2 c2       ) ( x2 ) = ( k2 )
(    a3 b3 c3    ) ( x3 )   ( k3 )
(       .  .  .  ) ( .. )   ( .. )

(entries outside the three diagonals are zero)

Cyclic Reduction (Serial)

Cyclic Reduction (Parallel)

(One level of the reduction step is sketched below.)
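A Fortran sketch of one reduction level, assuming n = 2^m - 1 and the system a(i)x(i-1) + b(i)x(i) + c(i)x(i+1) = k(i): each even-numbered equation absorbs its odd neighbors, and the even iterations are mutually independent, which is where the parallelism comes from.

program cyclic_reduction_step
  implicit none
  integer, parameter :: n = 7          ! n = 2**m - 1 assumed
  real :: a(n), b(n), c(n), k(n)
  real :: alpha, gamma
  integer :: i

  ! an example diagonally dominant tridiagonal system
  a = -1.0; b = 4.0; c = -1.0; k = 1.0
  a(1) = 0.0; c(n) = 0.0

  ! one reduction level: eliminate the odd unknowns from the even
  ! equations; the even iterations read only odd-indexed entries,
  ! so they are independent and this loop is the parallel step
  do i = 2, n-1, 2
     alpha = -a(i) / b(i-1)
     gamma = -c(i) / b(i+1)
     k(i) = k(i) + alpha*k(i-1) + gamma*k(i+1)
     b(i) = b(i) + alpha*c(i-1) + gamma*a(i+1)
     a(i) = alpha * a(i-1)
     c(i) = gamma * c(i+1)
  end do

  print *, 'reduced even equations: b =', (b(i), i = 2, n-1, 2)
end program cyclic_reduction_step

Repeating the step log2(n+1) - 1 times leaves a single equation; back-substitution then recovers the odd unknowns level by level.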

Random Number Generation Algorithms

(1) Linear Congruential Method: Xn = (a Xn-1 + c) mod M, where M = 2^j (usually the machine word size)
    Multiplicative (c = 0): period = 2^(j-2); Mixed (c ≠ 0): period = 2^j
(2) Binary M-sequence with a Primitive Trinomial: Xn = (Xn-m .xor. Xn-k) mod 2 → period = 2^k - 1 (m < k)
(3) Generalized Fibonacci Method with a Primitive Trinomial: Xn = (Xn-m op Xn-k) mod M → period = (2^k - 1)·2^(j-1) ≈ 2^(k+j-1), where op is one of {+, -, *}
(4) Generalized Recurrence Method with a Large Prime Modulus (Multiple Recursive Generator, or MRG): Xn = (a1 Xn-1 + a2 Xn-2 + a3 Xn-3 + ... + ak Xn-k) mod p → period = p^k - 1 ≈ 2^(j·k), where p ≈ 2^j is a prime number (e.g., 2^31 - 1)
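A minimal Fortran sketch of method (4), the MRG, with p = 2^31 - 1; 64-bit integers keep every product exact, as the next slide notes. The coefficients used here are placeholders for illustration, NOT a vetted full-primitive-polynomial set.

program mrg_sketch
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8), parameter :: p = 2147483647_i8     ! 2**31 - 1, prime
  integer, parameter :: k = 8
  integer(i8) :: a(k), x(k), xn, s
  integer :: n, j

  a = [integer(i8) :: 1071064, 0, 0, 0, 0, 0, 0, 2113664]  ! hypothetical a1..a8
  x = 12345_i8                                    ! seed history X_{n-1}..X_{n-k}

  do n = 1, 10
     s = 0_i8
     do j = 1, k
        s = mod(s + mod(a(j) * x(j), p), p)       ! 64-bit product stays exact
     end do
     xn = s
     x(2:k) = x(1:k-1)                            ! shift the history window
     x(1) = xn
     print *, real(xn) / real(p)                  ! uniform variate in (0,1)
  end do
end program mrg_sketch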

Consider p = 2^31 - 1 and k = 8. The characteristic polynomial is
f(x) = x^8 - a1 x^7 - a2 x^6 - a3 x^5 - a4 x^4 - a5 x^3 - a6 x^2 - a7 x - a8 (mod p)

- Good lattice structure with full coefficients
- Simple and fast implementation with 64-bit arithmetic, when the modulus is p = 2^31 - 1
- Long period: (2^31 - 1)^8 - 1 ≈ 4.5 × 10^74
- Fraction of primitive polynomials: φ(p^k - 1)/k/(p^k - 1) ≈ 2.2%
- Easy to extend for vector/parallel processing

An Example of an MRG with an 8th-order Full Primitive Polynomial

Vectorization and Parallelization of MRG - Decimating the Sequence -

Transfer matrix:

        ( a1 a2 a3 a4 a5 a6 a7 a8 )
        (  1  0  0  0  0  0  0  0 )
        (  0  1  0  0  0  0  0  0 )
    A = (  0  0  1  0  0  0  0  0 )
        (  0  0  0  1  0  0  0  0 )
        (  0  0  0  0  1  0  0  0 )
        (  0  0  0  0  0  1  0  0 )
        (  0  0  0  0  0  0  1  0 )

x = ( xn-1, xn-2, xn-3, xn-4, xn-5, xn-6, xn-7, xn-8 )^T,
x' = ( xn, xn-1, xn-2, xn-3, xn-4, xn-5, xn-6, xn-7 )^T

Then x' = A x Mod(p).

Compute A^2, A^4, A^8, ...... Mod(p), once; the new polynomial (the recurrence for the decimated sequence) is obtained by multiplying these matrices.

Vectorization and Parallelization of MRG (Continued)

In order to compute xn = A^n x0 mod(p):
(1) Store Aj = A^(2^j) mod(p) (j = 0, 1, 2, 3, 4, ...).
(2) Represent n in binary form, e.g., (bm-1, ..., b2, b1, b0).
(3) Multiply I by Aj mod(p) whenever bj = 1 (j = 0, ..., m-1), where I is the k-th order identity matrix.
Note: the same strategy works with polynomial arithmetic (Knuth). A sketch follows.
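A Fortran sketch of this binary-exponentiation skip-ahead (coefficient values are again hypothetical placeholders; here the powers A^(2^j) are squared on the fly rather than stored as a table):

program mrg_skip_ahead
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8), parameter :: p = 2147483647_i8
  integer, parameter :: k = 8
  integer(i8) :: A(k,k), An(k,k), x0(k), xn(k), n
  integer :: i

  ! build the transfer matrix: first row a1..ak, subdiagonal of ones
  A = 0_i8
  A(1,1) = 1071064_i8; A(1,k) = 2113664_i8   ! hypothetical a1, a8
  do i = 2, k
     A(i,i-1) = 1_i8
  end do

  n = 1000000_i8                             ! jump this far ahead
  An = identity()
  do while (n > 0_i8)                        ! scan the bits of n
     if (mod(n, 2_i8) == 1_i8) An = matmul_mod(An, A)
     A = matmul_mod(A, A)                    ! next power A**(2**j)
     n = n / 2_i8
  end do

  x0 = 12345_i8
  xn = vecmul_mod(An, x0)                    ! x_n = A**n x_0 mod p
  print *, xn

contains
  function identity() result(Id)
    integer(i8) :: Id(k,k)
    integer :: i
    Id = 0_i8
    do i = 1, k
       Id(i,i) = 1_i8
    end do
  end function

  function matmul_mod(X, Y) result(Z)
    integer(i8), intent(in) :: X(k,k), Y(k,k)
    integer(i8) :: Z(k,k)
    integer :: i, j, l
    Z = 0_i8
    do j = 1, k
       do l = 1, k
          do i = 1, k
             Z(i,j) = mod(Z(i,j) + mod(X(i,l) * Y(l,j), p), p)
          end do
       end do
    end do
  end function

  function vecmul_mod(X, v) result(w)
    integer(i8), intent(in) :: X(k,k), v(k)
    integer(i8) :: w(k)
    integer :: i, l
    w = 0_i8
    do i = 1, k
       do l = 1, k
          w(i) = mod(w(i) + mod(X(i,l) * v(l), p), p)
       end do
    end do
  end function
end program mrg_skip_ahead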

SPMD Computational Model -Application of Data Parallel to MIMD Architecture-

• A “SINGLE” program runs on all processors
• Multiple instruction streams exist at execution time
• Each processor takes care of a portion of the large-sized data; the behavior is similar to SIMD, but locally independent operations (say, conditional branches) are allowed
• Various synchronization mechanisms, and a way to describe them, become necessary

MIMD Computational Model -Control Parallel or Task Parallel-

• Master-Slave type operations
• Fork/Join constructs
• Synchronization mechanisms and their description (Barrier, Semaphore, Lock/Unlock)
• Data transfer in the Distributed Memory case (Send/Receive protocol)
Examples: event-parallel transport Monte Carlo simulation, ray tracing
(A minimal fork/join sketch follows.)
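A minimal fork/join sketch in Fortran; the slide names the constructs, not a particular programming interface, so the use of OpenMP directives here is an assumption (the program also runs serially if OpenMP is disabled).

program fork_join
  implicit none
  integer :: i
  real :: partial(4)

  !$omp parallel do                  ! fork: worker tasks are created
  do i = 1, 4
     partial(i) = real(i)**2         ! each task handles one piece of work
  end do
  !$omp end parallel do              ! join: implicit barrier, workers merge

  print *, sum(partial)              ! the master continues alone
end program fork_join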

Parallel Prefix

Source: L. Snyder, “Parallel Programming”

Three Models of Dense Matrix Multiplications

• Inner Product
Do i = 1, n
  Do j = 1, n
    Do k = 1, n          ! innermost loop: dot product of row i of A and column j of B
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

Cij = Σk Aik * Bkj

• Middle Product
Do j = 1, n
  Do k = 1, n
    Do i = 1, n          ! innermost loop: vector update of column C(:,j) by A(:,k)*B(k,j)
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

• Outer Product
Do k = 1, n              ! each k step is a rank-1 update: C = C + A(:,k)*B(k,:)
  Do j = 1, n
    Do i = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

Three Models of Dense Matrix Multiplications (1)

• Inner Product
Do i = 1, n
  Do j = 1, n
    sum = 0
    Do k = 1, n
      sum = sum + A(i,k)*B(k,j)   ! scalar accumulation avoids repeated stores to C(i,j)
    enddo
    C(i,j) = sum
  enddo
enddo

[Figure: each element of C is the product of a row of A and a column of B]

Cij = Σk Aik * Bkj

Three Models of Dense Matrix Multiplications (2)

• Middle Product
Do j = 1, n
  Do k = 1, n
    Do i = 1, n          ! column C(:,j) accumulates columns of A scaled by B(k,j)
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

[Figure: column j of C accumulates columns of A scaled by elements of B]

Cij = Σk Aik * Bkj

Three Models of Dense Matrix Multiplications(3)

• Outer Product
Do k = 1, n              ! rank-1 update per step: column of A times row of B
  Do j = 1, n
    Do i = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

[Figure: C accumulates rank-1 updates, a column of A times a row of B]

Cij = Σk Aik * Bkj

Strassen’s Algorithm for Matrix Multiply

• Multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing.

• The conventional algorithm for n x n matrices requires O(n^3) operations.

• Strassen’s algorithm, introduced in 1969, requires O(n^log2(7)) ≈ O(n^2.807) operations.

• It is another Divide-and-Conquer approach.

References: 1. V. Strassen, “Gaussian Elimination is Not Optimal.” Numerische Mathematik, 13:354-356, 1969.

2. S. Huss-Lederman et al., “Implementation of Strassen’s Algorithm for Matrix Multiplication”, SC96 Technical Paper.

Conventional Matrix Multiply

No. of scalar multiplications = n^3

No. of scalar additions = n^3 - n^2

Total arithmetic operations = 2n^3 - n^2

A,B,C are n by n matrices.

C = A * B

Conventional Matrix Multiply (with submatrices)

( C11 C12 )   ( A11 A12 )   ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) * ( B21 B22 )

C11 = A11 B11 + A12 B21 C12 = A11 B12 + A12 B22

C21 = A21 B11 + A22 B21 C22 = A21 B12 + A22 B22

No. of matrix multiplications = 8

No. of matrix additions = 4

Total arithmetic operations = 8(2(n/2)^3 - (n/2)^2) + 4(n/2)^2 = 2n^3 - n^2

Strassen’s Algorithm - 1/3

• Strassen’s method uses fewer multiplications, offset by more additions and subtractions.

• For each pair of sub-matrices, there are 7 multiplications and 18 additions/subtractions.

• Among these operations, 7 multiplications and 10 additions/subtractions occur in the steps that calculate the P’s.

( C11 C12 )   ( A11 A12 )   ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) * ( B21 B22 )

C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 + P3 - P2 + P6

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 - B22)
P4 = A22 (B21 - B11)
P5 = (A11 + A12) B22
P6 = (A21 - A11)(B11 + B12)
P7 = (A12 - A22)(B21 + B22)
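A one-level Fortran sketch of these formulas (n even assumed; the seven products here use the intrinsic matmul, but each could itself be formed recursively):

program strassen_one_level
  implicit none
  integer, parameter :: n = 4, h = n/2
  real :: A(n,n), B(n,n), C(n,n)
  real :: P1(h,h), P2(h,h), P3(h,h), P4(h,h), P5(h,h), P6(h,h), P7(h,h)

  call random_number(A)
  call random_number(B)

  associate (A11 => A(1:h,1:h), A12 => A(1:h,h+1:n), &
             A21 => A(h+1:n,1:h), A22 => A(h+1:n,h+1:n), &
             B11 => B(1:h,1:h), B12 => B(1:h,h+1:n), &
             B21 => B(h+1:n,1:h), B22 => B(h+1:n,h+1:n))
    P1 = matmul(A11 + A22, B11 + B22)   ! the seven block products
    P2 = matmul(A21 + A22, B11)
    P3 = matmul(A11, B12 - B22)
    P4 = matmul(A22, B21 - B11)
    P5 = matmul(A11 + A12, B22)
    P6 = matmul(A21 - A11, B11 + B12)
    P7 = matmul(A12 - A22, B21 + B22)
  end associate

  C(1:h,1:h)     = P1 + P4 - P5 + P7    ! reassemble the four quadrants
  C(1:h,h+1:n)   = P3 + P5
  C(h+1:n,1:h)   = P2 + P4
  C(h+1:n,h+1:n) = P1 + P3 - P2 + P6

  print *, maxval(abs(C - matmul(A, B)))   ! should be at round-off level
end program strassen_one_level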

Strassen’s Algorithm - 2/3

• On 2 x 2 matrices whose elements are n x n blocks, the arithmetic operation counts are:

                Mult   Add    Complexity
  Conventional    8      4    16n^3 - 4n^2
  Strassen        7     18    14n^3 + 11n^2

• One matrix multiply is replaced by 14 matrix additions.

Strassen’s Algorithm - 3/3

• When one level of Strassen’s algorithm is applied to 2 x 2 matrices whose elements are (n/2) x (n/2) blocks, and the conventional algorithm is used for the seven block matrix multiplications, the total operation count is

7(2(n/2)^3 - (n/2)^2) + 18(n/2)^2 = (7/4)n^3 + (11/4)n^2

R = Strassen operation count / Conventional operation count
  = (7n^3 + 11n^2) / (8n^3 - 4n^2)

lim R = 7/8 as n → ∞: a 12.5% improvement!

Note: code can be accessed from http://www-unix.mcs.anl.gov/prism/lib/software.html

Winograd’s Variant of Strassen’s Algorithm

• Based on Strassen’s algorithm, Winograd reduced the number of additions/subtractions by 3 (from 18 to 15) by rearranging the calculation into 4 stages.

Stage 1: S1 = A21 + A22, S2 = S1 - A11, S3 = A11 - A21, S4 = A12 - S2
Stage 2: T1 = B12 - B11, T2 = B22 - T1, T3 = B22 - B12, T4 = B21 - T2
Stage 3: P1 = A11 B11, P2 = A12 B21, P3 = S1 T1, P4 = S2 T2, P5 = S3 T3, P6 = S4 B22, P7 = A22 T4
Stage 4: U1 = P1 + P2, U2 = P1 + P4, U3 = U2 + P5, U4 = U3 + P7, U5 = U3 + P3, U6 = U2 + P3, U7 = U6 + P6

C11 = U1, C12 = U7, C21 = U4, C22 = U5

7 multiplications, 15 additions/subtractions.

(It takes 7^k multiplications and 5(7^k - 4^k) additions/subtractions to multiply 2^k x 2^k matrices.)

Usage of Strassen’s Algorithm

• Usage of Strassen’s algorithm is limited because:
 - Additional memory is required to store the P matrices.
 - More memory traffic is necessary, so memory bandwidth plays a key role.

• Loss of significance in Strassen’s algorithm:
 - Caused by adding relatively large and very small numbers.

Performance example (vs. the Cray library MXM on a Cray-2; David Bailey, 1988):
 n = 64 → 1.35x faster; n = 2048 → 2.01x faster

Fourier Transform and FFT

• Discrete Fourier Transform (DFT)

Zi = (1/n) Σk ω^(ik) * Xk, where ω = exp(-2πj/n)

O(n^2) → O(n log n) with the FFT

Fourier Transform and FFT

• Butterfly operation (DIT: Decimation in Time)

Z = X + αn^k * Y
W = X - αn^k * Y
where αn = exp(2πj/n)

[Figure: DIT butterfly: inputs X and Y; Y is multiplied by the twiddle factor αn^k, then Z = X + αn^k Y and W = X - αn^k Y]
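A minimal recursive radix-2 DIT FFT sketch in Fortran built from this butterfly (n a power of 2 assumed; the twiddle-factor sign follows the slide’s convention exp(+2πj/n), so swap the exponent’s sign for the other convention):

module fft_mod
  implicit none
contains
  recursive subroutine fft(x)
    complex, intent(inout) :: x(:)
    complex :: even(size(x)/2), odd(size(x)/2), t
    real, parameter :: pi = 3.14159265358979
    integer :: n, k
    n = size(x)
    if (n <= 1) return
    even = x(1:n:2)                      ! even-indexed samples (0, 2, 4, ...)
    odd  = x(2:n:2)                      ! odd-indexed samples (1, 3, 5, ...)
    call fft(even)
    call fft(odd)
    do k = 1, n/2
       ! butterfly: Z = X + w*Y, W = X - w*Y, with w = exp(2*pi*j*(k-1)/n)
       t = exp(cmplx(0.0, 2.0*pi*(k-1)/n)) * odd(k)
       x(k)       = even(k) + t
       x(k + n/2) = even(k) - t
    end do
  end subroutine fft
end module fft_mod

program test_fft
  use fft_mod
  implicit none
  complex :: x(8)
  x = [(1.0,0.0), (1.0,0.0), (1.0,0.0), (1.0,0.0), &
       (0.0,0.0), (0.0,0.0), (0.0,0.0), (0.0,0.0)]
  call fft(x)
  print *, x
end program test_fft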

Fourier Transform and FFT

• Butterfly operation (DIF: Decimation in Frequency)

Z = X + Y
W = (X - Y) * αn^k
where αn = exp(2πj/n)

[Figure: DIF butterfly: inputs X and Y; Z = X + Y, and W = (X - Y) multiplied by the twiddle factor αn^k]

Number of Operations for the Complex Fourier Transform (N: power of 2)

FFT: per butterfly, add/sub = 6, mult = 4; for N points, total ops = 5N log2 N
DFT (complex matrix-vector multiplication): per output point, add/sub = 4N - 2, mult = 4N; for N points, total ops = 8N^2 - 2N = 2N(4N - 1)

If the efficiency of the FFT is 1%, a 512-point DFT is faster than the FFT.
If the efficiency of the FFT is 3%, a 128-point DFT is faster than the FFT.
If the efficiency of the FFT is 5%, a 64-point DFT is faster than the FFT.
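A worked check of the first line: for N = 512, the FFT costs 5 · 512 · 9 = 23,040 operations while the DFT costs 8 · 512^2 - 2 · 512 ≈ 2.1 × 10^6, a ratio of about 91; so an FFT running at about 1% of the DFT’s efficiency indeed loses to a fully efficient DFT.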

Fast Fourier Transform

FFT (Isogeometric)

FFT (Self-sorting variant due to Stockham)

Applications

Concept of Numerical Weather Prediction (Richardson, 1922)

Simulation of Precipitation

Simulation of a Typhoon

Source: Earth Simulator Center

Simulation of Tsunami (March 11, 2011)

Source: Prof. Imamura, Tohoku Univ.

Seismic Data Processing for Oil Exploration

Application Areas for Petaflops

• Computational testing and simulation as a replacement for weapons testing (stockpile stewardship)

• Simulation of plasma fusion devices and basic physics for controlled fusion (to optimize design of future reactors)

• Design of new chemical compounds and synthesis pathways (environmental safety and cost improvements)

• Comprehensive modeling of groundwater and oil reservoirs (contamination and management)

• Modeling of complex transportation, communication and economic systems

• Time-dependent simulations of complex biomolecules (membranes, synthesis machinery, and DNA)

• Multidisciplinary optimization problems combining structures, fluids and geometry

• Modeling of integrated earth systems (ocean, atmosphere, bio-geosphere)

• Improved 4d/6d data assimilation capability applied to remote sensing and environmental models

• Computational cosmology (integration of particle models, astrophysical fluids and radiation transport)

• Materials simulations that bridge the gap between microscale and macroscale (bulk materials)

• Coupled electro-mechanical simulations of nano-scale structures (dynamics and mechanics of micromachines)

• Full plant optimization for complex processes (chemical, manufacturing and assembly problems)

• High-resolution reacting flow problems (combustion, chemical mixing and multiphase flow)

• High-realism immersive virtual reality based on real-time radiosity modeling and complex scenes
