Parallel Algorithms and Applications (2012)

Transcript
Page 1: Parallel Algorithms and Applications(2012)

Parallel Algorithms

January 2012

Kenichi Miura

Page 2: Parallel Algorithms and Applications(2012)

Classification of Computational Models (Miura 1980)

• Continuum Model: Fluid Model (Eulerian view), discretization of PDEs
• Particle Model: Many-body problems (Lagrangian view), discretization of ODEs (e.g., Newton's equations)
• Structural Model: Discrete model, sparse matrix formulation
• Mathematical Transform: Fourier transform, linear algebra

Page 3: Parallel Algorithms and Applications(2012)

High-end simulation in the physical sciences consists of seven algorithms:

1. Structured Grids (including locally structured grids, e.g. AMR)

2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

Well-defined targets from the algorithmic and software standpoint.
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004

Phillip Colella's "Seven Dwarfs" (2004)

Page 4: Parallel Algorithms and Applications(2012)

High-end simulation in the physical sciences consists of thirteen algorithms:

1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods (Fast Fourier Transform)
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce (including Monte Carlo Methods)
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines

The "Thirteen Dwarfs" (2006): Colella's seven dwarfs as extended by the Berkeley View project (Asanović et al.)

Page 5: Parallel Algorithms and Applications(2012)
Page 6: Parallel Algorithms and Applications(2012)

Steps in Conducting Simulations

• Physical phenomena
• Modeling and mathematical formulation
• Algorithm selection/development
• Programming
• Run on the hardware platform
• Verification of the results

Page 7: Parallel Algorithms and Applications(2012)

Parallelism from the Viewpoint of Computational Models and Parallelism Description

• SISD (Sequential Processing)
• SIMD (Lock-step, Data Parallel)
• MIMD (Control Parallel)
• SPMD (a variation of MIMD: the Data Parallel model on a MIMD machine)

Mike Flynn (1967)

Page 8: Parallel Algorithms and Applications(2012)

Parallel Programming -A Necessary Evil?-

• Ideally, one would obtain performance (i.e., shorter wall-clock time) without touching the code at all: the automatic parallelizing compiler.

• Why doesn't it work so easily in many cases?
  - The computational algorithm is inherently serial. Examples: recursive formulations, many branches.
  - The algorithm may be parallelizable, but the actual implementation of the code is NOT.
  - The data structure employed is not suitable for parallel processing. Examples: stack vs. FIFO, array vs. linked list.

Page 9: Parallel Algorithms and Applications(2012)

Bottlenecks in Parallel Processing

• Overhead in creating and finalizing tasks
• Overhead in synchronization
• A significant fraction of non-parallel code (Amdahl's Law)
• Overhead in data transfer (latency, bandwidth) for the distributed memory architecture
• Memory contention for the shared memory architecture
• Lack of load balancing

Page 10: Parallel Algorithms and Applications(2012)

Parallel Processing and Amdahl’s Law

Can a program be run faster in proportion to the number of processors?

• Synchronous parallel processing model: the Barrier Model, Amdahl (1967)
• Asynchronous parallel processing model: the Critical Section Model, Miura (1991)

Page 11: Parallel Algorithms and Applications(2012)

Synchronous Parallel Processing Model

Barrier Model Gene Amdahl (1967)

Sp(n) = 1 / ((1 − α) + α/n)

where α is the fraction of the code that runs in parallel and 1 − α is the serial fraction.

[Figure: execution split into a serial portion (1 − α) and a parallel portion (α); photograph of Dr. G. M. Amdahl (2008)]

Page 12: Parallel Algorithms and Applications(2012)

Amdahl’s Law

Sp(n) = 1/(1−α + α/n)
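
The law is easy to evaluate directly; below is a minimal Python sketch (added, not from the original slides; the 95%-parallel example is illustrative):

    def amdahl_speedup(alpha, n):
        """Amdahl's law: alpha = parallel fraction of the code, n = processor count."""
        return 1.0 / ((1.0 - alpha) + alpha / n)

    # Illustrative: a 95%-parallel code on 16 processors gains only about 9.1x,
    # and no processor count can push it past 1/(1 - alpha) = 20x.
    print(amdahl_speedup(0.95, 16))   # ~9.14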

Page 13: Parallel Algorithms and Applications(2012)

Asynchronous Parallel Processing Model

Critical Section Model Miura (1991)

[Figure: n processors (P) issue requests into a queue in front of a single critical section (C); the critical section is modeled as an M/M/1 queuing system]
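
The slide names the M/M/1 queue without giving formulas; the sketch below (added, with illustrative rates) computes the standard M/M/1 quantities such a model rests on:

    def mm1_stats(lam, mu):
        """Textbook M/M/1 queue: arrival rate lam, service rate mu (requires lam < mu)."""
        rho = lam / mu            # utilization of the critical section
        w = 1.0 / (mu - lam)      # mean time in system (queue wait + service)
        wq = rho / (mu - lam)     # mean wait in the queue alone
        lq = lam * wq             # mean queue length (Little's law)
        return rho, w, wq, lq

    # Illustrative rates: requests arrive at 8/s against a section serving 10/s.
    print(mm1_stats(8.0, 10.0))   # rho = 0.8, w = 0.5, wq = 0.4, lq = 3.2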

Page 14: Parallel Algorithms and Applications(2012)

Asynchronous Case (Miura)

Page 15: Parallel Algorithms and Applications(2012)

Load Imbalance Model

[Figure: bar chart of per-processor task times T1, T2, T3, ..., Tn]

T = Σ Ti (i = 1, ..., n)
Sp(n) = T / max(Ti) ≤ n

The speed-up factor equals n exactly when T1 = T2 = ··· = Tn.
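
A two-line Python sketch (added; the task times are illustrative) makes the model concrete:

    def imbalance_speedup(task_times):
        """Sp(n) = T / max(Ti): total work divided by the slowest processor's time."""
        return sum(task_times) / max(task_times)

    print(imbalance_speedup([4, 4, 4, 4]))   # balanced: 4.0 on 4 tasks
    print(imbalance_speedup([7, 3, 3, 3]))   # imbalanced: 16/7, about 2.29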

Page 16: Parallel Algorithms and Applications(2012)

Synchronization Overhead Model (load balance is assumed)

Sp(n) = T(1)/T(n) = 1 / (1/n + ε·n) = n / (1 + ε·n^2)   (linear overhead)

Sp(n) = T(1)/T(n) = 1 / (1/n + ε·log2 n) = n / (1 + ε·n·log2 n)   (logarithmic overhead)

[Figure: speed-up vs. n for ε = 0.001; the linear-overhead curve (n up to 50) peaks around 16 near n ≈ 32 and then falls, while the logarithmic-overhead curve (n up to 2000) keeps climbing toward roughly 90]
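
Both curves are easy to reproduce; differentiating n/(1 + ε·n^2) also gives the best processor count n = 1/√ε for linear overhead, beyond which adding processors slows the run. A Python sketch (added):

    import math

    def sp_linear(n, eps):
        """Speed-up with linear synchronization overhead: n / (1 + eps * n^2)."""
        return n / (1.0 + eps * n * n)

    def sp_log(n, eps):
        """Speed-up with logarithmic overhead: n / (1 + eps * n * log2(n))."""
        return n / (1.0 + eps * n * math.log2(n))

    eps = 0.001
    n_best = 1.0 / math.sqrt(eps)   # ~31.6 processors: the peak of the linear curve
    print(sp_linear(32, eps))       # ~15.8
    print(sp_log(2000, eps))        # ~87: still climbing at n = 2000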

Page 17: Parallel Algorithms and Applications(2012)

Hockney’s Performance Model (n1/2 Model)

T(n) = TOH + n·τ   (total time = overhead + execution time)

P(n) = n / T(n) = R∞ · n / (n + n1/2)

where R∞ = 1/τ is the peak (asymptotic) performance and n1/2 = TOH/τ is the n that gives half of the peak performance.

Performance Model for Vector and Parallel Supercomputers

[Figure: T(n) vs. n, a straight line with intercept TOH; P(n) vs. n, rising toward R∞ and passing through R∞/2 at n = n1/2]

Note: n refers to the problem size, not the number of processors.
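
A tiny Python check (added; the TOH and τ values are illustrative) confirms that performance reaches half of peak at n = n1/2:

    def hockney_perf(n, t_oh, tau):
        """P(n) = n / T(n) with T(n) = T_OH + n*tau; equals R_inf*n/(n + n_half)."""
        return n / (t_oh + n * tau)

    t_oh, tau = 100.0, 0.5     # illustrative overhead and per-element time
    r_inf = 1.0 / tau          # peak performance
    n_half = t_oh / tau        # problem size reaching half of peak
    print(hockney_perf(n_half, t_oh, tau), r_inf / 2.0)   # both print 1.0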

Page 18: Parallel Algorithms and Applications(2012)

SIMD Computational Model ( Data Parallel)

• Simple parallelism (vector, matrix): c_i = a_i + b_i, C = A + B
• Reduction: s = a_1 + a_2 + a_3 + a_4 + ···
• Broadcast: a_i = s
• Shift/Rotate: a_i = b_{i−k}
• Recurrence: a_i = a_{i−1} + b_i ← Problem!

Note: vector processing is also included in this category.

Page 19: Parallel Algorithms and Applications(2012)

SIMD Computational Model and Vector/Parallel Processing

- Both vectorization and parallelization look for identical but independently executable arithmetic operations.
- Vectorization searches from the innermost loop outward; parallelization searches from the outermost loop inward.
- The two coincide when partitioning data in simple loops or innermost loops.

Page 20: Parallel Algorithms and Applications(2012)

Examples where modification of algorithm is necessary (1)

- Simple recurrence: A_i = k_i A_{i−1}. The recurrence form is suitable for serial computing (data locality, better utilization of memory, etc.); unrolled, A_i = k_i k_{i−1} ··· k_3 k_2 k_1 A_0, a product that recursive doubling evaluates in parallel.

[Figure: recursive doubling tree combining the factors k1 k2 k3 k4 k5 k6 k7 k8 pairwise in log2 8 = 3 steps]
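
A serial Python simulation of recursive doubling (added); each pass of the while-loop stands for one parallel step:

    def prefix_products(k):
        """All prefix products k1, k1*k2, k1*k2*k3, ... in ceil(log2 n) doubling steps."""
        a = list(k)
        d = 1
        while d < len(a):
            # On a parallel machine every element updates simultaneously in this step.
            a = [a[i] * a[i - d] if i >= d else a[i] for i in range(len(a))]
            d *= 2
        return a

    # With A0 = 1, element i-1 of the result is A_i = k_i * ... * k_1 * A0.
    print(prefix_products([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 2, 6, 24, 120, 720, 5040, 40320]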

Page 21: Parallel Algorithms and Applications(2012)

Examples where modification of algorithm is necessary (2) - Linear Recurrence -

The linear recurrence a_i = k_i a_{i−1} + b_i can be recast with 2 × 2 matrices:

( a_i )   ( k_i  b_i ) ( a_{i−1} )
(     ) = (          ) (         )
(  1  )   (  0    1  ) (    1    )

so that (a_i, 1)^T = M_i M_{i−1} ··· M_2 M_1 (a_0, 1)^T, and the matrix products can again be formed by recursive doubling.

[Figure: recursive doubling tree over M1 M2 M3 M4 M5 M6 M7 M8]
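
Because matrix multiplication is associative but not commutative, the same doubling scan applies as long as the newer factor is kept on the left. A Python/NumPy sketch (added):

    import numpy as np

    def linear_recurrence(k, b, a0):
        """Evaluate a_i = k_i*a_{i-1} + b_i via a doubling scan over M_i = [[k_i, b_i], [0, 1]]."""
        mats = [np.array([[ki, bi], [0.0, 1.0]]) for ki, bi in zip(k, b)]
        n, d = len(mats), 1
        while d < n:
            # prefix at i becomes M_i * M_{i-1} * ... * M_1 (newer matrix on the left)
            mats = [mats[i] @ mats[i - d] if i >= d else mats[i] for i in range(n)]
            d *= 2
        v0 = np.array([a0, 1.0])
        return [float((M @ v0)[0]) for M in mats]

    print(linear_recurrence([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], 1.0))   # [3.0, 7.0, 15.0]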

Page 22: Parallel Algorithms and Applications(2012)

Examples of Recurrence Formulas for Iterative Methods

One-dimensional case:

K_{i−1} C_{i−1} + K_i C_i + K_{i+1} C_{i+1} = d_i

Two-dimensional case:

K_{i−1,j−1} C_{i−1,j−1} + K_{i−1,j+1} C_{i−1,j+1} + K_{i,j} C_{i,j} + K_{i+1,j−1} C_{i+1,j−1} + K_{i+1,j+1} C_{i+1,j+1} = d_{i,j}

Page 23: Parallel Algorithms and Applications(2012)

Cyclic Reduction for Tridiagonal Equations

( b1 c1          ) ( x1 )   ( k1 )
( a2 b2 c2       ) ( x2 )   ( k2 )
(    a3 b3 c3    ) ( x3 ) = ( k3 )
(       .  .  .  ) ( .. )   ( .. )

with zeros everywhere outside the three diagonals.

Page 24: Parallel Algorithms and Applications(2012)

Cyclic Reduction (Serial)

Page 25: Parallel Algorithms and Applications(2012)

Cyclic Reduction (Parallel)
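
The serial and parallel figures are not reproduced in this transcript. As a stand-in, here is a hedged Python sketch (added) of the elimination the slides illustrate: each level combines every odd-indexed row with its two neighbors to remove the even-indexed unknowns, leaving a half-sized tridiagonal system (it assumes the system is nonsingular, e.g., diagonally dominant):

    def cyclic_reduction(a, b, c, d):
        """Solve a tridiagonal system: a = sub-diagonal (a[0] = 0), b = diagonal,
        c = super-diagonal (c[-1] = 0), d = right-hand side."""
        n = len(b)
        if n == 1:
            return [d[0] / b[0]]
        ra, rb, rc, rd = [], [], [], []
        for i in range(1, n, 2):                 # eliminate the even-indexed unknowns
            alpha = a[i] / b[i - 1]
            gamma = c[i] / b[i + 1] if i + 1 < n else 0.0
            ra.append(-alpha * a[i - 1])
            rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if i + 1 < n else 0.0))
            rc.append(-gamma * c[i + 1] if i + 1 < n else 0.0)
            rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if i + 1 < n else 0.0))
        xr = cyclic_reduction(ra, rb, rc, rd)    # recurse on the half-sized system
        x = [0.0] * n
        x[1::2] = xr
        for i in range(0, n, 2):                 # back-substitute the eliminated unknowns
            left = a[i] * x[i - 1] if i > 0 else 0.0
            right = c[i] * x[i + 1] if i + 1 < n else 0.0
            x[i] = (d[i] - left - right) / b[i]
        return x

    # [[2,1,0],[1,2,1],[0,1,2]] x = [1,2,3]  ->  x = [0.5, 0.0, 1.5]
    print(cyclic_reduction([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [1.0, 2.0, 3.0]))

In the parallel variant, all odd-indexed eliminations within one level run simultaneously.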

Page 26: Parallel Algorithms and Applications(2012)

Random Number Generation Algorithms

(1) Linear Congruential Method: X_n = (a X_{n−1} + c) mod M, where M = 2^j (usually the machine word size)
    Multiplicative (c = 0): period = 2^{j−2}
    Mixed (c ≠ 0): period = 2^j

(2) Binary M-sequence with a Primitive Trinomial: X_n = (X_{n−m} xor X_{n−k}) mod 2 → period = 2^k − 1 (m < k)

(3) Generalized Fibonacci Method with a Primitive Trinomial: X_n = (X_{n−m} op X_{n−k}) mod M → period = (2^k − 1)·2^{j−1} ≈ 2^{k+j−1}, where op is one of {+, −, *}

(4) Generalized Recurrence Method with a Large Prime Modulus (Multiple Recursive Generator, MRG): X_n = (a_1 X_{n−1} + a_2 X_{n−2} + a_3 X_{n−3} + ··· + a_k X_{n−k}) mod p → period = p^k − 1 ≈ 2^{jk}, where p ≈ 2^j is a prime number (e.g., 2^31 − 1)
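
As a concrete instance of method (1), a mixed LCG sketch in Python (added; the multiplier and increment are the well-known "Numerical Recipes" constants, chosen purely for illustration):

    def lcg(seed, a=1664525, c=1013904223, m=2**32):
        """Mixed linear congruential generator: X_n = (a * X_{n-1} + c) mod M, M = 2^32."""
        x = seed % m
        while True:
            x = (a * x + c) % m
            yield x

    g = lcg(12345)
    print([next(g) for _ in range(3)])   # a reproducible stream of 32-bit integers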

Page 27: Parallel Algorithms and Applications(2012)

An Example of an MRG with an 8th-order Full Primitive Polynomial

Consider p = 2^31 − 1 and k = 8:

f(x) = (x^8 − a_1 x^7 − a_2 x^6 − a_3 x^5 − a_4 x^4 − a_5 x^3 − a_6 x^2 − a_7 x − a_8) mod p

- Good lattice structure with full coefficients
- Simple and fast implementation with 64-bit arithmetic when the modulus is p = 2^31 − 1
- Long period: (2^31 − 1)^8 − 1 ≈ 4.5 × 10^74
- Fraction of primitive polynomials: φ(p^k − 1)/k/(p^k − 1) ≈ 2.2%
- Easy to extend for vector/parallel processing

Page 28: Parallel Algorithms and Applications(2012)

Vectorization and Parallelization of MRG - Decimating the Sequence -

Transfer matrix (the companion matrix of the recurrence) and state vectors:

    ( a1 a2 a3 a4 a5 a6 a7 a8 )        ( x_{n−1} )         ( x_n     )
    ( 1  0  0  0  0  0  0  0  )        ( x_{n−2} )         ( x_{n−1} )
    ( 0  1  0  0  0  0  0  0  )        ( x_{n−3} )         ( x_{n−2} )
A = ( 0  0  1  0  0  0  0  0  ) ,  x = ( x_{n−4} ) ,  x' = ( x_{n−3} )
    ( 0  0  0  1  0  0  0  0  )        ( x_{n−5} )         ( x_{n−4} )
    ( 0  0  0  0  1  0  0  0  )        ( x_{n−6} )         ( x_{n−5} )
    ( 0  0  0  0  0  1  0  0  )        ( x_{n−7} )         ( x_{n−6} )
    ( 0  0  0  0  0  0  1  0  )        ( x_{n−8} )         ( x_{n−7} )

Then x' = A x mod p.

Compute A^2, A^4, A^8, ... mod p once; the polynomial of the decimated sequence is obtained by multiplying these matrices.

Page 29: Parallel Algorithms and Applications(2012)

Vectorization and Parallelization of MRG (Continued)

To compute x_n = A^n x_0 mod p:

(1) Store A_j = A^{2^j} mod p (j = 0, 1, 2, 3, 4, ...).
(2) Represent n in binary form, e.g., (b_{m−1}, ..., b_2, b_1, b_0).
(3) Multiply I by A_j mod p whenever b_j = 1 (j = 0, ..., m−1), where I is the k-th order identity matrix.

Note: the same strategy works with polynomial arithmetic (Knuth).
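
Steps (1)-(3) amount to binary (square-and-multiply) exponentiation of the transfer matrix. A Python sketch (added; NumPy object arrays keep exact integers):

    import numpy as np

    P_MOD = 2**31 - 1   # the prime modulus from the slides

    def mat_pow_mod(A, n, p=P_MOD):
        """A^n mod p by square-and-multiply over the bits of n."""
        A = np.array(A, dtype=object) % p       # object dtype = exact big integers
        R = np.eye(A.shape[0], dtype=object)    # the k-th order identity matrix I
        while n:
            if n & 1:                           # bit b_j is 1: multiply by A^(2^j)
                R = (R @ A) % p
            A = (A @ A) % p                     # square to get the next stored power
            n >>= 1
        return R

    # Jump the generator ahead n steps in one shot: x_n = (A^n mod p) @ x_0 mod p.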

Page 30: Parallel Algorithms and Applications(2012)

SPMD Computational Model -Application of Data Parallel to MIMD Architecture-

• A "SINGLE" program runs on all processors.
• Multiple instruction streams exist at execution time.
• Each processor takes care of a portion of the large data set. Its behavior is similar to SIMD, but locally independent operations, say conditional branches, are allowed.

Hence the necessity for various synchronization mechanisms and for ways to describe them.

Page 31: Parallel Algorithms and Applications(2012)

MIMD Computational Model -Control Parallel or Task Parallel-

• Master-slave type operations
• Fork/Join constructs
• Synchronization mechanisms and their description (Barrier, Semaphore, Lock/Unlock)
• Data transfer in the case of distributed memory (Send/Receive protocol)

Examples: event-parallel transport Monte Carlo simulation, ray tracing

Page 32: Parallel Algorithms and Applications(2012)

Parallel Prefix

Source: L. Snyder, "Parallel Programming"
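
Snyder's figure is not reproduced in this transcript; as a stand-in, here is a work-efficient (Blelloch-style) exclusive prefix-sum sketch in Python (added), where each inner loop corresponds to one parallel step:

    def exclusive_scan(a):
        """Work-efficient (Blelloch) exclusive prefix sum; length must be a power of two."""
        a = list(a)
        n = len(a)
        d = 1
        while d < n:                        # up-sweep: build partial sums in a tree
            for i in range(2 * d - 1, n, 2 * d):
                a[i] += a[i - d]
            d *= 2
        a[n - 1] = 0                        # clear the root
        d = n // 2
        while d >= 1:                       # down-sweep: push prefixes back down
            for i in range(2 * d - 1, n, 2 * d):
                a[i - d], a[i] = a[i], a[i] + a[i - d]
            d //= 2
        return a

    print(exclusive_scan([1, 2, 3, 4]))     # [0, 1, 3, 6]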

Page 33: Parallel Algorithms and Applications(2012)

Three Models of Dense Matrix Multiplications

• Inner Product (C_ij = Σ_k A_ik B_kj)

  Do i = 1, n
    Do j = 1, n
      Do k = 1, n
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
      enddo
    enddo
  enddo

• Middle Product

  Do j = 1, n
    Do k = 1, n
      Do i = 1, n
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
      enddo
    enddo
  enddo

• Outer Product

  Do k = 1, n
    Do j = 1, n
      Do i = 1, n
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
      enddo
    enddo
  enddo

Page 34: Parallel Algorithms and Applications(2012)

Three Models of Dense Matrix Multiplications (1)

• Inner Product (C_ij = Σ_k A_ik B_kj)

  Do i = 1, n
    Do j = 1, n
      sum = 0
      Do k = 1, n
        sum = sum + A(i,k)*B(k,j)
      enddo
      C(i,j) = sum
    enddo
  enddo

[Figure: each element of C formed as the dot product of a row of A and a column of B]

Page 35: Parallel Algorithms and Applications(2012)

Three Models of Dense Matrix Multiplications (2)

• Middle Product (C_ij = Σ_k A_ik B_kj)

  Do j = 1, n
    Do k = 1, n
      Do i = 1, n
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
      enddo
    enddo
  enddo

[Figure: each column of C accumulated from scaled columns of A]

Page 36: Parallel Algorithms and Applications(2012)

Three Models of Dense Matrix Multiplications(3)

• Outer Product (C_ij = Σ_k A_ik B_kj)

  Do k = 1, n
    Do j = 1, n
      Do i = 1, n
        C(i,j) = C(i,j) + A(i,k)*B(k,j)
      enddo
    enddo
  enddo

[Figure: C accumulated as a sum of outer products of a column of A and a row of B]

Page 37: Parallel Algorithms and Applications(2012)

Strassen’s Algorithm for Matrix Multiply

• Multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing.

• The conventional algorithm for n × n matrices requires O(n^3) operations.

• Strassen's algorithm, introduced in 1969, requires at most O(n^{log2 7}) ≈ O(n^{2.807}) operations.

• It is another divide-and-conquer approach.

References:
1. V. Strassen, "Gaussian Elimination is Not Optimal," Numerische Mathematik, 13:354-356, 1969.
2. S. Huss-Lederman et al., "Implementation of Strassen's Algorithm for Matrix Multiplication," SC96 Technical Paper.

Page 38: Parallel Algorithms and Applications(2012)

Conventional Matrix Multiply

C = A * B, where A, B, and C are n × n matrices.

Number of scalar multiplications: n^3
Number of scalar additions: n^3 − n^2
Total arithmetic operations: 2n^3 − n^2

Page 39: Parallel Algorithms and Applications(2012)

Conventional Matrix Multiply (with submatrices)

( C11 C12 )   ( A11 A12 )   ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) * ( B21 B22 )

C11 = A11 B11 + A12 B21    C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21    C22 = A21 B12 + A22 B22

Number of matrix multiplications: 8
Number of matrix additions: 4
Total arithmetic operations: 8·(2(n/2)^3 − (n/2)^2) + 4·(n/2)^2 = 2n^3 − n^2

Page 40: Parallel Algorithms and Applications(2012)

Strassen’s Algorithm - 1/3

• Strassen's method uses fewer multiplications, offsetting them with more additions and subtractions.

• For each pair of sub-matrices there are 7 multiplications and 18 additions/subtractions.

• Of these, the 7 multiplications and 10 of the additions/subtractions occur in the steps that calculate the P terms.

( C11 C12 )   ( A11 A12 )   ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) * ( B21 B22 )

C11 = P1 + P4 − P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 + P3 − P2 + P6

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 − B22)
P4 = A22 (B21 − B11)
P5 = (A11 + A12) B22
P6 = (A21 − A11)(B11 + B12)
P7 = (A12 − A22)(B21 + B22)
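
These seven products drop straight into code. A recursive Python/NumPy sketch (added; it assumes n is a power of two and uses an arbitrary cutoff below which the conventional product takes over):

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Strassen multiply for n x n matrices, n a power of two."""
        n = A.shape[0]
        if n <= cutoff:                     # conventional product on small blocks
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        P1 = strassen(A11 + A22, B11 + B22, cutoff)
        P2 = strassen(A21 + A22, B11, cutoff)
        P3 = strassen(A11, B12 - B22, cutoff)
        P4 = strassen(A22, B21 - B11, cutoff)
        P5 = strassen(A11 + A12, B22, cutoff)
        P6 = strassen(A21 - A11, B11 + B12, cutoff)
        P7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = P1 + P4 - P5 + P7
        C[:h, h:] = P3 + P5
        C[h:, :h] = P2 + P4
        C[h:, h:] = P1 + P3 - P2 + P6
        return C

    # rng = np.random.default_rng(0); A = rng.random((256, 256)); B = rng.random((256, 256))
    # np.allclose(strassen(A, B), A @ B)  ->  True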

Page 41: Parallel Algorithms and Applications(2012)

Strassen’s Algorithm - 2/3

• On 2 × 2 block matrices (blocks of size n × n), the arithmetic operation counts are:

               Mult   Add   Complexity
Conventional    8      4    16n^3 − 4n^2
Strassen        7     18    14n^3 + 11n^2

• One matrix multiplication is replaced by 14 matrix additions.

Page 42: Parallel Algorithms and Applications(2012)

Strassen's Algorithm - 3/3

• When one level of Strassen's algorithm is applied to a 2 × 2 matrix of (n/2) × (n/2) blocks, with the conventional algorithm used for the seven block multiplications, the total operation count is

7·(2(n/2)^3 − (n/2)^2) + 18·(n/2)^2 = (7/4)n^3 + (11/4)n^2

R = Strassen operation count / Conventional operation count = (7n^3 + 11n^2) / (8n^3 − 4n^2)

lim (n→∞) R = 7/8: a 12.5% improvement!

Note: code can be accessed from http://www-unix.mcs.anl.gov/prism/lib/software.html

Page 43: Parallel Algorithms and Applications(2012)

Winograd’s Variant of Strassen’s Algorithm

• Starting from Strassen's algorithm, Winograd reduced the number of additions/subtractions by 3 (from 18 to 15) by rearranging the calculation into 4 stages:

Stage 1:  S1 = A21 + A22    S2 = S1 − A11    S3 = A11 − A21   S4 = A12 − S2
Stage 2:  T1 = B12 − B11    T2 = B22 − T1    T3 = B22 − B12   T4 = B21 − T2
Stage 3:  P1 = A11 B11      P2 = A12 B21     P3 = S1 T1       P4 = S2 T2
          P5 = S3 T3        P6 = S4 B22      P7 = A22 T4
Stage 4:  U1 = P1 + P2      U2 = P1 + P4     U3 = U2 + P5     U4 = U3 + P7
          U5 = U3 + P3      U6 = U2 + P3     U7 = U6 + P6

C11 = U1, C12 = U7, C21 = U4, C22 = U5

7 multiplications, 15 additions/subtractions.

(It takes 7^k multiplications and 5·(7^k − 4^k) additions to multiply 2^k × 2^k matrices.)

Page 44: Parallel Algorithms and Applications(2012)

Usage of Strassen’s Algorithm

• Usage of Strassen's algorithm is limited because:
  - Additional memory is required to store the P matrices.
  - More memory traffic is needed, so memory bandwidth plays a key role.

• Loss of significance in Strassen's algorithm:
  - Caused by adding relatively large and very small numbers.

Performance example (versus the Cray library MXM on a Cray-2):
n = 64 → 1.35×; n = 2048 → 2.01× (David Bailey, 1988)

Page 45: Parallel Algorithms and Applications(2012)

Fourier Transform and FFT

• Discrete Fourier Transform (DFT):

Z_i = (1/n) Σ_k ω^{ik} X_k, where ω = exp(−2πj/n)

The FFT reduces the operation count from O(n^2) to O(n log n).

Page 46: Parallel Algorithms and Applications(2012)

Fourier Transform and FFT

• Butterfly operation (DIT:Decimation in Time)

Z = X + α_n^k · Y
W = X − α_n^k · Y

where α_n = exp(2πj/n)

[Figure: butterfly diagram taking inputs X and Y to outputs Z and W via +, −, and multiplication by α_n^k]
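
Applying the DIT butterfly recursively gives the classic radix-2 FFT. A Python sketch (added; the twiddle sign follows the slide's DFT convention ω = exp(−2πj/n), and the length must be a power of two):

    import cmath

    def fft(x):
        """Radix-2 decimation-in-time FFT (recursive form)."""
        n = len(x)
        if n == 1:
            return list(x)
        even = fft(x[0::2])                    # DFT of the even-indexed samples
        odd = fft(x[1::2])                     # DFT of the odd-indexed samples
        out = [0j] * n
        for k in range(n // 2):
            t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
            out[k] = even[k] + t               # Z = X + alpha^k * Y
            out[k + n // 2] = even[k] - t      # W = X - alpha^k * Y
        return out

    print([round(abs(v), 6) for v in fft([1, 1, 1, 1])])   # [4.0, 0.0, 0.0, 0.0]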

Page 47: Parallel Algorithms and Applications(2012)

Fourier Transform and FFT

• Butterfly operation (DIF: Decimation in Frequency)

Z = X + Y
W = (X − Y) · α_n^k

where α_n = exp(2πj/n)

[Figure: butterfly diagram taking inputs X and Y to outputs Z and W via +, −, and multiplication by α_n^k]

Page 48: Parallel Algorithms and Applications(2012)

Number of Operations for the Complex Fourier Transform (N: power of 2)

FFT, per butterfly: add/sub = 6, mult = 4
N points: total ops = 5·N·log2 N

DFT (complex matrix-vector multiplication), per point: add/sub = 4N − 2, mult = 4N
N points: total ops = 8N^2 − 2N = 2N(4N − 1)

If the efficiency of the FFT is 1%, a 512-point DFT is faster than the FFT.
If the efficiency of the FFT is 3%, a 128-point DFT is faster than the FFT.
If the efficiency of the FFT is 5%, a 64-point DFT is faster than the FFT.
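
The break-even claims follow directly from the two counts. A short Python check (added):

    import math

    def fft_ops(n):
        return 5 * n * math.log2(n)      # from the butterfly counts above

    def dft_ops(n):
        return 8 * n * n - 2 * n         # complex matrix-vector product

    for n in (64, 128, 512):
        # ratio = how much less efficiently the FFT may run and still break even
        print(n, round(dft_ops(n) / fft_ops(n), 1))
    # prints 64 17.0, 128 29.2, 512 91.0 -- i.e. roughly the 5%, 3%, 1%
    # efficiency thresholds quoted above (1/17 ~ 5.9%, 1/29 ~ 3.4%, 1/91 ~ 1.1%).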

Page 49: Parallel Algorithms and Applications(2012)

Fast Fourier Transform

Page 50: Parallel Algorithms and Applications(2012)

FFT (Isogeometric)

Page 51: Parallel Algorithms and Applications(2012)

FFT (Self-sorting variant due to Stockham)

Page 52: Parallel Algorithms and Applications(2012)

Applications

Page 53: Parallel Algorithms and Applications(2012)

Concept of Numerical Weather Prediction (Richardson, 1922)

Page 54: Parallel Algorithms and Applications(2012)
Page 55: Parallel Algorithms and Applications(2012)

Simulation of Precipitation

Page 56: Parallel Algorithms and Applications(2012)
Page 57: Parallel Algorithms and Applications(2012)

Simulation of typhoon

Source: Earth Simulator Center

Page 58: Parallel Algorithms and Applications(2012)

Simulation of Tsunami (March 11, 2011)

Source: Prof. Imamura, Tohoku Univ.

Page 59: Parallel Algorithms and Applications(2012)
Page 60: Parallel Algorithms and Applications(2012)
Page 61: Parallel Algorithms and Applications(2012)
Page 62: Parallel Algorithms and Applications(2012)

Seismic Data Processing for Oil Exploration

Page 63: Parallel Algorithms and Applications(2012)
Page 64: Parallel Algorithms and Applications(2012)
Page 65: Parallel Algorithms and Applications(2012)

Application Areas for Petaflops

• Computational testing and simulation as a replacement for weapons testing (stockpile stewardship)

• Simulation of plasma fusion devices and basic physics for controlled fusion (to optimize design of future reactors)

• Design of new chemical compounds and synthesis pathways (environmental safety and cost improvements)

• Comprehensive modeling of groundwater and oil reservoirs (contamination and management)

• Modeling of complex transportation, communication and economic systems

• Time-dependent simulations of complex biomolecules (membranes, synthesis machinery and DNA)

• Multidisciplinary optimization problems combining structures, fluids and geometry

• Modeling of integrated earth systems (ocean, atmosphere, bio-geosphere)

• Improved 4D/6D data assimilation capability applied to remote sensing and environmental models

• Computational cosmology (integration of particle models, astrophysical fluids and radiation transport)

• Materials simulations that bridge the gap between microscale and macroscale (bulk materials)

• Coupled electro-mechanical simulations of nano-scale structures (dynamics and mechanics of micromachines)

• Full plant optimization for complex processes (chemical, manufacturing and assembly problems)

• High-resolution reacting flow problems (combustion, chemical mixing and multiphase flow)

• High-realism immersive virtual reality based on real-time radiosity modeling and complex scenes
