Parallel Programming in C with MPI and OpenMP

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Michael J. Quinn


Chapter 12

Solving Linear Systems


Outline

Terminology
Back substitution
Gaussian elimination
Jacobi method
Conjugate gradient method


Terminology

System of linear equations
  Solve Ax = b for x
Special matrices
  Symmetrically banded
  Upper triangular
  Lower triangular
  Diagonally dominant
  Symmetric


Symmetrically Banded

4 2 -1 0 0 0

3 -4 5 6 0 0

1 6 3 2 4 0

0 2 -2 0 9 2

0 0 7 3 8 7

0 0 0 4 0 2

Semibandwidth 2


Upper Triangular

4 2 -1 5 9 2

0 -4 5 6 0 -4

0 0 3 2 4 6

0 0 0 0 9 2

0 0 0 0 8 7

0 0 0 0 0 2


Lower Triangular

4 0 0 0 0 0

0 0 0 0 0 0

5 4 3 0 0 0

2 6 2 3 0 0

8 -2 0 1 8 0

-3 5 7 9 5 2


Diagonally Dominant

19 0 2 2 0 6

0 -15 2 0 -3 0

5 4 22 -1 0 4

2 3 2 13 0 -5

5 -2 0 1 16 0

-3 5 5 3 5 -32


Symmetric

3 0 5 2 0 6

0 7 4 3 -3 5

5 4 0 -1 0 4

2 3 -1 9 0 -5

0 -3 0 0 5 5

6 5 4 -5 5 -3
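The properties illustrated by these example matrices can be written as simple predicates. The sketch below assumes the usual definitions: strict diagonal dominance (|a[i][i]| greater than the sum of the other magnitudes in row i), and semibandwidth w meaning a[i][j] = 0 whenever |i - j| > w. The function names and the row-major storage are choices made for this illustration only.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* All matrices are n x n, stored row-major: element (i, j) is a[i*n + j]. */

bool is_upper_triangular(size_t n, const double *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < i; j++)          /* entries below the diagonal */
            if (a[i * n + j] != 0.0) return false;
    return true;
}

bool is_lower_triangular(size_t n, const double *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)      /* entries above the diagonal */
            if (a[i * n + j] != 0.0) return false;
    return true;
}

bool is_symmetric(size_t n, const double *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (a[i * n + j] != a[j * n + i]) return false;
    return true;
}

/* Assumed definition: |a[i][i]| > sum of |a[i][j]| over j != i, for every row. */
bool is_diagonally_dominant(size_t n, const double *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++) {
        double off = 0.0;
        for (size_t j = 0; j < n; j++)
            if (j != i) off += fabs(a[i * n + j]);
        if (fabs(a[i * n + i]) <= off) return false;
    }
    return true;
}

/* Assumed definition: a[i][j] == 0 whenever |i - j| > w. */
bool is_banded(size_t n, const double *a, size_t w)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            size_t dist = (i > j) ? i - j : j - i;
            if (dist > w && a[i * n + j] != 0.0) return false;
        }
    return true;
}
```

Under these definitions the symmetrically banded example above passes is_banded with w = 2 but not with w = 1, and the diagonally dominant example passes is_diagonally_dominant.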


Back Substitution

Used to solve upper triangular system Tx = b for x
Methodology: one element of x can be immediately computed
Use this value to simplify system, revealing another element that can be immediately computed
Repeat


Back Substitution

1x0 + 1x1 - 1x2 + 4x3 = 8
    - 2x1 - 3x2 + 1x3 = 5
            2x2 - 3x3 = 0
                  2x3 = 4


Back Substitution

1x0 + 1x1 - 1x2 + 4x3 = 8
    - 2x1 - 3x2 + 1x3 = 5
            2x2 - 3x3 = 0
                  2x3 = 4        x3 = 2


Back Substitution

1x0 + 1x1 - 1x2       = 0
    - 2x1 - 3x2       = 3
            2x2       = 6
                  2x3 = 4


Back Substitution

1x0 + 1x1 - 1x2       = 0
    - 2x1 - 3x2       = 3
            2x2       = 6        x2 = 3
                  2x3 = 4


Back Substitution

1x0 + 1x1             = 3
    - 2x1             = 12
            2x2       = 6
                  2x3 = 4


Back Substitution

1x0 + 1x1             = 3
    - 2x1             = 12       x1 = -6
            2x2       = 6
                  2x3 = 4


Back Substitution

1x0                   = 9
    - 2x1             = 12
            2x2       = 6
                  2x3 = 4


Back Substitution

1x0                   = 9        x0 = 9
    - 2x1             = 12
            2x2       = 6
                  2x3 = 4


Pseudocode

for i ← n − 1 down to 1 do
    x[i] ← b[i] / a[i, i]
    for j ← 0 to i − 1 do
        b[j] ← b[j] − x[i] × a[j, i]
    endfor
endfor
x[0] ← b[0] / a[0, 0]

Time complexity: Θ(n^2)
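A minimal sequential sketch of back substitution in C follows. The function name and the flat row-major storage are assumptions made for illustration. The outer loop here runs down to i = 0, so x[0] is computed by the same loop body (the inner loop is simply empty on that last pass).

```c
#include <assert.h>
#include <stddef.h>

/* Solve the n x n upper triangular system a x = b by back substitution.
   a is stored row-major (element (i, j) at a[i*n + j]).
   b is overwritten while the system is simplified. */
void back_substitute(size_t n, const double *a, double *b, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        assert(a[i * n + i] != 0.0);      /* diagonal must be nonzero */
        x[i] = b[i] / a[i * n + i];       /* x[i] can be computed immediately */
        for (size_t j = 0; j < i; j++)    /* remove x[i] from the rows above */
            b[j] -= x[i] * a[j * n + i];
    }
}
```

For the 4 x 4 system worked through on the preceding slides (b = 8, 5, 0, 4), this yields x = (9, -6, 3, 2).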


Data Dependence Diagram

We cannot execute the outer loop in parallel.
We can execute the inner loop in parallel.


Row-oriented Algorithm

Associate primitive task with each row of A and corresponding elements of x and b
During iteration i task associated with row j computes new value of b_j
Task i must compute x_i and broadcast its value
Agglomerate using rowwise interleaved striped decomposition


Interleaved Decompositions

Rowwise interleaved striped decomposition

Columnwise interleaved striped decomposition
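In an interleaved striped decomposition, rows (or columns) are dealt out to processes round-robin. The index arithmetic can be sketched as below; the helper names are hypothetical, and id denotes a process rank in 0..p-1.

```c
#include <assert.h>

/* Rank of the process that owns global row (or column) i
   when indices are dealt out round-robin among p processes. */
int row_owner(int i, int p)
{
    assert(p > 0 && i >= 0);
    return i % p;
}

/* Number of the n rows held by process id: rows id, id+p, id+2p, ... */
int rows_owned(int n, int p, int id)
{
    assert(p > 0 && n >= 0 && id >= 0 && id < p);
    return (n - id + p - 1) / p;
}

/* Global index of the k-th local row (k = 0, 1, ...) of process id. */
int local_to_global(int k, int p, int id)
{
    assert(p > 0 && k >= 0 && id >= 0 && id < p);
    return id + k * p;
}
```

With n = 10 and p = 4, process 0 holds rows 0, 4, 8 and process 3 holds rows 3, 7, so every process keeps some of the low-numbered and some of the high-numbered rows.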


Complexity Analysis

On average, each process performs about n / (2p) iterations of loop j per iteration of the outer loop
A total of n - 1 outer-loop iterations in all
Computational complexity: Θ(n^2/p)
One broadcast per iteration
Communication complexity: Θ(n log p)


Column-oriented Algorithm

Associate one primitive task per column of A and associated element of x
Last task starts with vector b
During iteration i task i computes x_i, updates b, and sends b to task i - 1
In other words, no computational concurrency
Agglomerate tasks in interleaved fashion


Complexity Analysis

Since b always updated by a single process, computational complexity same as sequential algorithm: Θ(n^2)
Since elements of b passed from one process to another each iteration, communication complexity is Θ(n^2)


Comparison

[Figure: plot with axes n and p divided into two regions. "Column-oriented algorithm superior" where message-passing time dominates; "Row-oriented algorithm superior" where computation time dominates.]


Gaussian Elimination

Used to solve Ax = b when A is dense
Reduces Ax = b to upper triangular system Tx = c
Back substitution can then solve Tx = c for x



 4x0 +  6x1 + 2x2 - 2x3 = 8
 2x0        + 5x2 - 2x3 = 4
-4x0 -  3x1 - 5x2 + 4x3 = 1
 8x0 + 18x1 - 2x2 + 3x3 = 40



 4x0 +  6x1 + 2x2 - 2x3 = 8
     -  3x1 + 4x2 - 1x3 = 0
     +  3x1 - 3x2 + 2x3 = 9
     +  6x1 - 6x2 + 7x3 = 24



 4x0 +  6x1 + 2x2 - 2x3 = 8
     -  3x1 + 4x2 - 1x3 = 0
              1x2 + 1x3 = 9
              2x2 + 5x3 = 24



 4x0 +  6x1 + 2x2 - 2x3 = 8
     -  3x1 + 4x2 - 1x3 = 0
              1x2 + 1x3 = 9
                    3x3 = 6


Iteration of Gaussian Elimination

[Figure: matrix during iteration i, with row i and column i marked. Regions are labeled "Elements that will not be changed", "Elements that will be changed", "Pivot row", and "Elements already driven to 0".]


Numerical Stability Issues

If pivot element close to zero, significant roundoff errors can result
Gaussian elimination with partial pivoting largely avoids this problem
In step i we search rows i through n - 1 for the row whose column i element has the largest absolute value
Swap (pivot) this row with row i
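The pivoting steps above can be sketched sequentially in C as follows. This is an illustrative sketch, not the parallel algorithm: the function name, the row-major storage, and the explicit row swap (rather than tracking the pivot order indirectly) are assumptions. The final loop is the back substitution described earlier in the chapter.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Solve the dense n x n system a x = b by Gaussian elimination with
   partial pivoting followed by back substitution.
   a is row-major; a and b are overwritten.
   Returns 0 on success, -1 if every candidate pivot in some column is zero. */
int gauss_solve(size_t n, double *a, double *b, double *x)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* Search rows i..n-1 for the largest |a[row][i]|. */
        size_t picked = i;
        for (size_t r = i + 1; r < n; r++)
            if (fabs(a[r * n + i]) > fabs(a[picked * n + i]))
                picked = r;
        if (a[picked * n + i] == 0.0)
            return -1;                      /* matrix is singular */

        /* Swap (pivot) the picked row with row i. */
        if (picked != i) {
            for (size_t k = 0; k < n; k++) {
                double t = a[i * n + k];
                a[i * n + k] = a[picked * n + k];
                a[picked * n + k] = t;
            }
            double t = b[i];
            b[i] = b[picked];
            b[picked] = t;
        }

        /* Drive column i to zero in every row below the pivot row. */
        for (size_t j = i + 1; j < n; j++) {
            double m = a[j * n + i] / a[i * n + i];
            for (size_t k = i; k < n; k++)
                a[j * n + k] -= m * a[i * n + k];
            b[j] -= m * b[i];
        }
    }

    /* Back substitution on the resulting upper triangular system. */
    for (size_t i = n; i-- > 0; ) {
        x[i] = b[i] / a[i * n + i];
        for (size_t j = 0; j < i; j++)
            b[j] -= x[i] * a[j * n + i];
    }
    return 0;
}
```

For the four-equation example used earlier in this chapter, the solution is x0 = -13.5, x1 = 26/3, x2 = 7, x3 = 2; pivoting changes the order in which rows are used but not the solution.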


Implementing Partial Pivoting

[Figure: two panels, "Without partial pivoting" and "With partial pivoting"]


Row-oriented Parallel Algorithm

Associate primitive task with each row of A and corresponding elements of x and b
A kind of reduction needed to find the identity of the pivot row
Tournament: want to determine identity of row with largest value, rather than largest value itself
Could be done with two all-reductions
MPI provides a simpler, faster mechanism


MPI_MAXLOC, MPI_MINLOC

MPI provides reduction operators MPI_MAXLOC, MPI_MINLOC
Provide datatype representing a (value, index) pair


MPI (value, index) Datatypes

MPI_Datatype           Meaning
MPI_2INT               Two ints
MPI_DOUBLE_INT         A double followed by an int
MPI_FLOAT_INT          A float followed by an int
MPI_LONG_INT           A long followed by an int
MPI_LONG_DOUBLE_INT    A long double followed by an int
MPI_SHORT_INT          A short followed by an int


Example Use of MPI_MAXLOC

struct {
    double value;
    int    index;
} local, global;
...
local.value = fabs(a[j][i]);
local.index = j;
...
MPI_Allreduce (&local, &global, 1, MPI_DOUBLE_INT,
               MPI_MAXLOC, MPI_COMM_WORLD);
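To make the result of the MPI_MAXLOC reduction concrete, here is a serial stand-in that needs no MPI installation. It is only an emulation written for illustration: the names double_int, maxloc, and find_pivot are hypothetical. The combining rule follows the MPI definition of MPI_MAXLOC: the larger value wins, and on equal values the smaller index is kept.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Same member order as the pair passed with MPI_DOUBLE_INT. */
struct double_int {
    double value;
    int    index;
};

/* Combine two (value, index) pairs as MPI_MAXLOC does:
   keep the larger value; on a tie keep the smaller index. */
struct double_int maxloc(struct double_int x, struct double_int y)
{
    if (x.value > y.value) return x;
    if (y.value > x.value) return y;
    return (x.index < y.index) ? x : y;
}

/* Serial stand-in for the all-reduce in iteration i: among rows i..n-1 of
   the row-major n x n matrix a, find the row whose column-i entry has the
   largest magnitude. Each row plays the role of one process's "local". */
struct double_int find_pivot(size_t n, const double *a, size_t i)
{
    assert(i < n);
    struct double_int best = { fabs(a[i * n + i]), (int)i };
    for (size_t j = i + 1; j < n; j++) {
        struct double_int local = { fabs(a[j * n + i]), (int)j };
        best = maxloc(best, local);
    }
    return best;
}
```

After the real MPI_Allreduce, every process holds the same pair in global, so each one knows both the pivot magnitude and the row it came from.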


Second Communication per Iteration

[Figure: matrix with rows j and picked and columns i and k marked, highlighting elements a[j][i], a[j][k], a[picked][i], and a[picked][k]]


Communication Complexity

Complexity of tournament: Θ(log p)
Complexity of broadcasting pivot row: Θ(n log p)
A total of n - 1 iterations
Overall communication complexity: Θ(n^2 log p)


Isoefficiency Analysis

Communication overhead: Θ(n^2 p log p)
Sequential algorithm has time complexity Θ(n^3)
Isoefficiency relation:

  $n^3 \ge C n^2 p \log p \;\Rightarrow\; n \ge C p \log p$

Scalability function:

  $M(C p \log p) / p = C^2 p^2 \log^2 p / p = C^2 p \log^2 p$

This system has poor scalability


Column-oriented Algorithm

Associate a primitive task with each column of A and another primitive task for b
During iteration i task controlling column i determines pivot row and broadcasts its identity
During iteration i task controlling column i must also broadcast column i to other tasks
Agglomerate tasks in an interleaved fashion to balance workloads
Isoefficiency same as row-oriented algorithm


Comparison of Two Algorithms

Both algorithms evenly divide workload
Both algorithms do a broadcast each iteration
Difference: identification of pivot row
  Row-oriented algorithm does search in parallel but requires all-reduce step
  Column-oriented algorithm does search sequentially but requires no communication
Row-oriented superior when n relatively larger and p relatively smaller


Problems with These Algorithms

They break parallel execution into computation and communication phases
Processes not performing computations during the broadcast steps
Time spent doing broadcasts is large enough to ensure poor scalability


Pipelined, Row-Oriented Algorithm

Want to overlap communication time with computation time
We could do this if we knew in advance the row used to reduce all the other rows.
Let’s pivot columns instead of rows!
In iteration i we can use row i to reduce the other rows.


Communication Pattern

[Figure sequence: four processes, numbered 0 through 3. Row 0 is passed from one process to the next, and each process that has received it is labeled "Reducing Using Row 0". Row 1 then moves through the processes in the same way, with some processes already "Reducing Using Row 1" while others are still reducing using row 0.]

Analysis (1/2)

Total computation time: Θ(n^3/p)
Total message transmission time: Θ(n^2)
When n large enough, message transmission time completely overlapped by computation time
Message start-up not overlapped: Θ(n)
Parallel overhead: Θ(np)


Analysis (2/2)

Isoefficiency relation:

  $n^3 \ge C n p \;\Rightarrow\; n \ge \sqrt{Cp}$

Scalability function:

  $M(\sqrt{Cp}) / p = Cp / p = C$

Parallel system is perfectly scalable


Sparse Systems

Gaussian elimination not well-suited for sparse systems
Coefficient matrix gradually fills with nonzero elements
Result
  Increases storage requirements
  Increases total operation count


Example of “Fill”


Iterative Methods

Iterative method: algorithm that generates a series of approximations to solution’s value
Require less storage than direct methods
Since they avoid computations on zero elements, they can save a lot of computations


Jacobi Method

$x_i^{k+1} = \frac{1}{a_{i,i}} \left( b_i - \sum_{j \ne i} a_{i,j} \, x_j^k \right)$

Values of elements of vector x at iteration k+1 depend upon values of vector x at iteration k

Gauss-Seidel method: Use latest version available of x_i
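A sequential sketch of the Jacobi update in C is given below. The function names, the dense row-major storage, and the stopping rule (largest component change at most tol, or max_iter sweeps) are assumptions for illustration; the slides do not specify a convergence test.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* One Jacobi sweep:
   x_new[i] = (b[i] - sum over j != i of a[i][j] * x_old[j]) / a[i][i].
   Only x_old is read, so the n updates are independent of one another.
   Returns the largest change in any component. */
double jacobi_sweep(size_t n, const double *a, const double *b,
                    const double *x_old, double *x_new)
{
    double max_change = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(a[i * n + i] != 0.0);
        double sum = b[i];
        for (size_t j = 0; j < n; j++)
            if (j != i)
                sum -= a[i * n + j] * x_old[j];
        x_new[i] = sum / a[i * n + i];
        double change = fabs(x_new[i] - x_old[i]);
        if (change > max_change)
            max_change = change;
    }
    return max_change;
}

/* Repeat sweeps until no component changes by more than tol, or until
   max_iter sweeps have run. x holds the initial guess on entry and the
   approximation on return; work is scratch space of length n.
   Returns the number of sweeps performed. */
int jacobi_solve(size_t n, const double *a, const double *b,
                 double *x, double *work, double tol, int max_iter)
{
    int k = 0;
    while (k < max_iter) {
        double change = jacobi_sweep(n, a, b, x, work);
        for (size_t i = 0; i < n; i++)
            x[i] = work[i];
        k++;
        if (change <= tol)
            break;
    }
    return k;
}
```

A Gauss-Seidel variant would instead write each new component straight back into x, so later components in the same sweep see the latest values.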


Jacobi Method Iterations

[Figure: plot with both axes running from 1 to 4, showing the successive iterates x0, x1, x2, x3, x4]


Rate of Convergence

Even when Jacobi method and Gauss-Seidel methods converge on solution, rate of convergence often too slow to make them practical
We will move on to an iterative method with much faster convergence


Conjugate Gradient Method

A is positive definite if for every nonzero vector x and its transpose x^T, the product x^T A x > 0

If A is symmetric and positive definite, then the function

  $q(x) = \frac{1}{2} x^T A x - x^T b + c$

has a unique minimizer that is the solution to Ax = b

Conjugate gradient is an iterative method that solves Ax = b by minimizing q(x)
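The slides do not list the steps of the method, so the sketch below uses the standard textbook formulation of conjugate gradient for a dense, symmetric positive definite matrix. The function names, the caller-supplied scratch vectors, and the stopping rule (squared residual norm at most tol2, or n iterations) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Inner product of two length-n vectors. */
static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += u[i] * v[i];
    return s;
}

/* out = a * v for a dense row-major n x n matrix. */
static void matvec(size_t n, const double *a, const double *v, double *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            out[i] += a[i * n + j] * v[j];
    }
}

/* Conjugate gradient for symmetric positive definite a.
   x: initial guess on entry, approximate solution on return.
   r, d, q: scratch vectors of length n (residual, direction, a*d).
   Stops when r.r <= tol2 or after n iterations; returns the iteration count. */
int conjugate_gradient(size_t n, const double *a, const double *b, double *x,
                       double *r, double *d, double *q, double tol2)
{
    matvec(n, a, x, q);
    for (size_t i = 0; i < n; i++) {
        r[i] = b[i] - q[i];               /* initial residual */
        d[i] = r[i];                      /* first search direction */
    }
    double rr = dot(n, r, r);
    int it = 0;
    while (it < (int)n && rr > tol2) {
        matvec(n, a, d, q);
        double dq = dot(n, d, q);
        assert(dq > 0.0);                 /* holds when a is positive definite */
        double alpha = rr / dq;           /* step length along d */
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * d[i];
            r[i] -= alpha * q[i];
        }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++)    /* next conjugate direction */
            d[i] = r[i] + beta * d[i];
        rr = rr_new;
        it++;
    }
    return it;
}
```

Each iteration costs one matrix-vector product and two inner products, which is why those two operations dominate any parallel design of the method.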


Conjugate Gradient Convergence

[Figure: plot with both axes running from 1 to 4, showing the iterates x0, x1, x2]

Finds value of n-dimensional solution in at most n iterations


Conjugate Gradient Computations

Matrix-vector multiplication
Inner product (dot product)
Matrix-vector multiplication has higher time complexity
Must modify previously developed algorithm to account for sparse matrices
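One way to account for sparsity, sketched here for a symmetrically banded matrix, is to visit only the entries inside the band. The function name is hypothetical, and the matrix is kept in ordinary dense row-major storage for simplicity; a production code would more likely store only the band. The sketch assumes a[i][j] = 0 whenever |i - j| > w.

```c
#include <assert.h>
#include <stddef.h>

/* y = a * x for an n x n matrix with semibandwidth w.
   Only columns i-w .. i+w of row i are read, so the cost is
   roughly n * (2w + 1) multiply-adds instead of n * n. */
void banded_matvec(size_t n, size_t w, const double *a,
                   const double *x, double *y)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        size_t lo = (i > w) ? i - w : 0;                 /* first column in band */
        size_t hi = (i + w < n - 1) ? i + w : n - 1;     /* last column in band */
        double sum = 0.0;
        for (size_t j = lo; j <= hi; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

Because row i touches only x[i-w] through x[i+w], a process holding a block of rows needs only a correspondingly narrow slice of the vector.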


Rowwise Block Striped Decomposition of a Symmetrically Banded Matrix

[Figure: two panels, "Matrix" and "Decomposition"]


Representation of Vectors

Replicate vectors
  Need all-gather step after matrix-vector multiply
  Inner product has time complexity Θ(n)
Block decomposition of vectors
  Need all-gather step before matrix-vector multiply
  Inner product has time complexity Θ(n/p + log p)


Comparison of Vector Decompositions

[Figure: plot with axes n and p divided into two regions, labeled "Replicated Vectors Superior" and "Block Decomposition Superior"]


Summary (1/2)

Solving systems of linear equations
  Direct methods
  Iterative methods
Parallel designs for
  Back substitution
  Gaussian elimination
  Conjugate gradient method


Summary (2/2)

Superiority of one algorithm over another depends on size of problem, number of processors, characteristics of parallel computer
Overlapping communications with computations can be key to scalability
