Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn

Chapter 12
Solving Linear Systems

Outline
Terminology
Back substitution
Gaussian elimination
Jacobi method
Conjugate gradient method

Terminology
System of linear equations
  Solve Ax = b for x
Special matrices
  Symmetrically banded
  Upper triangular
  Lower triangular
  Diagonally dominant
  Symmetric

Symmetrically Banded
4 2 -1 0 0 0
3 -4 5 6 0 0
1 6 3 2 4 0
0 2 -2 0 9 2
0 0 7 3 8 7
0 0 0 4 0 2
Semibandwidth 2

Upper Triangular
4 2 -1 5 9 2
0 -4 5 6 0 -4
0 0 3 2 4 6
0 0 0 0 9 2
0 0 0 0 8 7
0 0 0 0 0 2

Lower Triangular
4 0 0 0 0 0
0 0 0 0 0 0
5 4 3 0 0 0
2 6 2 3 0 0
8 -2 0 1 8 0
-3 5 7 9 5 2

Diagonally Dominant
19 0 2 2 0 6
0 -15 2 0 -3 0
5 4 22 -1 0 4
2 3 2 13 0 -5
5 -2 0 1 16 0
-3 5 5 3 5 -32

Symmetric
3 0 5 2 0 6
0 7 4 3 -3 5
5 4 0 -1 0 4
2 3 -1 9 0 -5
0 -3 0 0 5 5
6 5 4 -5 5 -3

Back Substitution
Used to solve upper triangular system Tx = b for x
Methodology: one element of x can be immediately computed
Use this value to simplify system, revealing another element that can be immediately computed
Repeat

Back Substitution
1x0 + 1x1 - 1x2 + 4x3 = 8
    - 2x1 - 3x2 + 1x3 = 5
            2x2 - 3x3 = 0
                  2x3 = 4
x3 = 2

Back Substitution
1x0 + 1x1 - 1x2       = 0
    - 2x1 - 3x2       = 3
            2x2       = 6
                  2x3 = 4
x2 = 3

Back Substitution
1x0 + 1x1             = 3
    - 2x1             = 12
            2x2       = 6
                  2x3 = 4
x1 = -6

Back Substitution
1x0                   = 9
    - 2x1             = 12
            2x2       = 6
                  2x3 = 4
x0 = 9

Pseudocode
for i ← n − 1 down to 0 do
  x[i] ← b[i] / a[i, i]
  for j ← 0 to i − 1 do
    b[j] ← b[j] − x[i] × a[j, i]
  endfor
endfor
Time complexity: $\Theta(n^2)$
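The pseudocode translates almost line for line into C. The following is a minimal sequential sketch, not the book's listing: the function name `back_substitute` and the flat row-major storage of the matrix are assumptions made for illustration.

```c
#include <stddef.h>

/* Solve the upper triangular system a x = b by back substitution.
 * Assumptions of this sketch: a is n-by-n, stored row-major in a flat
 * array, with nonzero diagonal entries. b is overwritten as the system
 * is simplified. */
void back_substitute(size_t n, const double *a, double *b, double *x)
{
    for (size_t i = n; i-- > 0; ) {          /* i = n-1 down to 0 */
        x[i] = b[i] / a[i * n + i];
        for (size_t j = 0; j < i; j++)       /* remove x[i] from rows above */
            b[j] -= x[i] * a[j * n + i];
    }
}
```

Applied to the 4 x 4 system worked through on the preceding slides, this yields x0 = 9, x1 = -6, x2 = 3, x3 = 2.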

Data Dependence Diagram
We cannot execute the outer loop in parallel.
We can execute the inner loop in parallel.

Row-oriented Algorithm
Associate primitive task with each row of A and corresponding elements of x and b
During iteration i task associated with row j computes new value of b_j
Task i must compute x_i and broadcast its value
Agglomerate using rowwise interleaved striped decomposition

Interleaved Decompositions
Rowwise interleaved striped decomposition
Columnwise interleaved striped decomposition
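In an interleaved striped decomposition, rows (or columns) are dealt out to processes round-robin, so row i belongs to process i mod p. The index arithmetic can be sketched as below; the helper names are hypothetical and are not taken from the book or from MPI.

```c
/* Rank of the process that owns row (or column) i when rows are dealt
 * out round-robin to p processes. */
int interleaved_owner(int i, int p)
{
    return i % p;
}

/* Number of the n rows that process id owns: rows id, id+p, id+2p, ... */
int interleaved_count(int id, int p, int n)
{
    return (id < n) ? (n - 1 - id) / p + 1 : 0;
}

/* Position of global row i inside its owner's local storage. */
int interleaved_local_index(int i, int p)
{
    return i / p;
}
```

With n = 10 and p = 4, process 0 holds rows 0, 4, 8 and process 1 holds rows 1, 5, 9, so each process keeps some unfinished rows for most of the computation.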

Complexity Analysis
Each process performs about n / (2p) iterations of loop j per iteration of loop i, on average
A total of n - 1 iterations of loop i require updates to b
Computational complexity: $\Theta(n^2/p)$
One broadcast per iteration
Communication complexity: $\Theta(n \log p)$

Column-oriented Algorithm
Associate one primitive task per column of A and associated element of x
Last task starts with vector b
During iteration i task i computes x_i, updates b, and sends b to task i - 1
In other words, no computational concurrency
Agglomerate tasks in interleaved fashion

Complexity Analysis
Since b always updated by a single process, computational complexity same as sequential algorithm: $\Theta(n^2)$
Since elements of b passed from one process to another each iteration, communication complexity is $\Theta(n^2)$

Comparison
[Figure: plot with axes p and n, divided into two regions. Where message-passing time dominates, the column-oriented algorithm is superior; where computation time dominates, the row-oriented algorithm is superior.]

Gaussian Elimination
Used to solve Ax = b when A is dense
Reduces Ax = b to upper triangular system Tx = c
Back substitution can then solve Tx = c for x

Gaussian Elimination
 4x0 +  6x1 + 2x2 - 2x3 = 8
 2x0        + 5x2 - 2x3 = 4
-4x0 -  3x1 - 5x2 + 4x3 = 1
 8x0 + 18x1 - 2x2 + 3x3 = 40

Gaussian Elimination
 4x0 +  6x1 + 2x2 - 2x3 = 8
     -  3x1 + 4x2 - 1x3 = 0
     +  3x1 - 3x2 + 2x3 = 9
     +  6x1 - 6x2 + 7x3 = 24

Gaussian Elimination
 4x0 +  6x1 + 2x2 - 2x3 = 8
     -  3x1 + 4x2 - 1x3 = 0
              1x2 + 1x3 = 9
              2x2 + 5x3 = 24

Gaussian Elimination
 4x0 +  6x1 + 2x2 - 2x3 = 8
     -  3x1 + 4x2 - 1x3 = 0
              1x2 + 1x3 = 9
                    3x3 = 6
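The elimination steps shown on these slides can be sketched as a sequential C routine. This is an illustrative version under stated assumptions, not the book's code: the name `forward_eliminate` and the flat row-major layout are hypothetical, and the routine does no pivoting, so it assumes every pivot a[i][i] is nonzero when it is reached.

```c
#include <stddef.h>

/* Reduce the dense system a x = b to upper triangular form in place.
 * Assumptions of this sketch: a is n-by-n, row-major, and no pivot
 * a[i][i] is zero (no pivoting is performed). */
void forward_eliminate(size_t n, double *a, double *b)
{
    for (size_t i = 0; i + 1 < n; i++) {            /* pivot row i */
        for (size_t j = i + 1; j < n; j++) {        /* rows below the pivot */
            double factor = a[j * n + i] / a[i * n + i];
            for (size_t k = i; k < n; k++)          /* drives a[j][i] to 0 */
                a[j * n + k] -= factor * a[i * n + k];
            b[j] -= factor * b[i];
        }
    }
}
```

Run on the 4 x 4 system above, the last row becomes 3x3 = 6, matching the final slide of the example. The three nested loops give the $\Theta(n^3)$ sequential cost.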

Iteration of Gaussian Elimination
[Figure: the matrix during iteration i, showing the pivot row i, the elements already driven to 0, the elements that will not be changed, and the elements that will be changed.]

Numerical Stability Issues
If pivot element close to zero, significant roundoff errors can result
Gaussian elimination with partial pivoting mitigates this problem
In step i we search rows i through n - 1 for the row whose column i element has the largest absolute value
Swap (pivot) this row with row i

Implementing Partial Pivoting
[Figure: two panels, "Without partial pivoting" and "With partial pivoting".]

Row-oriented Parallel Algorithm
Associate primitive task with each row of A and corresponding elements of x and b
A kind of reduction needed to find the identity of the pivot row
Tournament: want to determine identity of row with largest value, rather than largest value itself
Could be done with two all-reductions
MPI provides a simpler, faster mechanism

MPI_MAXLOC, MPI_MINLOC
MPI provides reduction operators MPI_MAXLOC, MPI_MINLOC
Provide datatype representing a (value, index) pair

MPI (value, index) Datatypes
MPI_Datatype           Meaning
MPI_2INT               Two ints
MPI_DOUBLE_INT         A double followed by an int
MPI_FLOAT_INT          A float followed by an int
MPI_LONG_INT           A long followed by an int
MPI_LONG_DOUBLE_INT    A long double followed by an int
MPI_SHORT_INT          A short followed by an int
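
Example Use of MPI_MAXLOC
#include <math.h>   /* fabs */
#include <mpi.h>
/* (value, index) pair laid out to match MPI_DOUBLE_INT */
struct {
    double value;
    int index;
} local, global;
...
local.value = fabs(a[j][i]);   /* magnitude of the candidate pivot */
local.index = j;               /* row that holds it */
...
/* Every process receives the largest value and the index that goes with it */
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);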
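To make the semantics of the reduction concrete, here is a serial sketch that folds an array of (value, index) pairs the way MPI_MAXLOC is defined to combine them: the larger value wins, and on equal values the smaller index wins. This is plain C that simulates the operator, not MPI code; the type and function names are hypothetical.

```c
/* A (value, index) pair, mirroring the layout used with MPI_DOUBLE_INT. */
struct value_index {
    double value;
    int index;
};

/* Combine two pairs as MPI_MAXLOC does: keep the larger value; if the
 * values are equal, keep the smaller index. */
struct value_index maxloc_combine(struct value_index x, struct value_index y)
{
    if (x.value > y.value) return x;
    if (y.value > x.value) return y;
    return (x.index <= y.index) ? x : y;
}

/* Fold count >= 1 pairs, one contributed by each simulated process. */
struct value_index maxloc_reduce(const struct value_index *pairs, int count)
{
    struct value_index best = pairs[0];
    for (int r = 1; r < count; r++)
        best = maxloc_combine(best, pairs[r]);
    return best;
}
```

Because the winning index travels with the winning value, one all-reduction identifies the pivot row.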

Second Communication per Iteration
[Figure: rows j and picked, columns i and k. Updating a[j][k] uses a[j][i] together with a[picked][i] and a[picked][k] from the pivot row.]

Communication Complexity
Complexity of tournament: $\Theta(\log p)$
Complexity of broadcasting pivot row: $\Theta(n \log p)$
A total of n - 1 iterations
Overall communication complexity: $\Theta(n^2 \log p)$

Isoefficiency Analysis
Communication overhead: $\Theta(n^2 p \log p)$
Sequential algorithm has time complexity $\Theta(n^3)$
Isoefficiency relation: $n^3 \ge C n^2 p \log p \Rightarrow n \ge C p \log p$
Scalability function (with $M(n) = n^2$): $M(C p \log p)/p = C^2 p^2 \log^2 p / p = C^2 p \log^2 p$
This system has poor scalability

Column-oriented Algorithm
Associate a primitive task with each column of A and another primitive task for b
During iteration i task controlling column i determines pivot row and broadcasts its identity
During iteration i task controlling column i must also broadcast column i to other tasks
Agglomerate tasks in an interleaved fashion to balance workloads
Isoefficiency same as row-oriented algorithm

Comparison of Two Algorithms
Both algorithms evenly divide workload
Both algorithms do a broadcast each iteration
Difference: identification of pivot row
  Row-oriented algorithm does search in parallel but requires all-reduce step
  Column-oriented algorithm does search sequentially but requires no communication
Row-oriented superior when n relatively larger and p relatively smaller

Problems with These Algorithms
They break parallel execution into computation and communication phases
Processes not performing computations during the broadcast steps
Time spent doing broadcasts is large enough to ensure poor scalability

Pipelined, Row-Oriented Algorithm
Want to overlap communication time with computation time
We could do this if we knew in advance the row used to reduce all the other rows.
Let’s pivot columns instead of rows!
In iteration i we can use row i to reduce the other rows.

Communication Pattern
[Figure sequence, nine frames: four processes labeled 0 to 3. Row 0 is passed from process to process, and each process is shown "Reducing using row 0" once it has the row. Row 1 then travels the same way while other processes are still reducing using row 0, until all four processes are "Reducing using row 1".]

Analysis (1/2)
Total computation time: $\Theta(n^3/p)$
Total message transmission time: $\Theta(n^2)$
When n large enough, message transmission time completely overlapped by computation time
Message start-up not overlapped: $\Theta(n)$
Parallel overhead: $\Theta(np)$

Analysis (2/2)
Isoefficiency relation: $n^3 \ge C n p \Rightarrow n \ge \sqrt{C p}$
Scalability function: $M(\sqrt{C p})/p = C p / p = C$
Parallel system is perfectly scalable

Sparse Systems
Gaussian elimination not well-suited for sparse systems
Coefficient matrix gradually fills with nonzero elements
Result
  Increases storage requirements
  Increases total operation count

Example of “Fill”

Iterative Methods
Iterative method: algorithm that generates a series of approximations to solution’s value
Require less storage than direct methods
Since they avoid computations on zero elements, they can save a lot of computations

Jacobi Method
$$x_i^{(k+1)} = \frac{1}{a_{i,i}} \left( b_i - \sum_{j \ne i} a_{i,j} \, x_j^{(k)} \right)$$
Values of elements of vector x at iteration k+1 depend upon values of vector x at iteration k
Gauss-Seidel method: Use latest version available of x_i
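A sequential sketch of the Jacobi update in C follows. The function names, the flat row-major layout, the caller-supplied scratch vector, and the stopping rule (largest change in any component below a tolerance) are all assumptions chosen for illustration, not details from the book.

```c
#include <stddef.h>

/* One Jacobi sweep: every x_new[i] is computed from x_old only. */
void jacobi_sweep(size_t n, const double *a, const double *b,
                  const double *x_old, double *x_new)
{
    for (size_t i = 0; i < n; i++) {
        double sum = b[i];
        for (size_t j = 0; j < n; j++)
            if (j != i)
                sum -= a[i * n + j] * x_old[j];
        x_new[i] = sum / a[i * n + i];
    }
}

/* Repeat sweeps until no component changes by tol or more.
 * x holds the initial guess on entry and the approximation on return.
 * Returns the number of sweeps used, or -1 if max_iters was reached. */
int jacobi_solve(size_t n, const double *a, const double *b, double *x,
                 double *scratch, double tol, int max_iters)
{
    for (int it = 1; it <= max_iters; it++) {
        jacobi_sweep(n, a, b, x, scratch);
        double max_change = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = scratch[i] - x[i];
            if (d < 0) d = -d;
            if (d > max_change) max_change = d;
            x[i] = scratch[i];
        }
        if (max_change < tol)
            return it;
    }
    return -1;
}
```

Since each x_new[i] depends only on the previous iterate, the loop over i has no dependences between its iterations. A Gauss-Seidel variant would instead write each new value straight into x so that later rows see it within the same sweep. Convergence is guaranteed when A is strictly diagonally dominant.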

Jacobi Method Iterations
[Figure: successive iterates x0, x1, x2, x3, x4 plotted in the plane, with both axes marked from 1 to 4.]

Rate of Convergence
Even when Jacobi method and Gauss-Seidel methods converge on solution, rate of convergence often too slow to make them practical
We will move on to an iterative method with much faster convergence

Conjugate Gradient Method
A is positive definite if for every nonzero vector x and its transpose $x^T$, the product $x^T A x > 0$
If A is symmetric and positive definite, then the function
$$q(x) = \tfrac{1}{2} x^T A x - x^T b + c$$
has a unique minimizer that is solution to Ax = b
Conjugate gradient is an iterative method that solves Ax = b by minimizing q(x)
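The slides do not list the steps of the method, so the following is a sketch of the standard textbook formulation for a dense symmetric positive definite matrix, not necessarily the book's version. All names, the row-major layout, the caller-supplied work vectors, and the residual-norm stopping test are assumptions.

```c
#include <stddef.h>

static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += u[i] * v[i];
    return s;
}

static void matvec(size_t n, const double *a, const double *v, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j] * v[j];
        out[i] = s;
    }
}

/* Conjugate gradient for symmetric positive definite a (row-major).
 * x holds the initial guess on entry and the approximation on return.
 * r, d, ad are caller-supplied work vectors of length n.
 * Returns the number of iterations used, or -1 if the residual norm
 * is still above tol after max_iters iterations. */
int conjugate_gradient(size_t n, const double *a, const double *b, double *x,
                       double *r, double *d, double *ad,
                       double tol, int max_iters)
{
    matvec(n, a, x, ad);
    for (size_t i = 0; i < n; i++) {
        r[i] = b[i] - ad[i];          /* residual = negative gradient of q */
        d[i] = r[i];                  /* first search direction */
    }
    double rr = dot(n, r, r);
    for (int it = 0; it < max_iters; it++) {
        if (rr <= tol * tol)
            return it;
        matvec(n, a, d, ad);
        double alpha = rr / dot(n, d, ad);   /* step minimizing q along d */
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * d[i];
            r[i] -= alpha * ad[i];
        }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++)
            d[i] = r[i] + beta * d[i];       /* next conjugate direction */
        rr = rr_new;
    }
    return (rr <= tol * tol) ? max_iters : -1;
}
```

Each iteration costs one matrix-vector product and two inner products, plus a few vector updates.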

Conjugate Gradient Convergence
[Figure: successive iterates x0, x1, x2 plotted in the plane, with both axes marked from 1 to 4.]
Finds value of n-dimensional solution in at most n iterations (in exact arithmetic)

Conjugate Gradient Computations
Matrix-vector multiplication
Inner product (dot product)
Matrix-vector multiplication has higher time complexity
Must modify previously developed algorithm to account for sparse matrices
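One way to account for sparsity, sketched here for the symmetrically banded case, is to restrict the inner loop of the matrix-vector product to the band. The function name and the choice to keep the matrix in full row-major storage are simplifying assumptions; a real sparse code would also store only the band.

```c
#include <stddef.h>

/* y = a x for an n-by-n matrix with semibandwidth w, that is,
 * a[i][j] == 0 whenever |i - j| > w. Only entries inside the band
 * are read, so the cost is O(n w) rather than O(n^2).
 * The matrix is kept in full row-major storage for simplicity. */
void banded_matvec(size_t n, size_t w, const double *a,
                   const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        size_t lo = (i > w) ? i - w : 0;             /* first column in band */
        size_t hi = (i + w < n - 1) ? i + w : n - 1; /* last column in band */
        double sum = 0.0;
        for (size_t j = lo; j <= hi; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

With the 6 x 6 symmetrically banded example from the terminology slides (semibandwidth 2) and x set to all ones, the result is the vector of row sums 5, 10, 16, 11, 25, 6.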

Rowwise Block Striped Decomposition of a Symmetrically Banded Matrix
[Figure: two panels, "Matrix" and "Decomposition".]

Representation of Vectors
Replicate vectors
  Need all-gather step after matrix-vector multiply
  Inner product has time complexity $\Theta(n)$
Block decomposition of vectors
  Need all-gather step before matrix-vector multiply
  Inner product has time complexity $\Theta(n/p + \log p)$

Comparison of Vector Decompositions
[Figure: plot with axes p and n, divided into a region where replicated vectors are superior and a region where block decomposition is superior.]

Summary (1/2)
Solving systems of linear equations
  Direct methods
  Iterative methods
Parallel designs for
  Back substitution
  Gaussian elimination
  Conjugate gradient method

Summary (2/2)
Superiority of one algorithm over another depends on size of problem, number of processors, characteristics of parallel computer
Overlapping communications with computations can be key to scalability