Michael Jandron – Naval Undersea Warfare Center // Approved for Public Release
A New Asynchronous Solver for Banded Linear Systems
2015 SIAM Conference on Applied Linear Algebra, October 29, 2015
Michael Jandron, Naval Undersea Warfare Center, Newport, RI
Anthony Ruffa, NUWC, Newport, RI
Raymond Roberts, NUWC, Newport, RI
James Baglama, University of Rhode Island, Kingston, RI
Motivation

Looking for new techniques to complement the tried-and-true methods

• Large sparse problems can take a long time to solve (days, months, years)
  – Direct methods are still useful
  – In FEA, substructuring, Schur complement, and multifrontal methods are common; they all rely on a Gaussian elimination backbone, which is difficult to parallelize
  – We are always looking for ways to increase the level of parallelization and reduce the communication bound
Outline

• Tridiagonal solver
  – Limitations and what it is good for
• Pentadiagonal solver
• General banded solver
  – Theoretical speedup predictions
  – Development
  – Numerical implementation
  – Numerical benchmarks
• Conclusions and future work
Method for Tridiagonal Systems
Augment an unknown to the system [1–3].

Given the following linear system, split it into two tasks (1 and 2). The principle of superposition applies: the last equation gives the superposition constant, yielding the final vectorized superposition.
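The two-task split can be sketched in a few lines. This is a minimal sketch in Python/NumPy rather than the talk's FORTRAN (the name `tridiag_mfs` and the row convention a·x[i−1] + d·x[i] + c·x[i+1] = b[i] are my assumptions): task 1 runs the forward recursion with the free first unknown set to zero against the right-hand side, task 2 runs it with the free unknown set to one against a zero right-hand side, and the last equation fixes the superposition constant.

```python
import numpy as np

def tridiag_mfs(a, d, c, b):
    """Modified forward substitution sketch for a tridiagonal system.
    a: subdiagonal (a[0] unused), d: diagonal, c: superdiagonal
    (c[-1] unused), b: right-hand side."""
    n = len(d)
    y = np.zeros(n)   # task 1: particular solution, y[0] = 0, RHS = b
    z = np.zeros(n)   # task 2: homogeneous solution, z[0] = 1, RHS = 0
    z[0] = 1.0
    # Forward recursion: row i determines x[i+1] once x[i-1], x[i] are known.
    for i in range(n - 1):
        lower_y = a[i] * y[i - 1] if i > 0 else 0.0
        lower_z = a[i] * z[i - 1] if i > 0 else 0.0
        y[i + 1] = (b[i] - lower_y - d[i] * y[i]) / c[i]
        z[i + 1] = (-lower_z - d[i] * z[i]) / c[i]
    # Last equation fixes the superposition constant k:
    k = (b[-1] - a[-1] * y[-2] - d[-1] * y[-1]) / (a[-1] * z[-2] + d[-1] * z[-1])
    return y + k * z
```

The two recursions never depend on each other, which is what makes the asynchronous, per-core execution discussed later in the talk possible.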
[1] Ruffa, A., "A Solution Approach for Lower Hessenberg Linear Systems," ISRN Applied Mathematics (2011).
[2] Jandron, M., Ruffa, A., Baglama, J., "An Asynchronous Direct Solver for Banded Linear Systems," Numerical Algorithms (2015, submitted).
[3] Ruffa, A., Jandron, M., Toni, B., "Parallelized Solution of Banded Linear Systems," STEAM-H Springer Series contribution (2015, submitted).
System Details for Tridiagonal Systems
The augmented matrix is underdetermined, so the solution is known only to within a constant. Choose the free values for tasks 1 and 2 arbitrarily and solve for the remaining unknowns.
Limitations of Modified Forward Substitution (MFS)
[Figure: solution x and error b − Ax versus unknown index k, comparing MATLAB backslash with MFS. In one case the two methods agree to within ~1e−14; in another, the MFS errors are O(1).]
Alternate methods?
System Details for Tridiagonal Systems
Option 1: a modified forward substitution scheme. Fast, but can be unreliable in some cases without a form of pivoting or precision control.

Option 2: using the pseudoinverse. General, but can be slower and more memory-intensive.
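One plausible reading of the pseudoinverse option, sketched below; this is an assumption on my part, not spelled out on the slide, and `tridiag_pinv` is a hypothetical name. The first n−1 rows of the system are underdetermined, so a minimum-norm least-squares solve plus a null-space vector spans the same one-parameter family as the two forward-substitution tasks, while avoiding the exponential growth of the recursion, at the cost of dense SVD work.

```python
import numpy as np

def tridiag_pinv(A, b):
    """Pseudoinverse-flavored variant (assumed interpretation): solve the
    underdetermined top block by minimum-norm least squares, recover the
    one-dimensional null space, and let the last row fix the constant."""
    n = len(b)
    Atop = A[:-1, :]                                    # first n-1 rows
    y, *_ = np.linalg.lstsq(Atop, b[:-1], rcond=None)   # min-norm particular
    _, _, Vt = np.linalg.svd(Atop)                      # null space of top block
    z = Vt[-1]
    # Last row of A fixes the superposition constant:
    k = (b[-1] - A[-1] @ y) / (A[-1] @ z)
    return y + k * z
```

The least-squares/SVD work is what makes this option slower and more memory-intensive than the forward recursion, matching the trade-off stated above.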
Method for Pentadiagonal Systems
How does it work for general banded systems?
Add two variables to the system.

Given the following linear system, split it into three tasks (1, 2, and 3). The principle of superposition applies: the last two equations give a constraint linear system.
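The three-task pentadiagonal split can be sketched as follows. This is a minimal NumPy sketch under the same assumptions as the tridiagonal case (dense storage for clarity; `penta_mfs` is a hypothetical name): one particular task with both free unknowns zero, two homogeneous tasks with unit free unknowns, and a 2x2 constraint solve over the last two rows.

```python
import numpy as np

def penta_mfs(A, b):
    """Pentadiagonal sketch: with two superdiagonals, row i determines
    x[i+2] once earlier entries are known, so x[0] and x[1] are free."""
    n = len(b)

    def forward(x0, x1, rhs):
        x = np.zeros(n)
        x[0], x[1] = x0, x1
        for i in range(n - 2):
            # Row i: sum_j A[i,j] x[j] = rhs[i]; solve for x[i+2].
            known = A[i, :i + 2] @ x[:i + 2]
            x[i + 2] = (rhs[i] - known) / A[i, i + 2]
        return x

    y = forward(0.0, 0.0, b)               # task 1: particular
    z1 = forward(1.0, 0.0, np.zeros(n))    # task 2: homogeneous
    z2 = forward(0.0, 1.0, np.zeros(n))    # task 3: homogeneous
    # Last two rows give the 2x2 constraint system for (k1, k2):
    C = np.array([[A[i] @ z1, A[i] @ z2] for i in (n - 2, n - 1)])
    r = np.array([b[i] - A[i] @ y for i in (n - 2, n - 1)])
    k1, k2 = np.linalg.solve(C, r)
    return y + k1 * z1 + k2 * z2           # superposition
```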
Extension to Banded Systems
Independent linear systems
Partial solution vectors
Extension to Banded Systems
Constraint Matrix
Superposition
Numerical Implementation
Even the constraint matrix can be split up if desired
Master thread workflow:
1. Request solution; broadcast to each available core.
2. Begin asynchronous forward substitution on each core as its data arrives.
3. Send extra variables back as they are formed.
4. Once all extra variables come back, tackle the constraint matrix using any dense solver.
5. Level 1 superposition to get the final solution.
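The master/worker layout above can be mimicked in a few lines. This sketch assumes qu superdiagonals (so qu free unknowns and qu+1 independent tasks), with Python's `concurrent.futures` standing in for the talk's FORTRAN/OpenMP implementation; `banded_mfs_async` is a hypothetical name, not the authors' code.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def banded_mfs_async(A, b, qu):
    """General banded sketch: rows 0..n-qu-1 form a forward recursion that
    determines x[i+qu]; the first qu entries are free, giving one particular
    and qu homogeneous tasks, dispatched one per worker."""
    n = len(b)

    def forward(free, rhs):            # one asynchronous task per core
        x = np.zeros(n)
        x[:qu] = free
        for i in range(n - qu):
            x[i + qu] = (rhs[i] - A[i, :i + qu] @ x[:i + qu]) / A[i, i + qu]
        return x

    with ThreadPoolExecutor(max_workers=qu + 1) as pool:
        fy = pool.submit(forward, np.zeros(qu), b)              # particular
        fz = [pool.submit(forward, np.eye(qu)[j], np.zeros(n))
              for j in range(qu)]                               # homogeneous
        y = fy.result()                # master gathers extra variables
        Z = np.column_stack([f.result() for f in fz])
    # Master: dense qu x qu constraint matrix, solvable by any dense solver.
    rows = range(n - qu, n)
    C = np.array([[A[i] @ Z[:, j] for j in range(qu)] for i in rows])
    r = np.array([b[i] - A[i] @ y for i in rows])
    return y + Z @ np.linalg.solve(C, r)                        # superposition
```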
Banded Systems Expected Speedups
[Slide compares operation counts, as functions of the number of unknowns and the numbers of super- and subdiagonals, for sequential LU (banded Gaussian elimination plus forward/backward substitution) versus sequential and q-core BMFS (forward substitution tasks of the same cost, plus a dense constraint matrix solve and the superposition).]

Predicted speedups over sequential LU:
• Tridiagonal should be ~2x faster
• Pentadiagonal should be ~8x faster
• Heptadiagonal should be ~18x faster
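The three quoted factors fit a simple pattern. This is a hedged observation inferred from the numbers alone (the slide's symbolic speedup formula did not survive extraction): with q sub- and q superdiagonals, the quoted predictions match 2q².

```python
def predicted_speedup(q):
    """Inferred pattern only: q = 1 (tridiagonal), 2 (pentadiagonal),
    3 (heptadiagonal) reproduce the slide's 2x, 8x, 18x predictions."""
    return 2 * q * q

print([predicted_speedup(q) for q in (1, 2, 3)])  # [2, 8, 18]
```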
Banded Systems Expected Speedups

Anticipated speedup over sequential LU using various numbers of cores.

[Plots: speedup versus number of superdiagonals q for 1-core, 8-core, and q-core BMFS, at n = 1,000,000 and n = 1,000,000,000. At n = 1,000,000, 1-core is 0.5x and 8-core is 4x.]
For the same number of cores, LU (e.g., multifrontal) must scale to these levels in order to match this speed.
Banded Systems Expected Speedups
We know the optimal locations for maximum speedup over sequential LU, as functions of the number of unknowns and the numbers of super- and subdiagonals. For the same number of cores, LU (e.g., multifrontal) must scale to these levels in order to match this speed.
Numerical Benchmarks
The tests exercise dependence without exponential growth. For simplicity, we consider only symmetric cases.

Implementation: FORTRAN 90 with OpenMP on 8 cores, compared against the PARDISO 5.0.0 solver [1–3], also using 8 cores.
[1] M. Luisier, O. Schenk, et al., "Fast Methods for Computing Selected Elements of the Green's Function in Massively Parallel Nanoelectronic Device Simulations," Euro-Par 2013, LNCS 8097, F. Wolf, B. Mohr, and D. an Mey (Eds.), Springer-Verlag Berlin Heidelberg, pp. 533–544, 2013.
[2] O. Schenk, M. Bollhoefer, and R. Roemer, "On Large-Scale Diagonalization Techniques for the Anderson Model of Localization," featured SIGEST paper, SIAM Review 50 (2008), pp. 91–112.
[3] O. Schenk, A. Waechter, and M. Hagemann, "Matching-Based Preprocessing Algorithms to the Solution of Saddle-Point Problems in Large-Scale Nonconvex Interior-Point Optimization," Computational Optimization and Applications, Vol. 36, Nos. 2–3, pp. 321–341, April 2007.
Numerical Results with 8-cores
Wall time was less than PARDISO in certain cases, without even scaling the core count.

[Charts: FORTRAN/OpenMP speedup results versus number of superdiagonals and number of unknowns, comparing 8-core PARDISO against 8-core BMFS (with and without the qxq constraint solve) and q-core-scaled BMFS.]
Numerical Results with 8-cores
The increased error is likely due to round-off and to errors inherent in the constraint matrix solve.

[Charts: FORTRAN/OpenMP results versus number of superdiagonals for n = 100,000; 500,000; 1,000,000; and 5,000,000, comparing 8-core PARDISO against 8-core BMFS (with and without the qxq solve) and q-core-scaled BMFS.]
Speedup over PARDISO Solver
[Charts: speedup versus number of superdiagonals and number of unknowns.]

From actual wall times: 8-core BMFS vs. 8-core PARDISO.
By scaling BMFS to q cores (not the qxq solve part) vs. 8-core PARDISO.
Summary
• Developed a direct solver, built on a superposition principle, that can skip the Gaussian elimination process when solving banded linear systems [1–3]
• Fastest for banded systems without exponential growth
  – Observed speedups of over 20x relative to PARDISO when both use 8 threads, for small and large problems
• Can handle exponential growth (or really any problem thrown at it) by incorporating pseudoinverse calculations, but this is less attractive
• Future work:
  – Distributed memory / MPI / GPU computing
  – Can the pseudoinverse be used efficiently?
  – Can a form of pivoting be employed?
• The end goal is a competitive direct solver for banded systems, with an eye on FEA applications
[1] Ruffa, A., "A Solution Approach for Lower Hessenberg Linear Systems," ISRN Applied Mathematics (2011).
[2] Jandron, M., Ruffa, A., Baglama, J., "An Asynchronous Direct Solver for Banded Linear Systems," Numerical Algorithms (2015, submitted).
[3] Ruffa, A., Jandron, M., Toni, B., "Parallelized Solution of Banded Linear Systems," STEAM-H Springer Series contribution (2015, submitted).