Michael Jandron – Naval Undersea Warfare Center // Approved for Public Release
A New Asynchronous Solver for Banded Linear Systems
2015 SIAM Conference on Applied Linear Algebra, October 29, 2015
Michael Jandron, Naval Undersea Warfare Center, Newport, RI
Anthony Ruffa, NUWC, Newport, RI
Raymond Roberts, NUWC, Newport, RI
James Baglama, University of Rhode Island, Kingston, RI
Motivation

Looking for new techniques to complement the tried-and-true methods

• Large sparse problems can take a long time to solve (days, months, years)
  – Direct methods are still useful
  – In FEA, substructuring, Schur complement, and multifrontal methods are common; they all rely on a Gaussian elimination backbone, which is difficult to parallelize
  – We are always looking for ways to increase the level of parallelization and reduce the communication bound
Outline

• Tridiagonal solver
  – Limitations and what it is good for
• Pentadiagonal solver
• General banded solver
  – Theoretical speedup predictions
  – Development
  – Numerical implementation
  – Numerical benchmarks
• Conclusions and future work
Method for Tridiagonal Systems
Augment an unknown to the system [1–3].

Given the following linear system, split it into two tasks (1 and 2). The principle of superposition applies: the last equation gives the superposition constant, yielding the final vectorized superposition.
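The two-task split can be sketched in a few lines. This is a minimal sketch in Python/NumPy rather than the talk's FORTRAN (the name `tridiag_mfs` and the row convention a·x[i−1] + d·x[i] + c·x[i+1] = b[i] are my assumptions): task 1 runs the forward recursion with the free first unknown set to zero against the right-hand side, task 2 runs it with the free unknown set to one against a zero right-hand side, and the last equation fixes the superposition constant.

```python
import numpy as np

def tridiag_mfs(a, d, c, b):
    """Modified forward substitution sketch for a tridiagonal system.
    a: subdiagonal (a[0] unused), d: diagonal, c: superdiagonal
    (c[-1] unused), b: right-hand side."""
    n = len(d)
    y = np.zeros(n)   # task 1: particular solution, y[0] = 0, RHS = b
    z = np.zeros(n)   # task 2: homogeneous solution, z[0] = 1, RHS = 0
    z[0] = 1.0
    # Forward recursion: row i determines x[i+1] once x[i-1], x[i] are known.
    for i in range(n - 1):
        lower_y = a[i] * y[i - 1] if i > 0 else 0.0
        lower_z = a[i] * z[i - 1] if i > 0 else 0.0
        y[i + 1] = (b[i] - lower_y - d[i] * y[i]) / c[i]
        z[i + 1] = (-lower_z - d[i] * z[i]) / c[i]
    # Last equation fixes the superposition constant k:
    k = (b[-1] - a[-1] * y[-2] - d[-1] * y[-1]) / (a[-1] * z[-2] + d[-1] * z[-1])
    return y + k * z
```

The two recursions never depend on each other, which is what makes the asynchronous, per-core execution discussed later in the talk possible.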
[1] Ruffa, A., "A Solution Approach for Lower Hessenberg Linear Systems," ISRN Applied Mathematics (2011).
[2] Jandron, M., Ruffa, A., Baglama, J., "An Asynchronous Direct Solver for Banded Linear Systems," Numerical Algorithms (2015, submitted).
[3] Ruffa, A., Jandron, M., Toni, B., "Parallelized Solution of Banded Linear Systems," STEAM-H Springer Series contribution (2015, submitted).
System Details for Tridiagonal Systems
The augmented matrix is underdetermined, so the solution is known only to within a constant. Choose the free values for tasks 1 and 2 arbitrarily and solve for the remaining unknowns.
Limitations of Modified Forward Substitution (MFS)
[Figure: solution x and error b − Ax versus unknown index k, comparing MATLAB backslash with MFS. In one case the two methods agree to within ~1e−14; in another, the MFS errors are O(1).]
Alternate methods?
System Details for Tridiagonal Systems
Option 1: a modified forward substitution scheme. Fast, but can be unreliable in some cases without a form of pivoting or precision control.

Option 2: using the pseudoinverse. General, but can be slower and more memory-intensive.
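One plausible reading of the pseudoinverse option, sketched below; this is an assumption on my part, not spelled out on the slide, and `tridiag_pinv` is a hypothetical name. The first n−1 rows of the system are underdetermined, so a minimum-norm least-squares solve plus a null-space vector spans the same one-parameter family as the two forward-substitution tasks, while avoiding the exponential growth of the recursion, at the cost of dense SVD work.

```python
import numpy as np

def tridiag_pinv(A, b):
    """Pseudoinverse-flavored variant (assumed interpretation): solve the
    underdetermined top block by minimum-norm least squares, recover the
    one-dimensional null space, and let the last row fix the constant."""
    n = len(b)
    Atop = A[:-1, :]                                    # first n-1 rows
    y, *_ = np.linalg.lstsq(Atop, b[:-1], rcond=None)   # min-norm particular
    _, _, Vt = np.linalg.svd(Atop)                      # null space of top block
    z = Vt[-1]
    # Last row of A fixes the superposition constant:
    k = (b[-1] - A[-1] @ y) / (A[-1] @ z)
    return y + k * z
```

The least-squares/SVD work is what makes this option slower and more memory-intensive than the forward recursion, matching the trade-off stated above.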
Method for Pentadiagonal Systems
How does it work for general banded systems?
Add two variables to the system.

Given the following linear system, split it into three tasks (1, 2, and 3). The principle of superposition applies: the last two equations give a constraint linear system.
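The three-task pentadiagonal split can be sketched as follows. This is a minimal NumPy sketch under the same assumptions as the tridiagonal case (dense storage for clarity; `penta_mfs` is a hypothetical name): one particular task with both free unknowns zero, two homogeneous tasks with unit free unknowns, and a 2x2 constraint solve over the last two rows.

```python
import numpy as np

def penta_mfs(A, b):
    """Pentadiagonal sketch: with two superdiagonals, row i determines
    x[i+2] once earlier entries are known, so x[0] and x[1] are free."""
    n = len(b)

    def forward(x0, x1, rhs):
        x = np.zeros(n)
        x[0], x[1] = x0, x1
        for i in range(n - 2):
            # Row i: sum_j A[i,j] x[j] = rhs[i]; solve for x[i+2].
            known = A[i, :i + 2] @ x[:i + 2]
            x[i + 2] = (rhs[i] - known) / A[i, i + 2]
        return x

    y = forward(0.0, 0.0, b)               # task 1: particular
    z1 = forward(1.0, 0.0, np.zeros(n))    # task 2: homogeneous
    z2 = forward(0.0, 1.0, np.zeros(n))    # task 3: homogeneous
    # Last two rows give the 2x2 constraint system for (k1, k2):
    C = np.array([[A[i] @ z1, A[i] @ z2] for i in (n - 2, n - 1)])
    r = np.array([b[i] - A[i] @ y for i in (n - 2, n - 1)])
    k1, k2 = np.linalg.solve(C, r)
    return y + k1 * z1 + k2 * z2           # superposition
```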
Extension to Banded Systems
Independent linear systems
Partial solution vectors
Extension to Banded Systems
Constraint Matrix
Superposition
Numerical Implementation
Even the constraint matrix can be split up if desired
Master thread workflow:
1. Request solution; broadcast to each available core.
2. Begin asynchronous forward substitution on each core as its data arrives.
3. Send extra variables back as they are formed.
4. Once all extra variables come back, tackle the constraint matrix using any dense solver.
5. Level 1 superposition to get the final solution.
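The master/worker layout above can be mimicked in a few lines. This sketch assumes qu superdiagonals (so qu free unknowns and qu+1 independent tasks), with Python's `concurrent.futures` standing in for the talk's FORTRAN/OpenMP implementation; `banded_mfs_async` is a hypothetical name, not the authors' code.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def banded_mfs_async(A, b, qu):
    """General banded sketch: rows 0..n-qu-1 form a forward recursion that
    determines x[i+qu]; the first qu entries are free, giving one particular
    and qu homogeneous tasks, dispatched one per worker."""
    n = len(b)

    def forward(free, rhs):            # one asynchronous task per core
        x = np.zeros(n)
        x[:qu] = free
        for i in range(n - qu):
            x[i + qu] = (rhs[i] - A[i, :i + qu] @ x[:i + qu]) / A[i, i + qu]
        return x

    with ThreadPoolExecutor(max_workers=qu + 1) as pool:
        fy = pool.submit(forward, np.zeros(qu), b)              # particular
        fz = [pool.submit(forward, np.eye(qu)[j], np.zeros(n))
              for j in range(qu)]                               # homogeneous
        y = fy.result()                # master gathers extra variables
        Z = np.column_stack([f.result() for f in fz])
    # Master: dense qu x qu constraint matrix, solvable by any dense solver.
    rows = range(n - qu, n)
    C = np.array([[A[i] @ Z[:, j] for j in range(qu)] for i in rows])
    r = np.array([b[i] - A[i] @ y for i in rows])
    return y + Z @ np.linalg.solve(C, r)                        # superposition
```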
Banded Systems Expected Speedups
[Slide compares operation counts, as functions of the number of unknowns and the numbers of super- and subdiagonals, for sequential LU (banded Gaussian elimination plus forward/backward substitution) versus sequential and q-core BMFS (forward substitution tasks of the same cost, plus a dense constraint matrix solve and the superposition).]

Predicted speedups over sequential LU:
• Tridiagonal should be ~2x faster
• Pentadiagonal should be ~8x faster
• Heptadiagonal should be ~18x faster
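The three quoted factors fit a simple pattern. This is a hedged observation inferred from the numbers alone (the slide's symbolic speedup formula did not survive extraction): with q sub- and q superdiagonals, the quoted predictions match 2q².

```python
def predicted_speedup(q):
    """Inferred pattern only: q = 1 (tridiagonal), 2 (pentadiagonal),
    3 (heptadiagonal) reproduce the slide's 2x, 8x, 18x predictions."""
    return 2 * q * q

print([predicted_speedup(q) for q in (1, 2, 3)])  # [2, 8, 18]
```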
Banded Systems Expected Speedups

Anticipated speedup over sequential LU using various numbers of cores.

[Plots: speedup versus number of superdiagonals q for 1-core, 8-core, and q-core BMFS, at n = 1,000,000 and n = 1,000,000,000. At n = 1,000,000, 1-core is 0.5x and 8-core is 4x.]
For the same number of cores, LU (e.g., multifrontal) must scale to these levels in order to match this speed.
Banded Systems Expected Speedups
We know the optimal locations for maximum speedup over sequential LU, as functions of the number of unknowns and the numbers of super- and subdiagonals. For the same number of cores, LU (e.g., multifrontal) must scale to these levels in order to match this speed.
Numerical Benchmarks
The tests exercise dependence without exponential growth. For simplicity, we consider only symmetric cases.

Implementation: FORTRAN 90 with OpenMP on 8 cores, compared against the PARDISO 5.0.0 solver [1–3], also using 8 cores.
[1] M. Luisier, O. Schenk, et al., "Fast Methods for Computing Selected Elements of the Green's Function in Massively Parallel Nanoelectronic Device Simulations," Euro-Par 2013, LNCS 8097, F. Wolf, B. Mohr, and D. an Mey (Eds.), Springer-Verlag Berlin Heidelberg, pp. 533–544, 2013.
[2] O. Schenk, M. Bollhoefer, and R. Roemer, "On Large-Scale Diagonalization Techniques for the Anderson Model of Localization," featured SIGEST paper, SIAM Review 50 (2008), pp. 91–112.
[3] O. Schenk, A. Waechter, and M. Hagemann, "Matching-Based Preprocessing Algorithms to the Solution of Saddle-Point Problems in Large-Scale Nonconvex Interior-Point Optimization," Computational Optimization and Applications, Vol. 36, Nos. 2–3, pp. 321–341, April 2007.
Numerical Results with 8-cores
Wall time was less than PARDISO in certain cases, without even scaling the core count.

[Charts: FORTRAN/OpenMP speedup results versus number of superdiagonals and number of unknowns, comparing 8-core PARDISO against 8-core BMFS (with and without the qxq constraint solve) and q-core-scaled BMFS.]
Numerical Results with 8-cores
The increased error is likely due to round-off and to errors inherent in the constraint matrix solve.

[Charts: FORTRAN/OpenMP results versus number of superdiagonals for n = 100,000; 500,000; 1,000,000; and 5,000,000, comparing 8-core PARDISO against 8-core BMFS (with and without the qxq solve) and q-core-scaled BMFS.]
Speedup over PARDISO Solver
[Charts: speedup versus number of superdiagonals and number of unknowns.]

From actual wall times: 8-core BMFS vs. 8-core PARDISO.
By scaling BMFS to q cores (not the qxq solve part) vs. 8-core PARDISO.
Summary
• Developed a direct solver, built on a superposition principle, that can skip the Gaussian elimination process when solving banded linear systems [1–3]
• Fastest for banded systems without exponential growth
  – Observed speedups of over 20x relative to PARDISO when both use 8 threads, for small and large problems
• Can handle exponential growth (or really any problem thrown at it) by incorporating pseudoinverse calculations, but this is less attractive
• Future work:
  – Distributed memory / MPI / GPU computing
  – Can the pseudoinverse be used efficiently?
  – Can a form of pivoting be employed?
• The end goal is a competitive direct solver for banded systems, with an eye on FEA applications
[1] Ruffa, A., "A Solution Approach for Lower Hessenberg Linear Systems," ISRN Applied Mathematics (2011).
[2] Jandron, M., Ruffa, A., Baglama, J., "An Asynchronous Direct Solver for Banded Linear Systems," Numerical Algorithms (2015, submitted).
[3] Ruffa, A., Jandron, M., Toni, B., "Parallelized Solution of Banded Linear Systems," STEAM-H Springer Series contribution (2015, submitted).