The Future of LAPACK and ScaLAPACK

Jason Riedy, Yozo Hida, James Demmel
EECS Department, University of California, Berkeley
November 18, 2005
Outline

- Survey responses: What users want
- Improving LAPACK and ScaLAPACK
  - Improved Numerics
  - Improved Performance
  - Improved Functionality
  - Improved Engineering and Community
- Two Example Improvements
  - Numerics: Iterative Refinement for Ax = b
  - Performance: The MRRR Algorithm for Ax = λx
Survey: What users want

- Survey available from http://www.netlib.org/lapack-dev/.
- 212 responses from over 100 different, non-anonymous groups
- Problem sizes:
    100    1K    10K    100K    1M    (other)
     8%   26%    24%     12%    6%     (24%)
- >80% interested in small-to-medium SMPs
- >40% interested in large distributed-memory systems
- Vendor libraries seen as faster, but buggier
- Over 20% want more than double precision; 70% want out-of-core support
- Requests: high-level interfaces, low-level interfaces, parallel redistribution*, and tuning
Participants

- UC Berkeley
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Vomel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads, ...
- U Tennessee, Knoxville
  - Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, ...
- Other Academic Institutions
  - UT Austin, UC Davis, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umea, U Wuppertal, U Zagreb
- Research Institutions
  - CERFACS, LBL, UEC (Japan)
- Industrial Partners
  - Cray, HP, Intel, MathWorks, NAG, SGI

You?
Improved Numerics

Improved accuracy at standard asymptotic speed: some routines are even faster!

- Iterative refinement for linear systems, least squares
  (Demmel / Hida / Kahan / Li / Mukherjee / Riedy / Sarkisyan)
- Pivoting and scaling for symmetric systems
  - Definite and indefinite
- Jacobi SVD (and faster) (Drmac / Veselic)
- Condition numbers and estimators (Higham / Cheng / Tisseur)
- Useful approximate error estimates
Improved Performance

Improved performance with at least standard accuracy

- MRRR algorithm for eigenvalues, SVD
  (Parlett / Dhillon / Vomel / Marques / Willems / Katagiri)
- Fast Hessenberg QR & QZ
  (Byers / Mathias / Braman, Kagstrom / Kressner)
- Fast reductions and BLAS2.5
  (van de Geijn, Bischof / Lang, Howell / Fulton)
- Recursive data layouts
  (Gustavson / Kagstrom / Elmroth / Jonsson)
- Generalized SVD (Bai, Wang)
- Polynomial roots from semi-separable form
  (Gu / Chandrasekaran / Zhu / Xia / Bindel / Garmire / Demmel)
- Automated tuning, optimizations in ScaLAPACK, ...
Improved Functionality

Algorithms

- Updating / downdating factorizations (Stewart, Langou)
- More generalized SVDs: products, CSD (Bai, Wang)
- More generalized Sylvester and Lyapunov solvers (Kagstrom, Jonsson, Granat)
- Quadratic eigenproblems (Mehrmann)
- Matrix functions (Higham)

Implementations

- Add "missing" features to ScaLAPACK
- Generate LAPACK and ScaLAPACK for higher precisions
Improved Engineering and Community

Use new features without a rewrite

- Use modern Fortran 95, maybe 2003
  - DO ... END DO, recursion, allocation (in wrappers)
- Provide higher-level wrappers for common languages
  - F95, C, C++
- Automatic generation of precisions and bindings
  - Full automation (FLAME, etc.) not quite ready for all functions
- Tests for algorithms, implementations, and installations

Open development

Need a community for long-term evolution.
http://www.netlib.org/lapack-dev/

Lots of work to do, both research and development.
Two Example Improvements

Recent, locally developed improvements

Improved Numerics

Iterative refinement for linear systems Ax = b:

- Extra precision ⇒ small error, dependable estimate
- Both normwise and componentwise
- (See LAWN 165 for full details.)

Improved Performance

MRRR algorithm for eigenvalue and SVD problems

- Optimal complexity: O(n) per value/vector
- (See LAWNs 162, 163, 166, 167, ... for more details.)
Numerics: Iterative Refinement

Improve the solution to Ax = b:

    Repeat: r = b − Ax,  dx = A⁻¹r,  x = x + dx
    Until:  good enough

Not-too-ill-conditioned ⇒ error O(√n ε)

[Figure: two scatter plots of normwise error E_norm vs. condition number κ_norm (log10 scales, 2,000,000 cases).]
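The refinement loop above can be sketched in a few lines. This is a hypothetical pure-Python illustration, not the LAPACK routine: a 2×2 Cramer's-rule solve stands in for the working-precision factorization, and exact rationals (`fractions.Fraction`) stand in for the extra-precision residual; `solve2` and `refine` are names invented for this sketch.

```python
from fractions import Fraction

def solve2(A, b):
    """Solve a 2x2 system by Cramer's rule: the 'working precision' solver."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def refine(A, b, steps=3):
    """Iterative refinement: repeat r = b - Ax, dx = A^-1 r, x = x + dx."""
    x = solve2(A, b)                                # initial working-precision solve
    Af = [[Fraction(v) for v in row] for row in A]  # exact copies of the data,
    bf = [Fraction(v) for v in b]                   # used for the residual
    for _ in range(steps):
        # Residual in extra precision (exact rationals stand in for doubled precision).
        r = [bf[i] - Af[i][0] * Fraction(x[0]) - Af[i][1] * Fraction(x[1])
             for i in range(2)]
        dx = solve2(A, [float(r[0]), float(r[1])])  # correction in working precision
        x = [x[0] + dx[0], x[1] + dx[1]]
    return x
```

A moderately ill-conditioned system such as A = [[1, 1], [1, 1.0001]], b = [2, 2.0001] (true solution [1, 1]) converges to near machine precision in a few sweeps.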
Numerics: Iterative Refinement

Dependable normwise relative error estimate

[Figure: two scatter plots of normwise error E_norm vs. error bound B_norm (log10 scales, 2,000,000 cases).]
Numerics: Iterative Refinement

Also small componentwise errors and dependable estimates

[Figure: scatter plots of componentwise error E_comp vs. condition number κ_comp, and of E_comp vs. bound B_comp (log10 scales, 2,000,000 cases).]
Relying on Condition Numbers

Need condition numbers for dependable estimates.

Both picking the right condition number and estimating it well matter.

[Figure: scatter plot of κ_norm computed in single vs. double precision (log10 scales, 2,000,000 cases).]
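A toy illustration of why the precision of the estimate matters. This is not LAPACK's estimator (which estimates the norm of the inverse rather than forming it); it is a hypothetical sketch that computes κ_∞ = ‖A‖_∞ ‖A⁻¹‖_∞ explicitly for a 2×2 matrix, with every intermediate optionally rounded to IEEE single. `to_single` and `kappa_inf_2x2` are names invented here.

```python
import struct

def to_single(x):
    """Round a double to the nearest IEEE single, simulating single-precision storage."""
    return struct.unpack('f', struct.pack('f', x))[0]

def kappa_inf_2x2(A, rnd=lambda v: v):
    """kappa_inf = ||A||_inf * ||A^-1||_inf for a 2x2 matrix, with every
    intermediate result rounded by `rnd` (identity = double, to_single = single)."""
    a, b, c, d = (rnd(v) for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
    det = rnd(rnd(a * d) - rnd(b * c))
    inv = [[rnd(d / det), rnd(-b / det)],
           [rnd(-c / det), rnd(a / det)]]
    norm = lambda M: max(abs(M[0][0]) + abs(M[0][1]),
                         abs(M[1][0]) + abs(M[1][1]))
    return norm([[a, b], [c, d]]) * norm(inv)
```

For A = [[1, 1], [1, 1 + 1e-7]], whose condition number is near 4e7, the single-precision computation of κ differs noticeably from the double-precision one: the off-diagonal perturbation 1e-7 is below single-precision resolution of the subtraction that forms the determinant.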
Performance: The MRRR Algorithm

Multiple Relatively Robust Representations

- 1999 Householder Award honorable mention for Dhillon
- Optimal complexity with small error!
  - O(nk) flops for k eigenvalues/vectors of an n × n tridiagonal matrix
  - Small residuals: ‖Tx_i − λ_i x_i‖ = O(nε)
  - Orthogonal eigenvectors: |x_i^T x_j| = O(nε)
- Similar algorithm for the SVD.
- Eigenvectors computed independently ⇒ naturally parallelizable
- (LAPACK release 3 had bugs and missing cases)
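MRRR itself is far too involved for a slide-sized sketch, but the Sturm-sequence machinery such tridiagonal eigensolvers build on fits in a few lines. This is an illustrative sketch, not the LAPACK code; `eig_count` and `kth_eigenvalue` are names invented here. The count of negative pivots in the LDLᵀ factorization of T − xI gives the number of eigenvalues below x, and bisection on that count isolates any single eigenvalue independently of the others.

```python
def eig_count(d, e, x):
    """Sturm count: number of eigenvalues of the symmetric tridiagonal matrix
    with diagonal d and off-diagonal e that are less than x (negative pivots
    of the LDL^T factorization of T - x*I)."""
    count, t = 0, 1.0
    for i in range(len(d)):
        t = d[i] - x - (e[i - 1] ** 2 / t if i else 0.0)
        if t == 0.0:
            t = 1e-300  # nudge off an exact zero pivot to keep the recurrence going
        if t < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    """Bisect for the k-th smallest eigenvalue (k = 1, 2, ...) in [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eig_count(d, e, mid) >= k:
            hi = mid    # at least k eigenvalues below mid: the k-th is to the left
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Because each call isolates one eigenvalue on its own, different eigenvalues can be bisected in parallel, which echoes the "eigenvectors computed independently" point above.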
Performance: The MRRR Algorithm

[Figure: timing comparison. "fast DC": a Wilkinson matrix, on which divide and conquer deflates like crazy.]
Summary

- LAPACK and ScaLAPACK are open for improvement!
- Planned improvements in
  - numerics,
  - performance,
  - functionality, and
  - engineering.
- Forming a community for long-term development.