Pioneering Science and Technology
Office of Science, U.S. Department of Energy
Using Tensor Methods, PETSc, and SLEPc to Obtain Exact Cumulative Reaction Probability
M. Minkoff and D. Kaushik
Outline
• Overview
• Description of CRP Simulation Problem
• Using PETSc and SLEPc for Application to CRP
• Computations for Sparse Matrices
• Results for Banded Preconditioning
• Future Directions
• Tensor Matrix Multiplication
Chemical Dynamics Theory
• 3 angles, 3 stretches: 6 degrees of freedom
Chemical Dynamics Theory
• Reaction rates are related to the Cumulative Reaction Probability (CRP), $N(E)$:
  $N(E) = 4\,\mathrm{Tr}\left[\varepsilon_r^{1/2}\, G(E)^{\dagger}\, \varepsilon_p\, G(E)\, \varepsilon_r^{1/2}\right]$
• Here Tr is the trace of the matrix, $\dagger$ denotes the adjoint, and $\varepsilon_r$ and $\varepsilon_p$ are the absorbing potentials in the reactant and product regions.
• $\varepsilon_r + \varepsilon_p = \varepsilon$, the given total absorbing potential.
• The Green's function has the form $G(E) = (E + i\varepsilon - H)^{-1}$, where $i$ is the imaginary unit and $H$ is the Hamiltonian.
• We need to solve two linear systems (at each iteration): $(E + i\varepsilon - H)\,y = x$ and its adjoint, where $x$ is known.
• These systems are solved via GMRES with preconditioning (initially diagonal scaling).
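The two solves per iteration can be illustrated with a minimal sketch. It applies the operator inside the trace to a vector using one solve with (E + iε − H) and one with its adjoint, then accumulates the trace from unit vectors. Everything concrete here is an assumption made for illustration: the small tight-binding chain used for H, the quadratic absorbing ramps, the energy value, and the function names. A dense direct solve (`np.linalg.solve`) stands in for the preconditioned GMRES named on the slide.

```python
import numpy as np

def crp_operator_apply(E, H, eps_r, eps_p, x):
    """Apply P = 4 eps_r^{1/2} G^dagger eps_p G eps_r^{1/2} to x with two solves.

    eps_r and eps_p are the diagonals of the absorbing potentials.
    """
    n = H.shape[0]
    A = E * np.eye(n) + 1j * np.diag(eps_r + eps_p) - H   # E + i*eps - H
    sr = np.sqrt(eps_r)
    y = np.linalg.solve(A, sr * x)                # y = G (eps_r^{1/2} x)
    z = np.linalg.solve(A.conj().T, eps_p * y)    # z = G^dagger (eps_p y)
    return 4.0 * sr * z

def crp(E, H, eps_r, eps_p):
    """N(E) = Tr P, accumulated one unit vector at a time."""
    n = H.shape[0]
    total = 0.0
    for k in range(n):
        e = np.zeros(n)
        e[k] = 1.0
        total += crp_operator_apply(E, H, eps_r, eps_p, e)[k]
    return total.real

# Hypothetical model: 40-site chain with absorbers on the first and last 10 sites.
n = 40
H = -np.eye(n, k=1) - np.eye(n, k=-1)
ramp = 0.5 * np.linspace(0.1, 1.0, 10) ** 2
eps_r = np.zeros(n)
eps_r[:10] = ramp[::-1]
eps_p = np.zeros(n)
eps_p[-10:] = ramp
N_E = crp(0.5, H, eps_r, eps_p)
print(N_E)
```

In a large problem the trace would not be accumulated column by column; an iterative eigensolver would request only a few applications of the operator, each costing the two solves shown above.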
Hamiltonian Sparsity Pattern
• Sparsity for d=2 and d>2
Non-zero Storage for Green’s Matrix
[Figure: Storage Required for F=3.5 Cutoff, 0.32 eV (3% Accuracy). Matrix size (MWords, log scale from 0.01 to 10^4) vs. dimension (1 to 10), with extrapolation curve.]
Eigenvalues vs. Total Energy
Performance vs. Processors
Banded Single-Processor Approach
• Compare diagonal and banded preconditioners in terms of reducing total iteration count and cost
• Compare iterative methods (Davidson, GMRES) to benchmark PETSc results
• Evaluate relative cost of banded operations with sparse-matrix approach in PETSc
Diagonal vs. Banded Preconditioner
[Figure: time to solution (min, log scale from 10^-2 to 10^14) vs. number of dimensions (3 to 13) for the banded and diagonal preconditioners, with reference lines at 1 hour, 1 day, and 512 hours.]
Non-LU Decomposition Time
[Figure: time (in 0.01 s, log scale from 10 to 100000) vs. number of dimensions, with exponential fit Y = M0*e^(M1*X), M0 = 0.053049, M1 = 2.5021, R = 0.99999.]
Time for LU Preconditioner
[Figure: LU time (in 0.01 s, log scale from 10^0 to 10^5) vs. number of dimensions (3 to 5), with exponential fit Y = M0*e^(M1*X), M0 = 4.0707e-05, M1 = 3.9286, R = 0.99999.]
Results and Future Work
• Developing global and block orthogonal (W. Poirier) preconditioning methods
• Use SLEPc for Lanczos iteration and tensor products for efficient matrix-vector multiplies
• Use NLCF IBM BGL to solve 10-DOF problems
Argonne National Laboratory
A U.S. Department of Energy Office of Science Laboratory, Operated by The University of Chicago
Improving the Performance of Tensor Matrix-Vector Product
Dinesh Kaushik Argonne National Laboratory
Tensor Matrix Vector Product
• Operator comes from the tensor product of a dense matrix with the identity matrix:
  $w = A v$, where $A = A_x \otimes I \otimes I + I \otimes A_y \otimes I + I \otimes I \otimes A_z$
• $A_x$, $A_y$, $A_z$ are one-directional operators (dense)
• $v$ and $w$ are vectors of size $n^3$
Two Ways
• Build the large sparse matrix
  - Large sparse matrix (of size n^3 x n^3 for the 3D case)
  - Slow: memory-bandwidth-limited performance
• Just evaluate the action of A on v (without explicitly forming A)
  - Done as dense matrix-matrix multiplication
  - Very efficient implementation
  - Huge savings in memory
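To make the memory argument concrete, the following sketch counts the bytes needed by each approach. It assumes 4-byte integers and 8-byte doubles, a compressed-row layout for the assembled matrix, and fully dense one-directional operators, so that each row of the assembled operator holds n entries per direction with only the diagonal shared. The function names are hypothetical.

```python
def sparse_storage_bytes(n, d=3):
    """Bytes for the assembled operator in compressed-row form (assumed layout)."""
    rows = n ** d
    nnz_per_row = d * (n - 1) + 1          # n entries per direction, diagonal shared
    nnz = rows * nnz_per_row
    return 4 * (rows + 1) + nnz * (4 + 8)  # row pointers + (column index, value) pairs

def tensor_storage_bytes(n, d=3):
    """Bytes for the d dense n-by-n one-directional operators only."""
    return d * n * n * 8

for n in (7, 20, 50):
    ratio = sparse_storage_bytes(n) / tensor_storage_bytes(n)
    print(n, sparse_storage_bytes(n), tensor_storage_bytes(n), round(ratio, 1))
```

Under these assumptions the ratio between the two grows roughly like n^2 in three dimensions, which is the source of the memory savings claimed for the matrix-free approach.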
Performance Issues for Sparse Matrix Vector Product
• Little data reuse
• High ratio of load/store to instructions/floating-point ops
• Stalling of multiple load/store functional units on the same cache line
• Low available memory bandwidth
Sparse Matrix Vector Algorithm: A General Form
    for every row, i {
        fetch ia(i+1)
        for j = ia(i) to ia(i+1) - 1 {    // loop over the non-zeros of the row
            fetch ja(j), a(j), x1(ja(j)), ..., xN(ja(j))
            do N fmadd (floating multiply add)
        }
        store y1(i), ..., yN(i)
    }
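The general form above can be written as a small runnable sketch. It uses plain Python lists and 0-based indexing for the `ia`, `ja`, and `a` arrays; the function name and the 3 x 3 example matrix are made up for illustration and are not taken from PETSc.

```python
def csr_multi_matvec(ia, ja, a, xs):
    """Compute y_k = A x_k for every vector in xs, with A stored as ia/ja/a (0-based)."""
    m = len(ia) - 1          # number of rows
    N = len(xs)              # number of input vectors
    ys = [[0.0] * m for _ in range(N)]
    for i in range(m):
        sums = [0.0] * N
        for j in range(ia[i], ia[i + 1]):      # loop over the non-zeros of row i
            col, val = ja[j], a[j]             # fetched once, reused for all N vectors
            for k in range(N):
                sums[k] += val * xs[k][col]    # one multiply-add per vector
        for k in range(N):
            ys[k][i] = sums[k]                 # store y1(i), ..., yN(i)
    return ys

# Example matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] applied to two vectors at once.
ia = [0, 2, 3, 5]
ja = [0, 2, 1, 0, 2]
a = [2.0, 1.0, 3.0, 4.0, 5.0]
ys = csr_multi_matvec(ia, ja, a, [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]])
print(ys)
```

Each matrix entry and column index is loaded once per row sweep and reused for all N vectors, which is why the bytes moved per multiply-add fall as N grows.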
Estimating the Memory Bandwidth Limitation
Assumptions
• Perfect cache (only compulsory misses; no overhead)
• No memory latency
• Unlimited number of loads and stores per cycle
Data Volume (AIJ Format), for an m x n matrix with nz nonzeros and N vectors
    m*sizeof(int) + N*(m+n)*sizeof(double)    // ia, N input (size n) and output (size m) vectors
    + nz*(sizeof(int) + sizeof(double))       // ja and a arrays
    = 4*(m+nz) + 8*(N*(m+n) + nz)
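Estimating the Memory Bandwidth Limitation (Contd.)
• Number of floating-point multiply-add (fmadd) ops = N*nz
• For square matrices (m = n), dividing the AIJ data volume by N*nz gives
  Bytes transferred / fmadd = (16 + 4/N) * n/nz + 12/N
  (Since nz >> n, bytes transferred / fmadd ~ 12/N)
• A similar estimate holds for the Block AIJ (BAIJ) format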
Realistic Measures of Peak Performance: Sparse Matrix Vector Product
One vector (N = 1), matrix size m = 90,708, nonzero entries nz = 5,047,120
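As a worked instance of the AIJ data-volume estimate, the sketch below evaluates the bytes moved per multiply-add for the matrix quoted above and converts a sustained memory bandwidth into a flop-rate bound. The function names are hypothetical, and the 1 GB/s bandwidth in the example is only a placeholder, not a measured figure for any machine in these slides.

```python
def bytes_per_fmadd(m, n, nz, N=1, int_bytes=4, double_bytes=8):
    """AIJ data volume divided by the number of multiply-adds (N * nz)."""
    volume = int_bytes * (m + nz) + double_bytes * (N * (m + n) + nz)
    return volume / (N * nz)

def mflops_bound(bandwidth_bytes_per_s, m, n, nz, N=1):
    """Bandwidth-limited bound in Mflops/s, counting each fmadd as 2 flops."""
    return 2.0 * bandwidth_bytes_per_s / bytes_per_fmadd(m, n, nz, N) / 1e6

m = n = 90708
nz = 5047120
print(bytes_per_fmadd(m, n, nz))      # close to the 12/N limit for N = 1
print(mflops_bound(1.0e9, m, n, nz))  # bound for a placeholder bandwidth of 1 GB/s
```

Raising N in the call shows how multiplying several vectors at once lowers the bytes per multiply-add and lifts the bound.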
Second Choice: Dense Matrix-Matrix Multiplication
• We just need to store the small dense matrices of size n x n
  - For 3 dimensions, the memory needed is 3n^2
  - Good ratio of flops to bytes: O(n^4) operations on O(n^3) doubles
  - Gets better for higher dimensions
Evaluating the Tensor Product Terms
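• Type 1: $(I_m \otimes A_n)\, v_{nm} = [A]_{n \times n}\, [V]_{n \times m}$
• Type 2: $(A_n \otimes I_m)\, v_{mn} = [V]_{m \times n}\, [A]^{T}_{n \times n}$
• Type 3: $(I_p \otimes A_n \otimes I_m)\, v_{pnm}$
  - Loop over Type 2 for i = 1, ..., p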
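The three term types can be sketched with NumPy as follows. This is an illustrative version under stated assumptions: vectors are reshaped in column-major (Fortran) order so that the identities match `np.kron`, the function names are hypothetical, and NumPy's `@` stands in for the Custom, MXM, or DGEMM kernels compared in the performance slides.

```python
import numpy as np

def type1(A, v, m):
    """kron(I_m, A) @ v: view v as an n-by-m matrix V and form A V."""
    n = A.shape[0]
    V = v.reshape((n, m), order="F")
    return (A @ V).reshape(-1, order="F")

def type2(A, v, m):
    """kron(A, I_m) @ v: view v as an m-by-n matrix V and form V A^T."""
    n = A.shape[0]
    V = v.reshape((m, n), order="F")
    return (V @ A.T).reshape(-1, order="F")

def type3(A, v, p, m):
    """kron(I_p, kron(A, I_m)) @ v: apply Type 2 to each of the p blocks of v."""
    n = A.shape[0]
    out = np.empty(v.shape, dtype=np.result_type(A, v))
    for i in range(p):
        block = slice(i * n * m, (i + 1) * n * m)
        out[block] = type2(A, v[block], m)
    return out

def apply_3d(Ax, Ay, Az, v):
    """w = (Ax x I x I + I x Ay x I + I x I x Az) v without forming the n^3 x n^3 matrix."""
    n = Ax.shape[0]
    return type2(Ax, v, n * n) + type3(Ay, v, n, n) + type1(Az, v, n * n)

rng = np.random.default_rng(0)
n = 4
Ax, Ay, Az = (rng.standard_normal((n, n)) for _ in range(3))
v = rng.standard_normal(n ** 3)
w = apply_3d(Ax, Ay, Az, v)
```

Only the three n x n factors are stored, and all the work is done by dense matrix-matrix products on reshaped views of v.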
Performance of Tensor Matrix-Vector Multiplication: 3D Case
(Intel Madison Processor 1.5 GHz, 6 Gflops/s Peak, 4 GB Memory)
Memory Bandwidth Limited Bound: 670 Mflops/s
[Figure: Mflops/s (500 to 6000) vs. n (20 to 100) for the Custom, MXM, and DGEMM implementations.]
Performance of Tensor Matrix-Vector Multiplication: Fixed Mesh Points (n=7)
(Intel Madison Processor 1.5 GHz, 6 Gflops/s Peak, 4 GB Memory)
[Figure: Mflops/s (250 to 3500) vs. number of dimensions (3 to 10) for the Custom, MXM, and DGEMM implementations.]
Performance of Tensor Matrix-Vector Multiplication: Long Reaction Coordinate
(51 points along reaction path and 7 points in other dimensions)
[Figure: Mflops/s (1000 to 5000) vs. number of dimensions (3 to 10) for the Custom, MXM, and DGEMM implementations.]
Conclusions and Future Work
• Very efficient implementation
  - Sparse matvecs take about 80% of execution time
  - We expect that the tensor product implementation can improve the performance by a factor of three to five
• Possible to solve much larger problems because of huge savings in memory requirement
• Parallel implementation
Acknowledgements
• Barry Smith, William Gropp, and Paul Fischer for many helpful discussions