Ab Initio Quantum Chemistry
on Graphics Processing Units
Rethinking Algorithms for Massively Parallel Architectures
Jörg Kussmann
Theoretical Chemistry, University of Munich (LMU)
23rd May 2014
J. Kussmann Quantum Chemistry@GPU
Outline
Introduction
Challenges of Ab Initio Quantum Chemistry
Optimizing SCF-Algorithms @ GPUs
Data-Arrangement
Coulomb-, Exchange-, XC-Potential
Exchange Potential: GPU-specific optimization
Exemplary Calculations: SCF & Properties
Hybrid MPI/CUDA Parallelization
Outlook: Post-HF Algorithms @ GPUs
Challenge
SOS-MP2 @ GPUs
PART 1: Ab Initio Methods
Schrödinger equation:
$H\Psi = i\hbar\,\frac{\partial \Psi}{\partial t} \;\xrightarrow{\text{stat.}}\; H\Psi = E\Psi$
Molecular properties:
Energetics/Geometries
Vibrational frequencies
Electric properties
Magnetic properties
Dynamic properties
Conventional methods:
Systems with up to a few hundred atoms at the HF or KS-DFT level (at least $O(M^3)$)
Aim: Reduce scaling to $O(M)$!
Computational Effort: SCF Calculations
Roothaan-Hall: $FC = SC\epsilon$
$F_{\mu\nu} = h^{\text{core}}_{\mu\nu} + J_{\mu\nu}[P] - (1 - a)\,K_{\mu\nu}[P] + V^{XC}_{\mu\nu}[a, P]$
$a = 0$: HF; $0 < a < 1$: hybrid DFT; $a = 1$: KS-DFT
Rate-determining steps:
1) Fock build: $O(N^2) \longrightarrow O(N)$
2) Diagonalization $F \longrightarrow C$: $O(N^3) \longrightarrow O(N)$
[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]
Example: 16 A-T base pairs
HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{conv}} = 10^{-7}$)
1052 atoms, 11230 basis functions
3 078 087 function pairs
$9.5 \times 10^{12}$ primitive two-electron integrals
O(N) Fock build (8 cores): 30 000 s
(19 SCF iterations for tight convergence)
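The Fock-matrix expression on this slide can be sketched in a few lines. This is a minimal illustration only: the function name `fock_matrix`, the nested-list matrices, and the toy numbers are assumptions made here, and the exchange-correlation potential is simply passed in as a precomputed matrix rather than evaluated from a and P.

```python
def fock_matrix(h_core, j, k, v_xc, a):
    """Assemble F = h_core + J - (1 - a) * K + V_xc element by element.

    All matrices are plain nested lists of equal shape. `a` follows the
    convention of the slide (a = 0: HF, a = 1: KS-DFT without exact exchange).
    `v_xc` is assumed to be precomputed by the caller.
    """
    n_rows = len(h_core)
    n_cols = len(h_core[0])
    return [
        [h_core[m][n] + j[m][n] - (1.0 - a) * k[m][n] + v_xc[m][n]
         for n in range(n_cols)]
        for m in range(n_rows)
    ]


# Toy 1x1 "matrices" (made-up numbers, for illustration only).
h, j, k = [[1.0]], [[2.0]], [[0.5]]
f_hf = fock_matrix(h, j, k, [[0.0]], a=0.0)      # full exact exchange
f_dft = fock_matrix(h, j, k, [[-0.25]], a=1.0)   # no exact exchange
```

In this convention a hybrid functional with 25% exact exchange would correspond to a = 0.75, since the exchange matrix enters with the factor (1 - a).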
Moore’s Law: 1965-2010
Embrace new technologies: GPUs
Implementation of GPU-algorithms
Automatic code generation
All double-precision, higher l-qn support
Coulomb
McMurchie-Davidson based J-engine
Pre/Post-processing on CPU
Ignore bra/ket symmetry (2 x integrals)
Exchange
McMurchie-Davidson
Evaluate complete integral on GPU
Exploit only 1 permutational symmetry (4 x integrals)
One thread per primitive integral: fine-grained data arrangement
[Ufimtsev/Martinez, JCTC 4, 222 (2008)]
Coulomb Potential
Exchange Potential
Implementation of GPU-algorithms
Coulomb very fast, try to improve on exchange first...
A) Reduce scaling to linear
B) Reduce local memory effort
C) Reduce shared memory effort
A) PreLinK: O(N) Exact Exchange on GPUs
Problem: O(N) algorithms involve a lot of book-keeping, branching, and communication
Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs µ, λ
      Determine significant (µλ|σν) quartets:
        $Q_{\mu\lambda} P^{\max}_{\lambda\sigma} Q_{\sigma\nu} \ge \vartheta_{\text{int}}$ (+ permutations)
      Loop: ket shell-pairs σ, ν
        Evaluate: Kµν, Kµσ, Kλν, Kλσ
      End Loop
    End Loop
  End Loop
End Loop
Screening within inner loop
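To make the branching problem concrete, here is a deliberately naive sketch of screening inside the loops. It is a brute-force illustration over all index quartets, not the actual linear-scaling algorithm, which avoids the quartic loop by working with pre-sorted lists of significant pairs. The function name, the per-function (rather than per-shell) indices, and the toy matrices are assumptions made here.

```python
def screened_exchange_loop(q, p_abs, threshold):
    """Screen inside the loops, then 'evaluate' the surviving quartets.

    q[m][l]     : Schwarz factors Q_ml = sqrt((ml|ml))
    p_abs[l][s] : magnitudes |P_ls|, standing in for P^max
    Returns the quartets that would be handed to the integral code.
    """
    n = len(q)
    evaluated = []
    for mu in range(n):                 # bra pair (mu, lam)
        for lam in range(n):
            for sig in range(n):        # ket pair (sig, nu)
                for nu in range(n):
                    estimate = q[mu][lam] * p_abs[lam][sig] * q[sig][nu]
                    if estimate >= threshold:   # data-dependent branch
                        evaluated.append((mu, lam, sig, nu))
    return evaluated


# Toy input (made-up numbers): two weakly coupled basis functions.
q_toy = [[1.0, 0.1], [0.1, 1.0]]
p_toy = [[1.0, 0.0], [0.0, 1.0]]
survivors = screened_exchange_loop(q_toy, p_toy, threshold=0.5)
```

The `if` in the innermost loop is the point of the slide: whether a thread does work depends on the data, which maps poorly onto GPU hardware where neighbouring threads should follow the same path.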
Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK
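$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma}$
Schwarz: $(\mu\lambda|\nu\sigma) \le Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$
PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda} Q_{\nu\sigma} |P_{\lambda\sigma}| \ge K_{\mu\nu}$
$\longrightarrow Q' = Q \times |P| \times Q$
Determine significant elements of K from Q'!
[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]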
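The pre-selection step above amounts to two matrix products followed by a threshold test. A minimal sketch, assuming small dense nested-list matrices and a hypothetical function name `prelink_preselect` (a production code would use sparse algebra and shell-level quantities):

```python
def prelink_preselect(q, p, threshold):
    """Build Q' = Q |P| Q^T and list the (mu, nu) pairs with Q' >= threshold.

    q : Schwarz factors Q_ml = sqrt((ml|ml)), a symmetric matrix
    p : density matrix; only the magnitudes |P_ls| are used
    Q' bounds the magnitude of the exchange-matrix elements, so pairs
    below the threshold are skipped before any integral is evaluated.
    """
    n = len(q)
    q_prime = [
        [sum(q[mu][lam] * abs(p[lam][sig]) * q[nu][sig]
             for lam in range(n) for sig in range(n))
         for nu in range(n)]
        for mu in range(n)
    ]
    significant = [(mu, nu) for mu in range(n) for nu in range(n)
                   if q_prime[mu][nu] >= threshold]
    return q_prime, significant


# Toy chain of three functions with nearest-neighbour overlap (made-up numbers).
q_toy = [[1.0, 0.1, 0.0],
         [0.1, 1.0, 0.1],
         [0.0, 0.1, 1.0]]
p_toy = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
q_prime, significant = prelink_preselect(q_toy, p_toy, threshold=0.05)
```

Here the end-to-end pairs (0, 2) and (2, 0) fall below the toy threshold and are dropped, while all other pairs are kept. The screening is a plain matrix operation with no data-dependent branching inside the integral kernels.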
A) PreLinK: Pre-Selection Threshold
|P| Overestimation of K
16 α-D-glucose units, HF/SVP
A) PreLinK: Pre-Selection Threshold
Effect of pre-selection on final SCF energy
DNA fragment with 4 A-T base pairs, HF/SVP
($\vartheta_{\text{conv}} = 10^{-7}$, $\vartheta_{\text{int}} = 10^{-10}$).
Errors in µHartree.
Error always below convergence criterion
A) PreLinK: Timings
Linear alkanes, HF/SV, max.: C640H1282
1-4 x NVIDIA M2090 (old generation; Kepler: approx. 3 x faster)
B) Improving the Exchange: Reduced Local Memory
16 A-T base pairs, HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$, 1 x GTX Titan)
Resort to Rys-quadrature for larger total l-qn
C) Improving the Exchange: Reduced Shared Memory
Shared Memory per thread-block
Most suitable size: 8 x 8 thread-blocks, use shared memory for Kµν
Ex.: d-shells (l-qn = 2), 48 kB shared memory
36 Cartesian Kµν elements
Memory per thread-block: 8 x 8 threads x 8 bytes (double) x 36 = 18.4 kB
Max. 2 thread-blocks per SMX, only 128 out of 192 cores busy
25 pure (spherical) Kµν elements
Memory per thread-block: 8 x 8 threads x 8 bytes (double) x 25 = 12.8 kB
Max. 3 thread-blocks per SMX, 192 out of 192 cores busy
Direct transformation to pure allows larger l-qn shells!
Ex.: 2 A-T base pairs, HF/TZVP
267 s (cart) vs 216 s (pure)
Significant impact: 20% speedup
Only ca. 7% of l-qn combinations affected
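The shared-memory arithmetic on this slide can be reproduced with a short helper. This is a sketch under stated assumptions: the function name is made up, 48 kB is taken as 48 x 1024 bytes, and shared memory is treated as the only limit on resident thread-blocks (registers and other hardware limits are ignored).

```python
def shared_memory_occupancy(k_elements, threads_per_block=8 * 8,
                            bytes_per_value=8, shared_bytes=48 * 1024,
                            cores_per_smx=192):
    """Return (bytes per block, resident blocks per SMX, busy cores).

    k_elements : number of K elements each thread accumulates in
                 shared memory (36 Cartesian or 25 pure for d-shells)
    """
    bytes_per_block = threads_per_block * bytes_per_value * k_elements
    resident_blocks = shared_bytes // bytes_per_block
    busy_cores = min(resident_blocks * threads_per_block, cores_per_smx)
    return bytes_per_block, resident_blocks, busy_cores


cartesian = shared_memory_occupancy(6 * 6)   # (18432, 2, 128)
pure = shared_memory_occupancy(5 * 5)        # (12800, 3, 192)
```

Dropping from 36 to 25 elements per thread is what lets a third thread-block fit, so all 192 cores of the SMX can be kept busy.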
Exemplary Calculations: Water-Cluster
SCF Fock-Build and Nuclear Gradient (4 x GTX Titan, PBE0/SVP, 75/302)
PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]
NMR-Shieldings @ GPU
Timings: Water-Clusters (4 x GTX Titan, PBE0/SVP, 75/302)
Algorithm
dJ/dB: Reuse SCF-kernels with l + 1, different post-processing
dK/dB: Special GPU-kernels
K [dP/dB]: 6 x SCF-kernels (skew symmetry)
CIS/RPA @ GPU
Timings: Water-Clusters (4 x GTX Titan, PBE/SVP, 75/302)
Hybrid MPI/CUDA Parallelization: SCF Calculations
HF/SVP (single Fock build, $\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$)
Systems: 16 A-T base pairs; (H$_2$O)$_{1123}$
Hardware/Parallelization
Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan
Primitive Load-balancing, Master-Slave work distribution
1 Gb Ethernet
Hybrid MPI/CUDA Parallelization: MutM@H2O
Post-HF @ GPUs
Challenge
Less favorable scaling, conventionally $O(N^5)$ at best (MP2)
Linear algebra, not integral evaluation, is rate-determining
Porting CPU algorithms yields only small speedups
Problem: DGEMM speedup is rather small (approx. 8x)
Ansatz
Re-consider algorithms with GPUs in mind
First attempt: SOS-RI-MP2 [$O(N^4)$]
[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]
Post-HF @ GPUs: SOS-RI-MP2
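$$E_{\text{OS}}^{\text{RI-MP2}} = -\sum_{ijab} \sum_{RSR'S'} \frac{(ia|R)\,[J^{-1}]_{RS}\,(S|jb)\,(ia|R')\,[J^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$$
$J_{RS}$: two-center two-electron integrals (aux. basis)
Laplace transform:
$$E_{\text{OS}}^{\text{RI-AO-MP2}} = -\sum_{\alpha} \sum_{\substack{\mu\nu\lambda\sigma \\ \mu'\nu'\lambda'\sigma'}} \sum_{RSR'S'} P^{\text{occ}}_{\mu\mu'} P^{\text{virt}}_{\nu\nu'} P^{\text{occ}}_{\lambda\lambda'} P^{\text{virt}}_{\sigma\sigma'}\, (\mu\nu|R)\,[J^{-1}]_{RS}\,(S|\lambda\sigma)\,(\mu'\nu'|R')\,[J^{-1}]_{R'S'}\,(S'|\lambda'\sigma')$$
Evaluation via intermediates ($(R|\underline{\mu}\overline{\nu})$: integrals transformed with the occupied and virtual pseudo-densities):
$$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\,P^{\text{occ}}_{\mu\mu'} P^{\text{virt}}_{\nu\nu'}\,(\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})\,(\mu\nu|S)$$
Correlation energy:
$$E_{\text{OS}}^{\text{RI-AO-MP2}} = -\sum_{\alpha} \sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR} \quad \text{with} \quad \tilde{Z} = Z J^{-1}$$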
[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
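The Laplace-transform step on this slide replaces the orbital-energy denominator by an exponential quadrature. A sketch of the standard identity follows; the quadrature points $t_\alpha$ and weights $w_\alpha$ are notation assumed here, since the slide only shows the sum over $\alpha$.

```latex
\frac{1}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}
  = \int_0^\infty e^{-(\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j)\,t}\,\mathrm{d}t
  \approx \sum_{\alpha} w_\alpha\, e^{-(\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j)\,t_\alpha}
```

The identity holds because the denominator is positive for occupied i, j and virtual a, b. The exponential factorizes into one factor per orbital, which is presumably how the orbital energies and weights end up absorbed into the occupied and virtual pseudo-densities.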
Post-HF @ GPUs: SOS-RI-MP2 @ GPUs
Ansatz
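Use Cholesky factors of pseudo-densities & sparse algebra: $O(N^3)$
Evaluate $Z_{RS}$ via J-engine on GPUs.
Algorithm
(1) Calculation of $(R|\mu\nu)$: $O(N^2)$
(2) Calculation of $J_{RS} = (R|S)$: $O(N^2)$
(3) Calculation of $J^{-1}$: $O(N^3)$
(4) Calculation of pseudo-densities: $O(N^3)$
(5) Transformation of $(R|\mu\nu)$ to $(R|\underline{\mu}\overline{\nu})$ (pseudo-density-transformed): $O(N^2)$
(6) Contraction $\sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (@ GPU): $O(N^3)$
(7) Multiplication $\tilde{Z} = Z J^{-1}$: $O(N^3)$
(8) Contraction $\sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$: $O(N^2)$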
[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
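Steps (6) to (8) of the algorithm above reduce to dense linear algebra once the three-index integrals are available. The following is a minimal sketch for a single Laplace point, assuming small dense nested-list matrices with the basis-function pair µν flattened into one column index. The function names and toy numbers are hypothetical, and the slide's approach evaluates step (6) with a J-engine on the GPU rather than as a plain matrix product.

```python
def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sos_energy_contribution(b_transformed, b_plain, j_inv):
    """Energy contribution of one Laplace point.

    b_transformed[R][p] : pseudo-density-transformed integrals, p = pair index
    b_plain[S][p]       : untransformed integrals (mu nu|S)
    j_inv               : inverse of the two-center matrix J
    """
    z = matmul(b_transformed, transpose(b_plain))   # step (6): Z_RS
    z_tilde = matmul(z, j_inv)                      # step (7): Z J^-1
    n = len(z_tilde)
    # step (8): -sum_RS Z~_RS Z~_SR
    return -sum(z_tilde[r][s] * z_tilde[s][r]
                for r in range(n) for s in range(n))


# Toy data (made-up numbers): 2 auxiliary functions, 2 function pairs.
bt = [[1.0, 0.0], [0.0, 2.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
e_point = sos_energy_contribution(bt, b, [[0.5, 0.0], [0.0, 0.5]])
```

The total energy would be the sum of such contributions over all Laplace points, which is left to the caller here.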
SOS-RI-MP2: J-engine@GPU
SOS-RI-MP2 @ GPU: Linear Alkanes
SOS-RI-MP2 @ GPU: DNA
Conclusions
Rethink algorithms, don’t simply transfer CPU-code
Coulomb: $O(N^2)$ J-engine, but small pre-factor
Efficient $O(N)$ exchange evaluation on GPUs by PreLinK
Performance/Cost
(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock)
Q-Chem @ 8 CPU-cores: ∼ 30000 s (∼ 2000 €)
FermiONs++ @ 4 M2090: ∼ 2100 s (∼ 10000 €)
FermiONs++ @ 4 Titan: ∼ 500 s (∼ 8000 €)
∼ 60 x faster, 4 x more expensive
Fine-grained data-arrangement
strong-scaling parallelization
FermiONs++: Release 2014
Acknowledgement
Prof. Dr. C. Ochsenfeld
Dr. Simon Maurer
Group
Thank you for your attention...