Ab Initio Quantum Chemistry on Graphics Processing Units

Rethinking Algorithms for Massively Parallel Architectures

Jörg Kussmann

Theoretical Chemistry, University of Munich (LMU)

23rd May 2014

J. Kussmann Quantum Chemistry@GPU


Outline

Introduction

Challenges of Ab Initio Quantum Chemistry

Optimizing SCF-Algorithms @ GPUs

Data-Arrangement

Coulomb-, Exchange-, XC-Potential

Exchange Potential: GPU-specific optimization

Exemplary Calculations: SCF & Properties

Hybrid MPI/CUDA Parallelization

Outlook: Post-HF Algorithms @ GPUs

Challenge

SOS-MP2 @ GPUs


PART 1: Ab Initio Methods

Schrödinger equation:

$\hat{H}\Psi = i\hbar\,\dot{\Psi} \;\xrightarrow{\text{stat.}}\; \hat{H}\Psi = E\Psi$

Molecular properties:

Energetics/Geometries

Vibrational frequencies

Electric properties

Magnetic properties

Dynamic properties

Conventional methods:

Systems with up to a few 100 atoms on HF- or KS-DFT level (min. $O(M^3)$)

Aim: Reduce scaling to $O(M)$!


Computational Effort: SCF Calculations

Roothaan-Hall: $FC = SC\epsilon$

$F_{\mu\nu} = h^{\text{core}}_{\mu\nu} + J_{\mu\nu}[P] - (1 - a)\,K_{\mu\nu}[P] + V^{\text{XC}}_{\mu\nu}[a, P]$

$a = 0$: HF

$0 < a < 1$: hybrid-DFT

$a = 1$: KS-DFT

Rate-determining steps:

1) Fock-Build: $O(N^2) \to O(N)$

2) Diagonalization, $F \to C$: $O(N^3) \to O(N)$

[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]

Example: 16 A-T base pairs

HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{conv}} = 10^{-7}$)

1052 atoms, 11230 basis functions

3 078 087 function pairs

$9.5 \times 10^{12}$ primitive 2-e$^-$ integrals

O(N) Fock-Build (8 cores): 30 000 s (19 SCF-iterations for tight convergence)
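To make the role of the mixing parameter a in the Fock-matrix expression concrete, the following minimal Python sketch assembles F from already computed matrices. It is an illustration only: the function name `fock_matrix` and the nested-list matrix layout are assumptions, and the expensive part (building J, K and the XC potential) is not shown.

```python
def fock_matrix(h_core, J, K, V_xc, a):
    """Assemble F = h_core + J - (1 - a) * K + V_xc element by element.

    All matrices are square nested lists of the same size (illustrative layout).
    a = 0 gives HF (V_xc is then zero), a = 1 gives pure KS-DFT (K drops out).
    """
    n = len(h_core)
    return [
        [h_core[m][v] + J[m][v] - (1.0 - a) * K[m][v] + V_xc[m][v] for v in range(n)]
        for m in range(n)
    ]


# Toy 1x1 example with made-up numbers.
F_hf = fock_matrix([[1.0]], [[2.0]], [[0.5]], [[0.0]], a=0.0)   # full exact exchange
F_ks = fock_matrix([[1.0]], [[2.0]], [[0.5]], [[-0.3]], a=1.0)  # no exact exchange
```

For hybrid functionals (0 < a < 1) both the exchange matrix K and the XC potential are needed, which is why the exchange build matters for HF and hybrid-DFT alike.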


Moore’s Law: 1965-2010

Embrace new technologies: GPUs


Implementation of GPU-algorithms

Automatic code generation

All double precision, support for higher angular momentum quantum numbers (l-qn)

Coulomb

McMurchie-Davidson based J-engine

Pre/Post-processing on CPU

Ignore bra/ket symmetry (2 x integrals)

Exchange

McMurchie-Davidson

Evaluate complete integral on GPU

Exploit only 1 permutational symmetry (4 x integrals)

1 thread / 1 prim. integral: fine-grained data arrangement

[Ufimtsev/Martinez, JCTC 4, 222 (2008)]


Coulomb Potential


Exchange Potential


Implementation of GPU-algorithms


Coulomb is very fast, so try to improve on the exchange first:

A) Reduce scaling to linear

B) Reduce local memory effort

C) Reduce shared memory effort


A) PreLinK: O(N) Exact Exchange on GPUs

Problem: O(N) algorithms employ loads of book-keeping, branching, communication

Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs µ, λ
      Determine sig. (µλ|σν) quartets: $Q_{\mu\lambda} P^{\max}_{\lambda\sigma} Q_{\sigma\nu} \ge \vartheta_{\text{int}}$ + permutations
      Loop: ket shell-pairs σ, ν
        Evaluate: $K_{\mu\nu}$, $K_{\mu\sigma}$, $K_{\lambda\nu}$, $K_{\lambda\sigma}$
      End Loop
    End Loop
  End Loop
End Loop

Screening within inner loop
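The loop structure above can be sketched in Python to show where the screening test sits. This is a simplified illustration, not the actual implementation: it works on single basis functions instead of shells, uses |P| in place of the block maximum, checks only one of the permutations, and counts quartets instead of evaluating integrals. The names `screen_in_inner_loop`, `Q_toy` and `P_toy` are hypothetical.

```python
def screen_in_inner_loop(Q, P, threshold):
    """Count quartets (mu lam|sig nu) that are tested and kept.

    Q: symmetric matrix of Schwarz estimates, P: density matrix (nested lists).
    The test is executed for every bra/ket combination, i.e. inside the
    innermost loop, which is the branching the slide refers to.
    """
    n = len(Q)
    tested = kept = 0
    for mu in range(n):
        for lam in range(n):          # bra pair (mu, lam)
            for sig in range(n):
                for nu in range(n):   # ket pair (sig, nu)
                    tested += 1
                    # One permutation only: lam and sig are coupled by the density.
                    if Q[mu][lam] * abs(P[lam][sig]) * Q[sig][nu] >= threshold:
                        kept += 1     # a real code would evaluate the integral here
    return tested, kept


# Made-up 3x3 toy data with a banded structure.
Q_toy = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]]
P_toy = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.2], [0.0, 0.2, 1.0]]
tested, kept = screen_in_inner_loop(Q_toy, P_toy, threshold=0.05)
```

Even in this toy case every quartet is visited and tested, although only a fraction survives; on a GPU each of these data-dependent branches costs performance.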


Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK

$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma}$

Schwarz: $(\mu\lambda|\nu\sigma) \le Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$

PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda} Q_{\nu\sigma} |P_{\lambda\sigma}| \ge K_{\mu\nu} \;\longrightarrow\; Q' = Q \times |P| \times Q$

Determine significant elements of K from Q'!

[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]
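The pre-selection step amounts to two matrix products followed by a threshold test. The sketch below illustrates it in plain Python, assuming dense nested-list matrices over basis functions and a symmetric Q; a production code would work on shell pairs with sparse algebra. The function names and the toy numbers are hypothetical.

```python
def prelink_estimate(Q, P):
    """Return Q' = Q x |P| x Q, an upper bound for the elements of K.

    Q is the symmetric matrix of Schwarz estimates, P the density matrix.
    """
    n = len(Q)
    abs_P = [[abs(x) for x in row] for row in P]
    # First product: (Q |P|)[mu][sig] = sum_lam Q[mu][lam] * |P[lam][sig]|
    QP = [[sum(Q[m][l] * abs_P[l][s] for l in range(n)) for s in range(n)]
          for m in range(n)]
    # Second product: Q'[mu][nu] = sum_sig (Q |P|)[mu][sig] * Q[sig][nu]
    return [[sum(QP[m][s] * Q[s][v] for s in range(n)) for v in range(n)]
            for m in range(n)]


def significant_elements(Q_prime, threshold):
    """List the (mu, nu) index pairs whose estimate reaches the threshold."""
    return [(m, v) for m, row in enumerate(Q_prime)
            for v, x in enumerate(row) if x >= threshold]


# Toy data: two well separated functions with a tiny off-diagonal density.
Q_toy = [[1.0, 0.0], [0.0, 1.0]]
P_toy = [[0.5, -1e-6], [-1e-6, 0.5]]
selected = significant_elements(prelink_estimate(Q_toy, P_toy), threshold=1e-3)
```

Only the selected elements of K are handed to the integral kernels, so the screening decision is taken once, before integral evaluation, instead of inside the innermost loop.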


A) PreLinK: Pre-Selection Threshold

|P| Overestimation of K

16 α-D-glucose units, HF/SVP


Effect of pre-selection on final SCF energy

DNA fragment with 4 A-T base pairs, HF/SVP ($\vartheta_{\text{conv}} = 10^{-7}$, $\vartheta_{\text{int}} = 10^{-10}$).

Errors in µHartree.

Error always below convergence criterion


A) PreLinK: Timings

Linear alkanes, HF/SV, max.: C$_{640}$H$_{1282}$

1-4 x NVIDIA M2090 (old generation; Kepler: approx. 3 x faster)


B) Improving the Exchange: Reduced Local Memory

16 A-T base pairs, HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$, 1 x GTX Titan)

Resort to Rys-quadrature for larger total l-qn


C) Improving the Exchange: Reduced Shared Memory

Shared Memory per thread-block

Most suitable size: 8x8 thread-blocks, use shared memory for Kµν

Ex.: d-shells (l-qn = 2), 48 kB shared memory

36 cartesian Kµν elements

Memory per thread-block: 8 x 8 x 8 (double) x 36 = 18.4 kB

Max. 2 thread-blocks per SMX, only 128 out of 192 cores

25 pure Kµν elements

Memory per thread-block: 8 x 8 x 8 (double) x 25 = 12.8 kB

Max. 3 thread-blocks per SMX, 192 out of 192 cores

Direct transformation to pure allows larger l-qn shells!

Ex.: 2 A-T base pairs, HF/TZVP

267 s (cart) vs 216 s (pure)

Significant impact: 20% speedup

Only ca. 7% of l-qn combinations affected
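The shared-memory arithmetic above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: 48 kB is taken as 48 x 1024 bytes, each 8x8 thread-block stores one double per thread and K element, and shared memory is treated as the only limit on resident blocks (registers and other hardware limits are ignored). `blocks_per_smx` is a hypothetical helper name.

```python
def blocks_per_smx(k_elements, threads_per_block=8 * 8, bytes_per_value=8,
                   shared_bytes=48 * 1024):
    """Return (shared memory per block in bytes, resident blocks per SMX)."""
    per_block = threads_per_block * bytes_per_value * k_elements
    return per_block, shared_bytes // per_block


results = {}
for label, n_k in (("cartesian", 36), ("pure", 25)):   # d-shell pair: 6x6 vs 5x5
    per_block, blocks = blocks_per_smx(n_k)
    results[label] = (per_block, blocks, blocks * 8 * 8)  # last entry: busy threads
    print(label, per_block, "bytes/block,", blocks, "blocks,", blocks * 64, "threads")
```

With 36 cartesian elements only two blocks (128 threads) fit, with 25 pure elements three blocks (192 threads) fit, which matches the numbers on the slide. The quoted timings give 216/267 ≈ 0.81, i.e. roughly 19% less run time.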


Exemplary Calculations: Water-Cluster

SCF Fock-Build and Nuclear Gradient (4 x GTX Titan, PBE0/SVP, 75/302)

PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]


NMR-Shieldings @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE0/SVP, 75/302)

Algorithm

dJ/dB: Reuse SCF-kernels with l + 1, different post-processing

dK/dB: Special GPU-kernels

K [dP/dB]: 6 x SCF-kernels (skew symmetry)


CIS/RPA @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE/SVP, 75/302)


Hybrid MPI/CUDA Parallelization: SCF Calculations

HF/SVP (Single Fock-build, $\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$)

16 A-T base pairs and (H$_2$O)$_{1123}$

Hardware/Parallelization

Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan

Primitive Load-balancing, Master-Slave work distribution

1 Gb Ethernet


Hybrid MPI/CUDA Parallelization: MutM@H2O


Post-HF @ GPUs

Challenge

Less favorable scaling, conv. $O(N^5)$ at best (MP2)

Not integral evaluation, but linear algebra rate-determining

Porting CPU-algorithms shows small speedups only

Problem: DGEMM-speedup is rather small (ca. x 8)

Ansatz

Re-considering algorithms with GPUs in mind

First attempt: SOS-RI-MP2 [$O(N^4)$]

[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]


Post-HF @ GPUs: SOS-RI-MP2

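$E_{\text{OS}}^{\text{RI-MP2}} = -\sum_{ijab} \sum_{RSR'S'} \frac{(ia|R)\,[J^{-1}]_{RS}\,(S|jb)\;(ia|R')\,[J^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$

$J_{RS}$: two-center/two-electron integrals (aux. basis)

Laplace-Transform:

$E_{\text{OS}}^{\text{RI-AO-MP2}} = -\sum_{\alpha} \sum_{\mu\nu\lambda\sigma} \sum_{\mu'\nu'\lambda'\sigma'} \sum_{RSR'S'} P^{\text{occ}}_{\mu\mu'} P^{\text{virt}}_{\nu\nu'} P^{\text{occ}}_{\lambda\lambda'} P^{\text{virt}}_{\sigma\sigma'}\, (\mu\nu|R)\,[J^{-1}]_{RS}\,(S|\lambda\sigma)\; (\mu'\nu'|R')\,[J^{-1}]_{R'S'}\,(S'|\lambda'\sigma')$

Evaluation via Intermediates:

$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\, P^{\text{occ}}_{\mu\mu'} P^{\text{virt}}_{\nu\nu'}\, (\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\bar{\nu})\,(\mu\nu|S)$

Here $(R|\underline{\mu}\bar{\nu})$ denotes the three-index integral transformed with the occupied and virtual pseudo-densities, as defined by the equality above.

Correlation Energy: $E_{\text{OS}}^{\text{RI-AO-MP2}} = -\sum_{\alpha} \sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$ with $\tilde{Z} = Z J^{-1}$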

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
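The evaluation via the intermediates Z can be sketched in plain Python for tiny dense matrices. This is an illustration under assumptions, not the published implementation: the three-index integrals are stored as nested lists indexed [pair µν][auxiliary function], the pseudo-density transformed integrals for each Laplace point α are taken as given, any quadrature weights or scaling factors are assumed to be folded into them, and the inverse of J is assumed to be precomputed. All function names are hypothetical.

```python
def z_intermediate(B_transformed, B):
    """Z[R][S] = sum over pairs p of B_transformed[p][R] * B[p][S]."""
    n_aux = len(B[0])
    return [[sum(bt[R] * b[S] for bt, b in zip(B_transformed, B))
             for S in range(n_aux)] for R in range(n_aux)]


def matmul(A, B):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def sos_ri_ao_mp2_energy(B, B_transformed_per_point, J_inv):
    """Return -sum_alpha sum_RS Zt[R][S] * Zt[S][R] with Zt = Z J^-1."""
    energy = 0.0
    for B_transformed in B_transformed_per_point:     # loop over Laplace points
        Z_tilde = matmul(z_intermediate(B_transformed, B), J_inv)
        n = len(Z_tilde)
        energy -= sum(Z_tilde[R][S] * Z_tilde[S][R]
                      for R in range(n) for S in range(n))
    return energy


# Made-up 1x1 example: one function pair, one auxiliary function, two Laplace points.
E_toy = sos_ri_ao_mp2_energy([[2.0]], [[[3.0]], [[1.0]]], [[0.5]])
```

The design point is that only matrices in the auxiliary basis (Z, its product with the inverse of J) appear in the final contraction, so the four-index quantities of the conventional expression never have to be formed.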


Post-HF @ GPUs: SOS-RI-MP2 @ GPUs

Ansatz

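Use Cholesky factors of pseudo-densities & sparse algebra: $O(N^3)$

Evaluate $Z_{RS}$ via J-engine on GPUs.

Algorithm

(1) Calculation of $(R|\mu\nu)$: $O(N^2)$

(2) Calculation of $J_{RS} = (R|S)$: $O(N^2)$

(3) Calculation of $J^{-1}$: $O(N^3)$

(4) Calculation of pseudo-densities: $O(N^3)$

(5) Transformation of $(R|\mu\nu)$ to $(R|\underline{\mu}\bar{\nu})$ (contraction with the pseudo-densities): $O(N^2)$

(6) Contraction $\sum_{\mu\nu} (R|\underline{\mu}\bar{\nu})(\mu\nu|S)$ (@ GPU): $O(N^3)$

(7) Multiplication $\tilde{Z} = Z J^{-1}$: $O(N^3)$

(8) Contraction $\sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$: $O(N^2)$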

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]


SOS-RI-MP2: J-engine@GPU


SOS-RI-MP2 @ GPU: Linear Alkanes


SOS-RI-MP2 @ GPU: DNA


Conclusions

Rethink algorithms, don’t simply transfer CPU-code

Coulomb: $O(N^2)$ J-engine, but small pre-factor

Efficient O(N) exchange evaluation on GPUs by PreLinK

Performance/Cost

(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock)

Q-Chem @ 8 CPU-cores: ~30000 s (~2000 €)

FermiONs++ @ 4 M2090: ~2100 s (~10000 €)

FermiONs++ @ 4 Titan: ~500 s (~8000 €)

~60 x faster, 4 x more expensive
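The performance/cost comparison follows directly from the quoted numbers, as this small Python check illustrates. The dictionary layout and the helper name `relative_to` are hypothetical; the timings and prices are the approximate values from the slide.

```python
# (wall time in s for one Fock build, approximate hardware cost)
SYSTEMS = {
    "Q-Chem @ 8 CPU-cores": (30000.0, 2000.0),
    "FermiONs++ @ 4 M2090": (2100.0, 10000.0),
    "FermiONs++ @ 4 Titan": (500.0, 8000.0),
}


def relative_to(baseline, other):
    """Return (speedup, cost factor) of `other` with respect to `baseline`."""
    base_time, base_cost = SYSTEMS[baseline]
    time, cost = SYSTEMS[other]
    return base_time / time, cost / base_cost


speedup, cost_factor = relative_to("Q-Chem @ 8 CPU-cores", "FermiONs++ @ 4 Titan")
print(f"{speedup:.0f} x faster, {cost_factor:.0f} x more expensive")
```

For the four Titan cards this gives a factor of 60 in time at 4 times the cost, i.e. roughly a 15-fold gain in performance per cost.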

Fine-grained data-arrangement

strong-scaling parallelization

FermiONs++: Release 2014


Acknowledgement

◮ Prof. Dr. C. Ochsenfeld

◮ Dr. Simon Maurer

◮ Group

Thank you for your attention...
