Multiplying the speed-ups: GPU-accelerated, fast multipole BEM, for applications in protein electrostatics.

Lorena A Barba¹, Rio Yokota¹, Jaydeep P Bardhan², Matthew G Knepley³

¹ Boston University, ² Rush University Medical Center, ³ University of Chicago

Far left: A lysozyme molecule surface, shown in transparency, with the atomic locations inside. The surface has 100k points. Left: An arrangement of 1000 such molecules, randomly placed inside a cubic volume.

Lysozyme, abundant in secretions such as tears and saliva, is part of the immune system and a natural form of protection from pathogens such as E. coli.

With O(100k) panels on the surface, our FMM software is ~100x faster than direct summation on the GPU. On one GPU, the FMM-accelerated boundary element method (BEM) solver is ~10x faster than on one CPU; this is application speedup (not an inner kernel!).

GPU Technology Conference, September 2010

A quote. It expresses our motivation

The model. Continuum electrostatics

“... the fundamental law of computer science [is]: the faster the computer, the greater the importance of speed of algorithms.”

Trefethen & Bau, Numerical Linear Algebra, SIAM (1997)

What’s new? Fast algorithms on the GPU

The first wave of successful applications of GPUs to scientific computing was crowded with highly parallel methods. The paradigmatic example is molecular dynamics and other N-body simulations, where the embarrassingly parallel problem of calculating the all-pairs interactions exploits the fine-grained parallelism of the hardware very well. But the easy pickings are running out.
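The all-pairs calculation mentioned above can be sketched in a few lines. This is a minimal CPU illustration, not the poster's GPU code: the function name `direct_potential` and the bare 1/r kernel (no physical constants) are assumptions made for the example.

```python
import math


def direct_potential(positions, charges):
    """Potential at each particle due to all the others, with a 1/r kernel.

    phi_i = sum over j != i of q_j / |x_i - x_j|.
    The double loop makes the O(N^2) cost explicit; every (i, j) pair is
    independent, which is why this maps so well onto fine-grained hardware.
    """
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            phi[i] += charges[j] / math.dist(positions[i], positions[j])
    return phi
```

For example, two unit charges a distance 2 apart each see a potential of 0.5 from the other.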

Many important applications involve intricate algorithms that require “going back to the algorithmic drawing board” (to quote an Intel blog) for a successful implementation on the GPU. This is the case of fast, O(N) algorithms like the fast multipole method (FMM). The FMM accelerates N-body problems by representing clusters of bodies with series expansions, and using a hierarchical tree structure to organize the bodies in space. There are various operations needed in a tree or FMM algorithm. We have ported all of them to the GPU.
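The hierarchical tree structure can be illustrated with a uniform octree over the unit cube. The sketch below is an assumption-laden toy, not the authors' tree code: the names `cell_index`, `build_leaves`, and `parent` are hypothetical, and an adaptive tree would look different.

```python
def cell_index(point, level):
    """Integer (ix, iy, iz) of the cell containing a point of the unit cube.

    At tree level L the cube is split into 2**L cells per side.
    """
    cells_per_side = 2 ** level
    # min(...) keeps points lying exactly on the upper face inside the last cell
    return tuple(min(int(c * cells_per_side), cells_per_side - 1) for c in point)


def build_leaves(points, level):
    """Group particle indices by the leaf cell that contains them."""
    leaves = {}
    for i, p in enumerate(points):
        leaves.setdefault(cell_index(p, level), []).append(i)
    return leaves


def parent(cell):
    """Index of the parent cell one level up (each parent has 8 children)."""
    return tuple(c // 2 for c in cell)
```

Bodies that share a leaf form a cluster; following `parent` links upward gives the hierarchy along which expansions are combined.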

P2M: particle to multipole (treecode & FMM)
M2M: multipole to multipole (treecode & FMM)
M2L: multipole to local (FMM)
L2L: local to local (FMM)
L2P: local to particle (FMM)
P2P: particle to particle (treecode & FMM)
M2P: multipole to particle (treecode)

[Figure legend: source particles and target particles; information moves from red to blue.]
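Three of the kernels listed above (P2M, M2P, and P2P) can be sketched for a 1/r kernel with a Cartesian expansion truncated after the dipole term. This truncation and the function names are assumptions made for illustration; the poster does not state which expansion or order its software uses.

```python
import math


def p2m(sources, charges, center):
    """Particle to multipole: monopole and dipole moments about a cell center."""
    monopole = sum(charges)
    dipole = [
        sum(q * (s[k] - center[k]) for s, q in zip(sources, charges))
        for k in range(3)
    ]
    return monopole, dipole


def m2p(monopole, dipole, center, target):
    """Multipole to particle: evaluate the truncated expansion at a far target.

    phi ~ M / R + (p . R) / R^3, with R the vector from the center to the target.
    """
    r = [target[k] - center[k] for k in range(3)]
    dist = math.sqrt(sum(c * c for c in r))
    return monopole / dist + sum(dipole[k] * r[k] for k in range(3)) / dist ** 3


def p2p(sources, charges, target):
    """Particle to particle: exact direct sum, used for near neighbours."""
    return sum(q / math.dist(target, s) for s, q in zip(sources, charges))
```

For a far target, `m2p(*p2m(...), center, target)` should agree with `p2p(...)` up to an error that shrinks as the ratio of cluster size to distance decreases; that trade is what lets one expansion stand in for a whole cluster.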

Molecular dynamics is very detailed and accurate, but spends too much time computing the surrounding water molecules. An alternative model considers the molecule and the water as continuum dielectric media. Point charges are placed at the locations of atoms inside the molecule. This results in a mixed-dielectric Poisson problem.

The problem can be written as a boundary-integral equation for charge density on the molecular surface:
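One standard way to write such an equation is the apparent-surface-charge form sketched below. It follows from matching the normal displacement across the surface when the potential is represented with point charges plus a single-layer surface charge in a homogeneous medium. The symbols and the scaling are assumptions chosen for this sketch, and the poster's own equation may differ in sign convention or normalization.

```latex
% Assumptions (not taken from the poster):
%   region I (molecule interior) has permittivity \epsilon_I,
%   region II (solvent) has permittivity \epsilon_{II},
%   \Gamma is the molecular surface with outward normal n,
%   G(r, r') = 1 / (4 \pi |r - r'|), and q_k are point charges at r_k.
\[
\frac{1}{2}\,\frac{\epsilon_I + \epsilon_{II}}{\epsilon_I - \epsilon_{II}}\,
\sigma(\mathbf{r})
+ \mathrm{p.v.}\!\int_{\Gamma}
  \frac{\partial G(\mathbf{r},\mathbf{r}')}{\partial n(\mathbf{r})}\,
  \sigma(\mathbf{r}')\, dA'
= -\sum_{k} q_k\,
  \frac{\partial G(\mathbf{r},\mathbf{r}_k)}{\partial n(\mathbf{r})},
\qquad \mathbf{r} \in \Gamma .
\]
% With this convention the potential is
% \phi(r) = (1/\epsilon_I) [ \sum_k q_k G(r, r_k) + \int_\Gamma \sigma(r') G(r, r') dA' ].
```

As a sanity check on this form, setting the two permittivities equal makes the coefficient of the first term unbounded, which forces the surface charge to zero, as expected when there is no dielectric interface.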

Who cares? Bioelectrostatics is important

Method. Fast multipole BEM

Want more? Papers and software are online

[Figure: GFlops (axis 0 to 350) for the P2P, M2L, and M2P kernels, with N = 10^4, 10^5, 10^6, and 10^7.]

Actual performance on the GPU of three core kernels, for four different values of N.

The impact of GPU acceleration is greater for large problems. Also, the cross-over point between a direct evaluation and the FMM occurs at higher N on the GPU than CPU.

For N > 4 x 10^4, the FMM on the GPU is faster than the direct all-pairs evaluation.
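A simple cost model shows how a cross-over point arises and why it can move. This is a sketch under stated assumptions, not a measurement: it models the direct sum as a*N^2 and the FMM as b*N, and the names `crossover_n` and `shifted_crossover` are hypothetical.

```python
def crossover_n(direct_cost_per_pair, fmm_cost_per_body):
    """N at which a * N**2 (direct) equals b * N (FMM), i.e. N* = b / a."""
    return fmm_cost_per_body / direct_cost_per_pair


def shifted_crossover(n_star, direct_speedup, fmm_speedup):
    """Cross-over after the hardware speeds up each method by a given factor.

    The new constants are a / direct_speedup and b / fmm_speedup, so
    N*' = N* * direct_speedup / fmm_speedup.
    """
    return n_star * direct_speedup / fmm_speedup
```

Under this model, if the hardware accelerates the direct sum by a larger factor than it accelerates the FMM, the cross-over shifts to larger N, which is one plausible reading of the observation above.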

Results. Multi-GPU performance

[Figure: time x Nprocs [s] (axis 0 to 200) against Nprocs = 1, 2, 4, 8, 16, 32, 64, 128. Legend: tree construction, mpisendp2p, mpisendm2l, P2Pkernel, P2Mkernel, M2Mkernel, M2Lkernel, L2Lkernel, L2Pkernel.]

To demonstrate the capability on large problems, we use a large collection of randomly oriented lysozyme molecules, arranged inside a cubic volume. One such collection is shown in the figure above. This setup is meant to mimic Brownian dynamics of a crowded molecule environment.

The largest calculation we conducted consists of

• 10,648 lysozyme molecules
• each surface discretized into 102,486 elements
• more than 20 million atoms
• over 1 billion unknowns

This calculation required only ~1 min per iteration on 512 GPUs, using the cluster of the Nagasaki Advanced Computing Center, which was inexpensively built with 144 host nodes and 288 GTX 295 cards (PI: Prof. T. Hamada).
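The problem size quoted above can be checked with a little arithmetic. The sketch assumes one unknown per surface element and 8-byte double-precision matrix entries; neither is stated on the poster.

```python
molecules = 10_648
elements_per_molecule = 102_486
gpus = 512

# Assumption: one unknown per surface element.
unknowns = molecules * elements_per_molecule  # 1,091,270,928, i.e. over 1 billion

# Average share of the unknowns handled by each GPU (about 2.1 million).
unknowns_per_gpu = unknowns / gpus

# Storage for the dense BEM matrix in double precision: N^2 entries of 8 bytes,
# about 9.5e18 bytes, which is why the matrix is never formed explicitly.
dense_matrix_bytes = unknowns ** 2 * 8
```

The product agrees with the "over 1 billion unknowns" figure, and the storage estimate makes clear that a matrix-free method is the only practical option at this scale.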

The strong scaling of the FMM on multi-GPUs is shown below, up to 128 GPUs.

Electrostatic interactions play a crucial role in the function of biological molecules. Functional properties that are ruled by electrostatics include:

1) electron transfer reactions
   • involved in key transduction processes, e.g., photosynthesis
2) ligand binding to proteins
   • involved in structure-based drug design
3) enzyme catalysis
   • involved in all chemistry of life!
   • behind much of biotechnology, e.g., biofuels
4) protein folding and stability
   • involved in diseases like Alzheimer’s

All the codes developed in our group are free (like free beer) and open source. To download them, follow the links from our group website:

http://barbagroup.bu.edu

Also on the website are up-to-date bibliographic references, and papers for download. Please visit!

The surface charge density reproduces the potential of the original problem, but in a homogeneous dielectric space.

The boundary element method (BEM) solution of the integral equation problem results in a linear system with N unknowns, with a dense matrix. Solving it with iterative methods would require O(N^2) calculations for the matrix-vector products.
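The role of the matrix-vector product can be made concrete with a matrix-free solver sketch. The iterative solver only needs a callback that applies the matrix to a vector, so a dense O(N^2) product can later be swapped for a fast one. Jacobi iteration is used here purely as a short stand-in; the poster does not say which iterative method is used, and Jacobi only converges for suitable (for example, diagonally dominant) matrices. All names are hypothetical.

```python
def dense_matvec(kernel, n, x):
    """y = A x with A[i][j] = kernel(i, j), without ever storing A.

    Costs n * n kernel evaluations per call; this is the O(N^2) step.
    """
    return [sum(kernel(i, j) * x[j] for j in range(n)) for i in range(n)]


def jacobi_solve(matvec, diagonal, b, iterations=50):
    """Solve A x = b by x <- x + D^{-1} (b - A x).

    The matrix is touched only through matvec, so the same loop works
    whether matvec is a dense product or a fast approximate one.
    """
    x = [0.0] * len(b)
    for _ in range(iterations):
        ax = matvec(x)
        x = [xi + (bi - axi) / di for xi, bi, axi, di in zip(x, b, ax, diagonal)]
    return x
```

Replacing `dense_matvec` with an FMM-based callback would leave `jacobi_solve` unchanged, which is the design point: the fast algorithm plugs in at the matrix-vector product.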

We have developed a fast multipole BEM for biomolecular electrostatics. With GPU acceleration of the FMM, there is a multiplicative speed-up resulting from the fast O(N) algorithm and GPU hardware. We can obtain converged results for multi-million atom systems in less than an hour, using multi-GPU clusters.

Now the bottleneck is generating the surface mesh.

[Figures: two breakdowns of cputime [s] (axis 0 to 200). First legend: tree construction, mpisendp2p, mpisendm2l, P2Pkernel, P2Mkernel, M2Mkernel, M2Lkernel, L2Lkernel, L2Pkernel. Second legend: tree construction, mpisendp2p, mpisendm2l, chunking task, buffering data, cudaSetDevice, cudaMalloc, cudaMemcpy, cudaKernel.]

[Figure: diagrams labeled with regions I and II, point charges q1 and q2, and + and − charges on the surface.]
