High performance computing with GROMACS
Transcript
Page 1

CBR

High performance computing with GROMACS

Berk Hess, [email protected]
Center for Biomembrane Research
Stockholm University, Sweden

Monday, October 19, 2009

GROMACS history

• Started in the early 1990s in Groningen (Netherlands)

• Originally parallel hardware and software

• Initially, the focus was mainly on high performance on small numbers of processors

• Development of novel, efficient algorithms

• Highly efficient implementation

• The past few years: focus on parallel scaling

GPL, http://www.gromacs.org


Page 2

GROMACS

• GROningen MAchine for Chemical Simulation

• Core developers:
  David van der Spoel (Groningen → Uppsala)
  Berk Hess (Groningen → Mainz → Stockholm)
  Erik Lindahl (Stockholm)

• Newer developers:
  Gerrit Groenhof (Groningen → Göttingen)
  Carsten Kutzner (Göttingen)
  Roland Schultz (Oak Ridge)
  Sander Pronk (Stockholm)
  ...


Improving performance

• Increasing the time step:

• Use bond constraints, LINCS algorithm (2 fs)

• Remove H-vibrations with virtual sites (5 fs)

• Performance increase: factor 2 or more!

• Reducing the time per step

• Efficient algorithms and code

• Run in parallel over many processors

virtual sites: Feenstra, Hess, Berendsen, J. Comp. Chem. 20, 786 (1999)

LINCS: Hess, Bekker, Fraaije, Berendsen J. Comp. Chem. 18, 1463 (1997)
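To make the constraint bullet above concrete: constraints remove the fastest bond vibrations (X-H stretches have periods of roughly 10 fs), which is what permits the 2 fs step. Below is a deliberately simplified sketch that resets a single, isolated bond to its reference length after an unconstrained update, splitting the correction by inverse mass. It is not LINCS, which solves many coupled constraints via a sparse matrix expansion; the names are illustrative.

    /* Toy constraint: after an unconstrained position update, restore the
     * i-j distance to d0 while keeping the pair's center of mass fixed. */
    #include <math.h>

    typedef struct { double x[3]; double inv_mass; } Atom;

    void constrain_bond(Atom *a, Atom *b, double d0)
    {
        double dx[3], d2 = 0.0;
        for (int k = 0; k < 3; k++) {
            dx[k] = b->x[k] - a->x[k];
            d2   += dx[k] * dx[k];
        }
        double d    = sqrt(d2);
        double w    = a->inv_mass + b->inv_mass;
        double corr = (d - d0) / (d * w);      /* fraction of dx to move back */
        for (int k = 0; k < 3; k++) {
            a->x[k] += a->inv_mass * corr * dx[k];
            b->x[k] -= b->inv_mass * corr * dx[k];
        }
    }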


Page 3

GROMACS Approaches

• Algorithmic optimization:

  • No virial in nonbonded kernels

  • Single precision by default (cache, bandwidth usage)

  • Tuning to avoid conditional statements such as PBC checks

  • Triclinic cells everywhere: can save 15-20% on system size

• Optimized 1/sqrt(x)

  • Used ~150,000,000 times/sec

  • Handcoded asm for ia32, x86-64, ia64, Altivec, VMX, BlueGene (SIMD)

[Diagram: SIMD vector operation, (a0 a1 a2 a3) + (b0 b1 b2 b3) = (c0 c1 c2 c3)]
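As an illustration of the SIMD arithmetic sketched above, here is a hedged example of a fast 1/sqrt(x): a hardware reciprocal-square-root estimate refined by one Newton-Raphson iteration, written with SSE intrinsics. The production kernels of that era were hand-coded assembly for each architecture; this only shows the technique.

    /* Approximate 1/sqrt(x) for four floats at once: hardware estimate
     * (~12 bits of accuracy) refined with one Newton-Raphson step. */
    #include <xmmintrin.h>

    static inline __m128 fast_invsqrt_ps(__m128 x)
    {
        const __m128 half  = _mm_set1_ps(0.5f);
        const __m128 three = _mm_set1_ps(3.0f);

        __m128 y   = _mm_rsqrt_ps(x);          /* rough estimate */
        __m128 y2  = _mm_mul_ps(y, y);
        __m128 xy2 = _mm_mul_ps(x, y2);
        /* Newton-Raphson: y <- 0.5 * y * (3 - x*y*y) */
        return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xy2));
    }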


GROMACS 4.0

• GROMACS 4.0 released October 2008

[Plot: performance in ns/day (0-40) versus number of cores (0-64) for GROMACS 4.0 and GROMACS 3.3]

Hess, Kutzner, Van der Spoel, Lindahl; JCTC 4, 435 (2008)


Page 4

8th-shell decomposition

[Figure: communication zones for (a) half-shell, (b) 8th-shell and (c) 8th-shell midpoint decompositions, with cut-off radius rc (rc/2 in the midpoint case)]

8th shell: Liem, Brown, Clarke; Comput. Phys. Commun. 67(2), 261 (1991)

Midpoint: Bowers, Dror, Shaw, JCP 124, 184109 (2006)

8th-shell requires only 1/4 of the communication of half-shell
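As a hedged illustration of the midpoint idea cited above (Bowers, Dror, Shaw): a pair interaction is computed by the domain whose cell contains the midpoint of the two atoms. The sketch below uses a plain rectangular cell grid and assumes both coordinates are already wrapped into the box and the pair is not split across a periodic boundary; the names are illustrative, and GROMACS's actual eighth-shell zones are more elaborate.

    /* Assign a pair (a, b) to the cell that contains its midpoint. */
    typedef struct { double x, y, z; } Vec3;

    int owner_cell(Vec3 a, Vec3 b, Vec3 box, int nx, int ny, int nz)
    {
        double mx = 0.5 * (a.x + b.x);
        double my = 0.5 * (a.y + b.y);
        double mz = 0.5 * (a.z + b.z);
        int ix = (int)(mx / box.x * nx);   /* cell index along x, 0..nx-1 */
        int iy = (int)(my / box.y * ny);
        int iz = (int)(mz / box.z * nz);
        return (ix * ny + iy) * nz + iz;   /* linear cell/rank index */
    }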


Dynamic load balancing

Triclinic, 2D example

• Causes of load imbalance:

• Atom inhomogeneity

• Inhomogeneous interaction cost

• Statistical fluctuation

• Full, 3D dynamic load balancing required

• Hardware cycle counters

[Figure: triclinic 2D example of dynamic load balancing with staggered cell boundaries (cells 0-3), showing the cut-off radius rc and a second communication radius rb]
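A hedged sketch of the balancing step itself, assuming per-cell load is measured with the hardware cycle counters mentioned above: cells that took longer than average are shrunk, faster ones are grown. GROMACS does this per dimension with staggered boundaries and a minimum cell size tied to the cut-off; the 1D function below (illustrative names) only shows the principle.

    /* Rescale 1D cell widths inversely with measured load (cycle counts),
     * then renormalize so the cells still tile the original length.
     * Real code would damp the change and enforce a minimum cell size. */
    void balance_1d(double *width, const double *cycles, int ncells)
    {
        double total_w = 0.0, total_c = 0.0;
        for (int i = 0; i < ncells; i++) {
            total_w += width[i];
            total_c += cycles[i];
        }
        double avg = total_c / ncells;

        double new_total = 0.0;
        for (int i = 0; i < ncells; i++) {
            width[i] *= avg / cycles[i];   /* overloaded cell shrinks */
            new_total += width[i];
        }
        for (int i = 0; i < ncells; i++) {
            width[i] *= total_w / new_total;
        }
    }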


Page 5

MPMD force calculation

• PME = rapid Ewald summation

• Ubiquitous in simulations today

• Small 3D FFTs scale badly: all-to-all communication

• Real space & PME are independent

• Dedicate a subset of nodes to run a separate PME-only version of the program to improve scaling (see the sketch below)

[Diagram: PME FFT over 4 cores instead of 16 cores]
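A hedged sketch of such a split in plain MPI: a fixed fraction of the ranks is set aside for PME and each group gets its own communicator. The 1:3 PME-to-particle ratio and the loop functions are illustrative placeholders, not GROMACS's automatic choice or API.

    /* Split MPI_COMM_WORLD into particle-particle (PP) and PME-only groups. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int npme   = size / 4;              /* e.g. a quarter of the ranks */
        int is_pme = rank >= size - npme;   /* last ranks do PME only      */

        MPI_Comm group_comm;                /* PP ranks or PME ranks       */
        MPI_Comm_split(MPI_COMM_WORLD, is_pme, rank, &group_comm);

        if (is_pme) {
            /* pme_loop(group_comm):  receive x/q, do FFTs, return forces  */
        } else {
            /* pp_loop(group_comm):   domain decomposition, short-range    */
        }

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }

Each group then communicates internally through group_comm, while the original communicator remains available for the coordinate/force exchange between peer PP and PME ranks.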

Parallel constraints

• Constraints required for 5 fs time steps

• Parallel LINCS algorithm: P-LINCS

• LINCS has a (short) finite interaction range

• First efficient parallel constraint algorithm

Hess; JCTC 4, 116 (2008)


Page 6

Flowcharts


Flowcharts

GROMACS 3.3 main loop:

read_data
→ reset_r_in_box
→ communicate_r
→ compute_forces
→ communicate_and_sum_f
→ update_r_and_v
→ output_step
→ more steps? YES: back to reset_r_in_box; NO: Done
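In code form, the 3.3 loop above looks roughly like the sketch below. The function names follow the flowchart boxes, but the empty stubs and the exact ordering are only an illustration, not the real GROMACS source.

    /* Skeleton of the per-step loop; the stubs stand in for the real routines. */
    static void read_data(void)             {}
    static void reset_r_in_box(void)        {}  /* put atoms back in the periodic box    */
    static void communicate_r(void)         {}  /* send coordinates to neighbor nodes    */
    static void compute_forces(void)        {}
    static void communicate_and_sum_f(void) {}  /* collect and sum forces from neighbors */
    static void update_r_and_v(void)        {}  /* integrate positions and velocities    */
    static void output_step(void)           {}

    void md_loop(int nsteps)
    {
        read_data();
        for (int step = 0; step < nsteps; step++) {
            reset_r_in_box();
            communicate_r();
            compute_forces();
            communicate_and_sum_f();
            update_r_and_v();
            output_step();
        }
    }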

Page 7

Flowcharts

GROMACS 3.3 main loop (as above), shown next to the GROMACS 4.0 flowchart, which splits the work between real space (particle) nodes and PME nodes.

GROMACS 4.0, real space (particle) node, per MD step:
Start
→ communicate coordinates needed to construct virtual sites; construct virtual sites
→ neighborsearch step? if yes: domain decomposition and send charges to peer PME processor
→ send x and box to peer PME processor
→ communicate x with real space neighbor processors
→ (local) neighborsearching (on neighborsearch steps)
→ evaluate potential/forces
→ communicate f with real space neighbor processors
→ receive forces/energy/virial from peer PME processor
→ spread forces on virtual sites; communicate forces from virtual sites
→ integrate coordinates
→ constrain bond lengths (parallel LINCS)
→ sum energies of all real space processors
→ more steps? if yes: next step; if no: Stop

GROMACS 4.0, PME node, per MD step:
→ neighborsearch step? if yes: receive charges from peer real space processors
→ receive x and box from peer real space processors (until all local coordinates received)
→ communicate some atoms to neighbor PME proc's
→ spread charges on grid
→ communicate grid overlap with PME neighbor proc's
→ parallel 3D FFT
→ solve PME (convolution)
→ parallel inverse 3D FFT
→ communicate grid overlap with PME neighbor proc's
→ interpolate forces from grid
→ communicate some forces to neighbor PME proc's
→ send forces/energy/virial to peer real space processors
→ more steps? if yes: next step; if no: Stop


Performance (old slide)

1 µs in 3-4 weeks using 170 CPUs: 50x longer than previously possible

Cray XT4 @ CSC, 200,000 atoms


Page 8

DLB in action

• 8x6=48 PP cores

• 16 PME cores

• protein: “slow”

• lipids: fast


Algorithm efficiency

• Protein system:

• T4-lysozyme

• H2O, Cl-

• 24199 atoms

• 1 nm cut-off

• PME

[Plot: performance in ns/day (0-50) versus number of cores (0-40) for: t = 4 fs with virtual sites; t = 2 fs; t = 2 fs without load balancing; t = 2 fs without MPMD PME]


Page 9

Scaling limits

• Without Particle-Mesh-Ewald

• Weak scaling: no limit

• Strong scaling: ~300 atoms per core

• With Particle-Mesh-Ewald

• “1D”-PME, 100’s of cores

• GROMACS 4.1: “2D”-PME, 1000’s of cores

• GROMACS 5: Multi-grid?
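To make the 1D-versus-2D distinction above concrete: with a 1D decomposition the PME grid is cut into slabs along one axis, so the number of usable PME ranks is capped by the number of grid lines along that axis; a 2D ("pencil") decomposition cuts along two axes and lifts that cap. The mapping below is a generic pencil layout with illustrative names, not the exact GROMACS 4.1 scheme.

    /* Assign each PME rank a "pencil": a patch in x and y covering all of z. */
    typedef struct { int x0, x1, y0, y1; } PencilRange;   /* [x0,x1) x [y0,y1) */

    PencilRange pme_pencil(int rank, int npx, int npy, int gridx, int gridy)
    {
        int px = rank / npy;               /* position along x */
        int py = rank % npy;               /* position along y */
        PencilRange r;
        r.x0 =  px      * gridx / npx;
        r.x1 = (px + 1) * gridx / npx;
        r.y0 =  py      * gridy / npy;
        r.y1 = (py + 1) * gridy / npy;
        return r;
    }

A 1D decomposition is the special case npy = 1, which is why it tops out at a number of ranks no larger than the grid size along that one axis.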


Membrane protein

• Kv1.2/2.1 voltage-gated ion channel

• Open & closed state

• Contains a voltage sensor

• How does it work?

• Problem: transition is slow

• Energy barrier: ~10 kBT

• Long simulations: ms, µs

Sophie Schwaiger


Page 10


0.5 µs simulation


Page 11


Recent GROMACS & hardware developments


Page 12

PDC PRACE prototype

• PRACE test machines with 4x6 AMD cores

• 24 core nodes connected with Infiniband

• Issue: 24 cores share a network connection


Global summation

• Most thermo/barostats need global summation

• But this can be VERY expensive relative to the rest of a step

• Avoid when possible!

• GROMACS 3-step summation procedure (sketched in code below):

• MPI_Reduce, 24 cores (within each node)

• MPI_Allreduce, N nodes (one rank per node)

• MPI_Bcast, 24 cores (within each node)

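A hedged sketch of this three-step summation in MPI, assuming ranks are numbered consecutively within each node; in real code the two communicators would be created once and reused rather than on every call. MPI-3's MPI_Comm_split_type(..., MPI_COMM_TYPE_SHARED, ...) could replace the arithmetic intranode split.

    /* Three-step global sum: reduce inside the node, allreduce between
     * node masters only, then broadcast the result inside the node. */
    #include <mpi.h>

    double global_sum(double local, MPI_Comm world, int cores_per_node)
    {
        int rank;
        MPI_Comm_rank(world, &rank);

        int node     = rank / cores_per_node;   /* which node this rank is on */
        int noderank = rank % cores_per_node;   /* rank within that node      */

        MPI_Comm intranode, internode;
        MPI_Comm_split(world, node, noderank, &intranode);
        MPI_Comm_split(world, noderank == 0 ? 0 : MPI_UNDEFINED, node, &internode);

        double node_sum = 0.0, total = 0.0;

        MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, intranode);     /* step 1 */
        if (noderank == 0) {
            MPI_Allreduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, internode); /* step 2 */
        }
        MPI_Bcast(&total, 1, MPI_DOUBLE, 0, intranode);                          /* step 3 */

        if (internode != MPI_COMM_NULL) {
            MPI_Comm_free(&internode);
        }
        MPI_Comm_free(&intranode);
        return total;
    }

The gain comes from step 2: only one rank per 24-core node takes part in the expensive inter-node MPI_Allreduce.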

Page 13

Global summation

[Plot: time for global summation in milliseconds (0-0.8) versus number of cores (24, 2x24, 4x24, 8x24), comparing a single MPI_Allreduce over all cores with the 3 MPI calls above]


PRACE scaling

[Plot: performance in ns/day (0-80) versus number of cores (0-300)]

GROMACS scaling on the 24-core AMD blade PRACE prototype; 331,776-atom system, reaction-field, 2 fs step length


Page 14

Multi-million atom biological system

• Cellulose, H2O, lignocellulosic biomass (biofuel)

• No charged groups -> reaction-field (no PME)

• 3.3 million atoms

Schultz, Linder, Petridis, Smith; JCTC (2009)
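For context, reaction-field electrostatics replaces the long-range mesh part of PME with an analytic correction inside the cut-off r_c. The commonly used pair potential has the form below, where ε_rf is the reaction-field dielectric constant and ε_r the relative permittivity of the medium; this is the standard textbook form, quoted here for reference rather than taken from the slides.

$$ V_{\mathrm{rf}}(r_{ij}) = \frac{q_i q_j}{4\pi\varepsilon_0\,\varepsilon_r}\left(\frac{1}{r_{ij}} + k_{\mathrm{rf}}\,r_{ij}^2 - c_{\mathrm{rf}}\right),\qquad k_{\mathrm{rf}} = \frac{\varepsilon_{\mathrm{rf}}-\varepsilon_r}{(2\varepsilon_{\mathrm{rf}}+\varepsilon_r)\,r_c^{3}},\qquad c_{\mathrm{rf}} = \frac{1}{r_c} + k_{\mathrm{rf}}\,r_c^{2} $$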


10k scaling


Page 15

What is the limit?

• GROMACS 4.0: linear scaling algorithms

• But still practical limitations:

• File system access at start-up (fexist)

• Data distribution at start-up

• Still some O(#atoms) memory allocation

100M atoms? 100k cores?


A large machine

• Cray XT5

• 150 000+ AMD Opteron 2.3 GHz cores

• SeaStar 2+ interconnect

• Upgrade planned to 450 000 cores

JaguarPF at Oak Ridge


Page 16

Scaling to 150 000

• peptides + H2O

• 102M atoms

• Reaction-field

• 1.2 nm cut-off

• no DLB

[Plot: performance in ns/day (0-25) versus number of cores (0-150,000)]


GROMACS outlook

• Large systems:

  • Improve electrostatics scaling

• Medium systems:

  • Combine MPI with threads (see the sketch below)

• Small systems:

  • Distributed computing: GROMACS on Folding@Home
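A hedged sketch of what "combine MPI with threads" can look like at the process level: one MPI rank per node with OpenMP threads inside it. This shows only the generic initialization pattern, not GROMACS's own thread implementation.

    /* Hybrid MPI + OpenMP skeleton: MPI between nodes, threads within a node. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Request thread support; FUNNELED = only the main thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < 1000000; i++) {
                /* each thread works on its share of, e.g., the force loop */
            }
            #pragma omp master
            printf("rank %d running %d threads\n", rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }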
