
Cost-efficient MD simulations

“BEST BANG FOR YOUR BUCK”

Carsten Kutzner, Theoretical & Computational Biophysics, MPI for Biophysical Chemistry

HOW CAN I PRODUCE AS MUCH TRAJECTORY AS POSSIBLE FOR MY SCIENCE?

YOU HAVE ACCESS TO COMPUTE RESOURCES → fixed amount of core-h

YOU WANT TO BUY A CLUSTER → fixed amount of money

TASK 1: CORE-H ➔ .XTC

HOW TO GET OPTIMAL GROMACS PERFORMANCE?

TASK 2: € ➔ .XTC

WHAT IS THE OPTIMAL HARDWARE TO RUN GROMACS ON?

COST-EFFICIENT MD SIMULATIONS

WHAT IS THE ‘OPTIMAL’ HARDWARE TO BUY?

WHAT DO WE WANT? A general-purpose cluster for all kinds of applications would need:

▸ large RAM

▸ high-throughput, low-latency interconnect

▸ double-prec. GPU performance

▸ large GPU memory

WHAT CAN WE SPARE? Specialization maximizes cost-efficiency: for a GROMACS-only cluster, all of the above can be spared — even a 2M atom system requires only 225 MB RAM on the GPU.

80%: max. sampling, many separate simulations — what we optimize our cluster for!

20%: single long trajectories — run these @ national HPC centers.

For us:

1. high performance-to-price ratio → maximize trajectory output per invested €

2. low energy consumption

3. good single-node performance

4. low rack space requirements

5. scaling across many cluster nodes → HPC centers

(in decreasing order of importance)

get prices + benchmark GROMACS performance for all reasonable hardware configurations

‘Best bang for your buck’ (2015): 2 benchmark systems (80k / 2M atoms), 12 CPU types, 13 GPU types, >50 hardware configurations

on each hardware configuration, try to get optimal GROMACS performance

COMPILATION: COMPILER, SIMD INSTRUCTIONS, MPI LIB

SYSTEM SETUP: V-SITES, BOX TYPE

MDRUN: FIND OPTIMAL RUN-TIME PARAMETERS

FINDING THE OPTIMAL HARDWARE

Best Bang for Your Buck: GPU Nodes for GROMACS Biomolecular Simulations

Carsten Kutzner,*[a] Szilárd Páll,[b] Martin Fechner,[a] Ansgar Esztermann,[a] Bert L. de Groot,[a] and Helmut Grubmüller[a]

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data / multiple program multiple data parallelism, while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node’s simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.

DOI: 10.1002/jcc.24030

Introduction

Many research groups in the field of molecular dynamics (MD) simulation and also computing centers need to make decisions on how to set up their compute clusters for running the MD codes. A rich variety of MD simulation codes is available, among them CHARMM,[1] Amber,[2] Desmond,[3] LAMMPS,[4] ACEMD,[5] NAMD,[6] and GROMACS.[7,8] Here, we focus on GROMACS, which is among the fastest ones, and provide a comprehensive test intended to identify optimal hardware in terms of MD trajectory production per investment.

One of the main benefits of GROMACS is its bottom-up performance-oriented design aimed at highly efficient use of the underlying hardware. Hand-tuned compute kernels allow utilizing the single instruction multiple data (SIMD) vector units of most consumer and high performance computing (HPC) processor platforms, while OpenMP multithreading and GROMACS’ built-in thread-MPI (message passing interface) library together with non-uniform memory access (NUMA)-aware optimizations allow for efficient intranode parallelism. Using a neutral-territory domain decomposition (DD) implemented with MPI, a simulation can be distributed across multiple nodes of a cluster. Beginning with version 4.6, the compute-intensive calculation of short-range non-bonded forces can be off-loaded to graphics processing units (GPUs), while the CPU concurrently computes all remaining forces such as long-range electrostatics, bonds, and so forth, and updates the particle positions.[9] Additionally, through multiple program multiple data (MPMD) task-decomposition the long-range electrostatics calculation can be off-loaded to a separate set of MPI ranks for better parallel performance. This multilevel heterogeneous parallelization has been shown to achieve strong scaling to as little as 100 particles per core, at the same time reaching high absolute application performance on a wide range of homogeneous and heterogeneous hardware platforms.[10,11]

A lot of effort has been invested over the years in software optimization, resulting in GROMACS being one of the fastest MD software engines available today.[7,12] GROMACS runs on a wide range of hardware, but some node configurations produce trajectories more economically than others. In this study, we ask: What is the “optimal” hardware to run GROMACS on and how can optimal performance be obtained?

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

[a] C. Kutzner, M. Fechner, A. Esztermann, B. L. de Groot, H. Grubmüller — Theoretical and Computational Biophysics Department, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany. E-mail: [email protected]

[b] S. Páll — Theoretical and Computational Biophysics, KTH Royal Institute of Technology, 17121 Stockholm, Sweden

Contract grant sponsor: DFG priority programme “Software for Exascale Computing” (SPP 1648)

© 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. Journal of Computational Chemistry 2015, 36, 1990–2008.

Coulomb + vdW interactions make up most of the time step.

PME decomposes these into SR (direct) and LR (grid) contributions.

PME allows shifting work between the real-space, SR (PP), and reciprocal-space, LR (PME), parts by balancing the cutoff against the grid spacing.
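As an illustration of this cutoff-vs-grid-spacing balance, here is a minimal sketch (not GROMACS code; the starting values of 0.9 nm and 0.12 nm are placeholders): scaling both by the same factor keeps the electrostatics accuracy roughly constant while shifting work between the SR (GPU/PP) and LR (CPU/PME) parts.

```python
# Illustrative sketch only (not GROMACS code): shifting PME work between
# real space (short-range, can run on the GPU) and reciprocal space
# (long-range, PME grid on the CPU) by scaling the cutoff and the grid
# spacing by the same factor, which keeps the overall accuracy roughly constant.

def shift_pme_work(rcoulomb_nm, fourier_spacing_nm, scale):
    """scale > 1: larger cutoff -> more SR (GPU) work, coarser grid -> less LR (CPU) work."""
    return rcoulomb_nm * scale, fourier_spacing_nm * scale

# hypothetical starting values: 0.9 nm cutoff, 0.12 nm Fourier grid spacing
for s in (1.0, 1.1, 1.2):
    rc, fs = shift_pme_work(0.9, 0.12, s)
    print(f"scale {s:.1f}: rcoulomb = {rc:.2f} nm, fourier spacing = {fs:.2f} nm")
```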

GROMACS TIME STEP

[Flowchart: one MD time step, starting from initial x_i, v_i — if a neighbor-searching step is due: domain decomposition and neighbor searching; then spread charges, 3D FFT, solve PME, 3D inverse FFT, interpolate forces (the LR Coulomb & van der Waals part), alongside SR forces and bonded forces; finally update coordinates. The cutoff/grid balance moves work between more LR (CPU) and more SR (GPU) work.]

GROMACS TIME STEP / PARALLEL

[Flowchart: the same time step replicated across MPI ranks r0, r1, r2, …; the direct-space interactions are decomposed into domains, and each MPI rank runs several OpenMP threads.]

SR NON-BONDED FORCES ARE OFFLOADED TO GPUS, WITH AUTOMATIC BALANCING

[Flowchart: per MPI rank, coordinates and charges (x, q) are transferred to the GPU, which computes the SR non-bonded forces and returns forces and energies (f, E), while the CPU concurrently handles bonded forces and the PME part (spread charges, 3D FFT, solve PME, 3D inverse FFT, interpolate forces); the cutoff/grid balance shifts work between more LR (CPU) and more SR (GPU) work.]

MPI + OpenMP → work can be distributed in various ways

pure OpenMP performs well on single CPUs, but does not scale well across sockets
→ on multi-socket nodes pure MPI is best

OpenMP+MPI adds overhead

2x 8-core E5-2690 (Sandy Bridge), RNAse protein, solvated, 24k atoms, PME, 0.9 nm cutoffs (figure taken from S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl, EASC 2014, Springer, 2015)

[Figure: single-node performance (ns/day) vs. number of cores for OpenMP, MPI, and MPI+OpenMP (two ranks) parallelization.]

Figure 3: Comparison of single-node simulation performance using MPI, OpenMP, and combined MPI+OpenMP parallelization. The OpenMP multithreading (blue) achieves the highest performance and near-linear scaling up to 8 threads, deteriorating only when threads in OpenMP regions need to communicate across the system bus. In contrast, the MPI-only parallel runs (red), requiring less communication, scale well across sockets. Combining MPI and OpenMP parallelization with two ranks and a varying number of threads (green) results in worse performance due to the added overhead of the two parallelizations. The simulations were carried out on a dual-socket node with 8-core Intel Xeon E5-2690 (2.8 GHz Sandy Bridge). Input system: RNAse protein, solvated in a rectangular box, 24k atoms, PME electrostatics, 0.9 nm cut-off.

With GPUs it is beneficial to have few large domains offloading their data to the GPU → use pure OpenMP unless multi-socket.

Multi-socket GPU nodes → find the optimum!

THE OPTIMAL MIX OF THREADS & RANKS

2x E5-2680v2 (2x 10 cores) processors with 4x GTX 980 GPUs

[Figure: performance of the MEM and RIB benchmarks for all rank x thread combinations (1x40, 2x20, 4x10, 5x8, 8x5, 10x4, 20x2) with 0, 1, 2, 3, or 4 GPUs, each with and without DLB.]

CPU nodes: ✓ pure MPI
GPU nodes: ✓ several threads/rank

Choosing the optimal mix of threads & ranks gains up to +30 %.
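The rank × thread grid scanned above is simply the set of factorizations of the node's hardware-thread count; a tiny helper (illustrative only, not part of GROMACS) makes the candidate settings explicit:

```python
# List all (thread-MPI ranks, OpenMP threads per rank) pairs that exactly
# fill a node -- e.g. 40 hardware threads on the 2x 10-core E5-2680v2 node.

def rank_thread_combos(n_hw_threads):
    return [(r, n_hw_threads // r)
            for r in range(1, n_hw_threads + 1)
            if n_hw_threads % r == 0]

for ranks, threads in rank_thread_combos(40):
    print(f"-ntmpi {ranks:2d} -ntomp {threads:2d}")
```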

Table 4: Some GPU models that can be used by GROMACS. The upper part of the table lists HPC-class Tesla cards; below are the consumer-class GeForce GTX cards. For the GTX 980 GPUs, cards by different manufacturers differing in clock rate were benchmarked; + and ‡ symbols are used to differentiate between them.

NVIDIA model    architecture    CUDA cores   clock rate (MHz)   memory (GB)   SP throughput (Gflop/s)   ≈ price (€, net)
Tesla K20X (a)  Kepler GK110    2,688        732                6             3,935                     2,800
Tesla K40 (a)   Kepler GK110    2,880        745                12            4,291                     3,100
GTX 680         Kepler GK104    1,536        1,058              2             3,250                     300
GTX 770         Kepler GK104    1,536        1,110              2             3,410                     320
GTX 780         Kepler GK110    2,304        902                3             4,156                     390
GTX 780Ti       Kepler GK110    2,880        928                3             5,345                     520
GTX Titan       Kepler GK110    2,688        928                6             4,989                     750
GTX Titan X     Maxwell GM200   3,072        1,002              12            6,156                     —
GTX 970         Maxwell GM204   1,664        1,050              4             3,494                     250
GTX 980         Maxwell GM204   2,048        1,126              4             4,612                     430
GTX 980+        Maxwell GM204   2,048        1,266              4             5,186                     450
GTX 980‡        Maxwell GM204   2,048        1,304              4             5,341                     450

(a) See Figure 4 for how performance varies with clock rate of the Tesla cards; all other benchmarks have been done with the base clock rates reported in this table.

GPU acceleration

GROMACS 4.6 and later supports CUDA-compatible GPUs with compute capability 2.0 or higher. Table 4 lists a selection of modern GPUs including some relevant technical information. The single precision (SP) column shows the GPU’s maximum theoretical SP flop rate, calculated from the base clock rate (as reported by NVIDIA’s deviceQuery program) times the number of cores times two floating-point operations per core and cycle. GROMACS exclusively uses single precision floating point (and integer) arithmetic on GPUs and can therefore only be used in mixed precision mode with GPUs. Note that at comparable theoretical SP flop rate the Maxwell GM204 cards yield a higher effective performance than Kepler generation cards due to better instruction scheduling and reduced instruction latencies.

Since the GROMACS CUDA non-bonded kernels are by design strongly compute-bound,[3] GPU main memory performance has little impact on their performance. Hence, peak performance …
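For a quick sanity check of the SP throughput column, the formula described above can be written out directly (a minimal sketch; the two example cards and their table values are taken from the tables in this section):

```python
# Theoretical peak single-precision throughput:
#   Gflop/s = CUDA cores x base clock (MHz) x 2 flops per core and cycle / 1000

def sp_gflops(cuda_cores, clock_mhz):
    return cuda_cores * clock_mhz * 2 / 1000.0

print(sp_gflops(2048, 1126))  # GTX 980:   ~4612 Gflop/s
print(sp_gflops(2880, 745))   # Tesla K40: ~4291 Gflop/s
```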

GPU MODELS (2017)

NVIDIA model   architecture          CUDA cores   clock rate (MHz)   memory (GB)   SP throughput (GFlop/s)   ≈ price (€, net)
Tesla K40      Kepler GK110B         2,880        745                12            4,291                     2,500
Tesla P100     Pascal P100           3,584        1,328              16            9,519                     3,200
GTX 1060       Pascal GP106-400      1,280        1,506              3             3,855                     152
GTX 1070       Pascal GP104-200      1,920        1,506              8             5,783                     330
GTX 1080       Pascal GP104-400      2,560        1,607              8             8,228                     420
GTX 1080Ti     Pascal GP102-350-K1   3,584        1,480              11            10,609                    625

Table 5: Frequency of consumer-class GPUs exhibiting memory errors.

NVIDIA model   GPU memory checker [13]   # of cards tested   # memtest iterations   # cards with errors
GTX 580        memtestG80                1                   10,000                 –
GTX 680        memtestG80                50                  4,500                  –
GTX 770        memtestG80                100                 4,500                  –
GTX 780        memtestCL                 1                   50,000                 –
GTX Titan      memtestCL                 1                   50,000                 –
GTX 780Ti      memtestG80                70                  4× 10,000              6
GTX 980        memtestG80                4                   4× 10,000              –
GTX 980+       memtestG80                70                  4× 10,000              2

Error rates were close to constant for each of the four repeats over 10,000 iterations. We strongly recommend carrying out these stress tests and replacing defective cards before using them in production simulations.

Benchmarking procedure

Balancing the computational load takes mdrun up to a few thousand time steps at the beginning of a simulation. During the load balancing phase performance is neither stable nor optimal, so we excluded the first 1,000–10,000 steps from the measurements using the -resetstep or -resethway command line switches. Whereas execution on non-GPU nodes is under most circumstances faster with activated DLB, on GPU nodes the situation is not so clear due to the competition between DD and CPU–GPU load balancing mentioned in Section 2. We therefore tested both with and without DLB in most of the GPU benchmarks.

The benchmarks were run for 2,000–15,000 steps, which translates to a couple of minutes wall clock runtime for the single-node benchmarks. We aimed to find the optimal command-line settings for each hardware configuration by testing the various parameter combinations as mentioned in Section 2. On individual nodes with Nc cores, to evaluate criteria C1–C2, we tested the following settings using thread-MPI ranks:

(a) Nrank = Nc
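A sweep in this spirit could be scripted roughly as follows (a sketch, not the authors' scripts; the .tpr file name, the rank/thread list, and the step count are placeholders, while -ntmpi, -ntomp, -dlb, -nsteps, -resethway, and -g are standard mdrun options):

```python
# Sketch of a single-node benchmark sweep: for each rank/thread split and
# DLB setting, run a short mdrun and let -resethway discard the first half
# of the steps (the load-balancing phase) from the timings written to the log.
import itertools
import subprocess

def run_benchmark(tpr, ranks, threads, dlb, nsteps=10000):
    log = f"bench_ntmpi{ranks}_ntomp{threads}_dlb{dlb}.log"
    cmd = ["gmx", "mdrun", "-s", tpr,
           "-ntmpi", str(ranks), "-ntomp", str(threads),
           "-dlb", dlb, "-nsteps", str(nsteps),
           "-resethway", "-g", log]
    subprocess.run(cmd, check=True)

combos = [(40, 1), (20, 2), (10, 4), (8, 5), (4, 10), (2, 20)]
for (ranks, threads), dlb in itertools.product(combos, ["yes", "no"]):
    run_benchmark("bench.tpr", ranks, threads, dlb)
```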

CONSUMER GPU ERROR RATES

Consumer GPUs do not have ECC memory and thus cannot correct for rare bit flips; however, GPU stress tests can be used to sort out problematic GPUs.

(13) I. S. Haque, V. S. Pande, In 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, Stanford University, 2010.

Newer GTX 1060/1070/1080 GPUs seem to have comparable error rates.
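A stress-test pass over all cards in a node might be scripted along these lines (a rough sketch; the checker invocation below is a placeholder — consult the memtestG80 / memtestCL documentation for the actual command-line syntax):

```python
# Run a GPU memory stress test on every card in the node and flag cards
# that report errors. The binary name and argument layout below are
# placeholders; adapt them to the actual memtestG80 / memtestCL usage.
import subprocess

NUM_GPUS = 4
MEMTEST_CMD = ["memtestG80"]   # placeholder invocation
ITERATIONS = "10000"           # e.g. 10,000 test iterations per card

for gpu in range(NUM_GPUS):
    result = subprocess.run(MEMTEST_CMD + ["--gpu", str(gpu), ITERATIONS],
                            capture_output=True, text=True)
    if result.returncode != 0 or "error" in result.stdout.lower():
        print(f"GPU {gpu}: memory errors detected -- replace before production")
    else:
        print(f"GPU {gpu}: no errors found")
```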

GPU FREQUENCY THROTTLING

[Figure: GPU frequency and temperature T over time for a GeForce GTX TITAN.]

Consumer GPUs are optimized for acoustics:

their fan speed is limited to 60% of max

they reduce the GPU frequency if they get too hot

this affects performance!

see suppl. for how to fix the GPU fan speed
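Throttling is easy to spot by logging clock and temperature during a run; below is a minimal monitoring sketch using nvidia-smi's query interface (the one-second polling interval and the print format are arbitrary choices):

```python
# Poll GPU temperature and SM clock once per second via nvidia-smi; a clock
# that drops while the temperature rises indicates thermal throttling.
# Stop with Ctrl-C.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    for line in out.stdout.strip().splitlines():
        idx, temp_c, clock_mhz = [v.strip() for v in line.split(",")]
        print(f"GPU {idx}: {temp_c} C, SM clock {clock_mhz} MHz")
    time.sleep(1)
```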


PERFORMANCE TO PRICE — 80K ATOMS / 2M ATOMS (GROMACS 4.6)

[Scatter plot: hardware costs (€) vs. performance (ns/d); diagonal lines mark equal performance-to-price ratios, each a factor of 2 better ("2x as good") than the next. CPU nodes, nodes with consumer-class GPUs, and nodes with Tesla GPUs form distinct groups, with the consumer-GPU nodes giving the best ratio.]

Adding the first GPU yields the largest performance benefit.

PERFORMANCE TO PRICE 2017 (GROMACS 2016)

[Same kind of plot for 2017 hardware with GROMACS 2016, 80k and 2M atoms: consumer-GPU nodes again reach roughly 2x the performance-to-price of the alternatives, and the upcoming PME-on-GPU code shifts the balance further in their favor.]

Over cluster lifetime, energy costs become comparable to hardware costs

assuming 5 yr of operation and 0.2 EUR / kWh (incl. cooling)

balanced CPU/GPU resources keep energy costs low

Table 1: trajectory production and costs over 5 years of operation (energy at 0.2 EUR/kWh incl. cooling).

Node                      ns/d   µs (in 5 yr)   power draw (W)   energy costs (€)   node costs (€)   traj costs (€/µs)   just node   just energy   yield (ns per 1000 €)
2x E5-2670v2              1.38   2.5185         252              2,207.52           3,360            2,211               1,334       877           452
2x E5-2670v2 + 780Ti      3.30   6.0225         519              4,546.44           3,880            1,399               644         755           715
2x E5-2670v2 + 2x 780Ti   3.87   7.06275        666              5,834.16           4,400            1,449               623         826           690
2x E5-2670v2 + 3x 780Ti   4.17   7.61025        933              8,173.08           5,430            1,787               714         1,074         559
2x E5-2670v2 + 4x 780Ti   4.17   7.61025        960              8,409.60           5,950            1,887               782         1,105         530
2x E5-2670v2 + 980        3.86   7.0445         408              3,574.08           3,780            1,044               537         507           958
2x E5-2670v2 + 2x 980     4.18   7.6285         552              4,835.52           4,200            1,184               551         634           844
2x E5-2670v2 + 3x 980     4.20   7.665          696              6,096.96           5,130            1,465               669         795           683
2x E5-2670v2 + 4x 980     4.20   7.665          840              7,358.40           5,550            1,684               724         960           594
2x E5-2680v2              1.86   3.3945         446              3,906.96           4,400            2,447               1,296       1,151         409
2x E5-2680v2 + 980        3.99   7.28175        622              5,448.72           4,850            1,414               666         748           707
2x E5-2680v2 + 2x 980     4.69   8.55925        799              6,999.24           5,300            1,437               619         818           696
2x E5-2680v2 + 3x 980     4.85   8.85125        926              8,111.76           5,750            1,566               650         916           639
2x E5-2680v2 + 4x 980     4.96   9.052          1,092            9,565.92           6,200            1,742               685         1,057         574
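The cost columns in this table follow from a simple model; here is a back-of-the-envelope sketch under the stated assumptions (5 years of operation, 0.2 EUR/kWh including cooling), reproducing the first row:

```python
# EUR per microsecond of trajectory over a node's lifetime:
#   energy cost = power draw x hours of operation x electricity price
#   total cost  = node price + energy cost, divided by the trajectory produced

def trajectory_cost_eur_per_us(ns_per_day, power_w, node_eur,
                               years=5, eur_per_kwh=0.20):
    hours = years * 365 * 24
    energy_eur = power_w / 1000 * hours * eur_per_kwh
    total_us = ns_per_day * 365 * years / 1000
    return (node_eur + energy_eur) / total_us

# 2x E5-2670v2 without GPU: 1.38 ns/d, 252 W, 3360 EUR node price
print(round(trajectory_cost_eur_per_us(1.38, 252, 3360)))  # ~2211 EUR/microsecond
```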

ENERGY EFFICIENCY

[Stacked bar chart: trajectory production costs per microsecond, split into hardware and energy shares, for the node configurations of Table 1 (2x E5-2670v2 and 2x E5-2680v2 with 0–4 GTX 780Ti or GTX 980 GPUs).]

[Bar chart: trajectory costs per microsecond (energy vs. hardware) for 2x E5-2680v2 (2x 10 cores) with 0–4 GTX 980 GPUs, RIB benchmark.]

ENERGY EFFICIENCY

[Bar chart: fixed-budget trajectory yield (ns per 1000 €), taking energy + cooling into account at 0.2 EUR/kWh, RIB benchmark, for 2x E5-2670v2 nodes with 1–4 GTX 780Ti or GTX 980 GPUs and 2x E5-2680v2 nodes with 1–4 GTX 980 GPUs.]

Don't add too many GPUs if you have to pay for energy consumption.

CONCLUSIONS

Buying dedicated MD nodes boosts the performance-to-price ratio.

Nodes with 1–2 consumer-class GPUs produce >2x as much trajectory as CPU nodes or nodes with “professional” Tesla GPUs.

Consumer GPUs with memory errors can be replaced; GPU throttling can be prevented by proper ventilation.

Energy efficiency can be optimized by balancing the GPU to CPU compute power.

The upcoming PME-GPU code further enhances the performance-to-price ratio, as it allows for cheaper CPUs.


THANKS FOR YOUR ATTENTION!

PEOPLE INVOLVED: Martin Fechner, Szilard Pall, Timo Graen, Ansgar Esztermann, Markus Rampp, Aleksei Yupinov, Bert L. de Groot, Helmut Grubmüller

