
    ACCELEWARE FDTD PERFORMANCE GUIDE

    Nine easy ways to speed up your simulation

February 2010

    Logan Maxwell, Mike Weldon


    Copyright Notice

All material herein is Acceleware copyright and shall not be reproduced, copied, forwarded, published or shared in any manner without prior written authorization from Acceleware.

All rights reserved. Acceleware, the Acceleware logo and wordmark are registered trademarks and/or trademarks of Acceleware Corp. in the United States, Canada and other countries. All other trademarks are the property of their respective owners.


Table of Contents

Copyright Notice
Overview
The Fundamentals of FDTD Performance on GPUs
1) Perfectly Matched Layers (PML)
2) Reads and Read Regions (Observations, DFT, Convergence, etc.)
3) Screen Savers
4) Simulation Orientation
5) Number of Materials
6) Dispersive Materials
7) Mesh Density
8) Windows Remote Log In
9) Multi-GPU Systems


    Overview

    Introduction

    Fastest-possible FDTD simulation performance on GPU and multi-core hardware is a key objective

    for partners and end users alike. The hardware and Acceleware software libraries that make it run

    are obviously key determinants of the ultimate performance, but partners and end users can still

    have a large impact on the performance in ways that are not always obvious. This document

outlines several key simulation parameters that impact simulation performance. Each case includes a brief description of the parameter, a plot illustrating the performance impact, and tips on how to minimize any speed reduction. Improper use of these parameters, whether intended or not, can reduce simulation speed by 50% or more. Understanding the suggestions in this document will help you avoid unnecessary reductions in performance and get the most out of all your simulations.

    Intended Audience

Acceleware partners, end users, and anyone interested in running FDTD simulations optimally on GPUs and multi-core hardware. This document should be considered essential reading for engineers and scientists running FDTD simulations; it will help make sure they are always getting the most out of their simulation tools.


    The Fundamentals of FDTD Performance on GPUs

The chart below shows single-GPU performance only. Faster throughput can be achieved by sharing the computations across multiple GPUs; typical scaling when doing so is 80-90 percent.

Ramp Up: In this range the GPU is not using all of its compute resources and memory bandwidth efficiently. In addition, PML takes up a large portion of the total simulation size and acts to slow the total simulation throughput.

Knee: The knee is the point at which the performance levels off and the GPU is running optimally.

Optimal Range: This is the optimal range because the GPU has found a good balance between computation and communication. The goal of any GPU FDTD code is to maximize this region's breadth and magnitude.

GPU Memory Limit: This is the point at which the GPU runs out of memory and the CPU begins to solve the remaining calculations.

Soft Memory: In this area the CPU is solving the calculations that the GPU does not have memory for. As simulation size goes further into soft memory, performance approaches that of the CPU.

How to calculate throughput performance:

Throughput (Mcells/s) = Simulation Size (Mcells) × Number of Time Steps / Solve Time (s)

Note: Simulation Size does not include PML cells.
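The same calculation as a short Python sketch (function and variable names are illustrative, not part of the Acceleware API):

```python
def throughput_mcells_per_s(size_mcells, time_steps, solve_time_s):
    """Throughput in Mcells/s for a completed run.

    size_mcells:  simulation size in millions of cells, excluding PML cells.
    time_steps:   number of FDTD time steps executed.
    solve_time_s: total solve time in seconds.
    """
    return size_mcells * time_steps / solve_time_s

# Example: a 30 Mcell simulation running 2,000 time steps in 100 s
# sustains 30 * 2000 / 100 = 600 Mcells/s.
print(throughput_mcells_per_s(30, 2000, 100))  # 600.0
```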

[Figure: throughput (Mcells/s) vs. simulation size (Mcells) for GPU (10 Series), CPU (Nehalem) and CPU (Non-Nehalem), with the Ramp Up, Knee, Optimal Range, Memory Limit and Soft Memory regions marked. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


1) Perfectly Matched Layers (PML)

    Overview:

Adding PML (absorbing) boundary layers can reduce simulation performance by as much as 50%, which doubles the run time. The maximum simulation size the GPU is capable of running is also somewhat reduced. PML cells require more memory than non-PML cells, which reduces the maximum simulation size, and they are more expensive to compute, which reduces performance. More significantly, PML cells are not included when calculating capacity or speed. Small simulations are impacted more than large ones because PML cells represent a larger fraction of their total computational load.
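To see why small simulations suffer more, consider what fraction of the updated cells a PML shell represents. A minimal sketch, assuming PML is applied to all six boundaries (the function name is illustrative):

```python
def pml_cell_fraction(nx, ny, nz, layers):
    """Fraction of all updated cells that are PML cells when 'layers'
    PML cells are added on every face of an nx x ny x nz volume."""
    total = (nx + 2 * layers) * (ny + 2 * layers) * (nz + 2 * layers)
    return 1.0 - (nx * ny * nz) / total

# A small 50^3 simulation with 10 PML layers is ~64% PML cells;
# a large 500^3 simulation with the same PML is only ~11% PML cells.
print(pml_cell_fraction(50, 50, 50, 10))     # ~0.64
print(pml_cell_fraction(500, 500, 500, 10))  # ~0.11
```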

    Tips:

    - Minimize the number of layers of PML.

    - Understand how much PML your simulation requires and use no more than that.

    - Use maximum PML layers only when absolutely necessary.

[Figure: PML performance. Throughput (Mcells/s) vs. simulation size (Mcells) for 0, 10 and 20 PML layers. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics.]


2) Reads and Read Regions (Observations, DFT, Convergence, etc.)

    Overview:

Reading field data during a simulation can dramatically impact performance. Field data is read when observing simulation output, checking convergence, computing DFTs, etc. How much of the volume is read and how frequently it is read both impact simulation performance. The chart below shows performance for reads of different volumes, expressed as a percentage of the total volume; all six field components are read for each cell, and the number of time steps between reads is swept.
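A rough way to reason about read cost is to estimate how much data each read transfers off the GPU. A minimal sketch, assuming all six field components are stored as 4-byte single-precision floats (an assumption for illustration):

```python
def bytes_per_read(total_cells, volume_fraction):
    """Approximate bytes transferred per read: six field components
    (Ex, Ey, Ez, Hx, Hy, Hz) at 4 bytes each for the cells read."""
    return total_cells * volume_fraction * 6 * 4

# Reading 25% of a 30 Mcell simulation moves roughly 180 MB per read,
# so reading every time step quickly dominates the solve time.
print(bytes_per_read(30e6, 0.25) / 1e6, "MB")  # 180.0 MB
```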

    Tips:

- Keep the read volume to a minimum; only observe the region (volume) that is of direct interest.

- Read only as frequently as is necessary to achieve accurate power, DFT, SAR, optical generation, etc. results.

- For optical generation, far field, etc., only start reading after the simulation has converged.

[Figure: read performance. Throughput (Mcells/s) vs. time steps between reads, for 0%, 25%, 50%, 75% and 100% of the volume read. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Ex, Ey, Ez, Hx, Hy, Hz; Simulation: 30 Mcells, Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


3) Screen Savers

    Overview:

Screen savers, especially graphics-intensive 3D types, can decrease the performance of a simulation. The performance difference between no screen saver and a basic screen saver is negligible. Smaller simulations experience a greater percentage decrease in performance. Occasionally, significantly worse performance is observed; this is abnormal.

    Tips:

- Use a low-detail or blank screen saver, or none at all.

- Use the power management settings to turn off the monitor instead of using a screen saver.

- If you must use a screen saver, confirm your performance is not degraded by more than 10-20%. If it is worse, please contact Acceleware.

[Figure: screen saver performance. Throughput (Mcells/s) vs. simulation size (Mcells) with no screen saver, a blank screen saver and 3D Pipes. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


4) Simulation Orientation

    Overview:

Single-GPU simulations where Z is the smallest dimension by a significant margin will experience a decrease in simulation performance and maximum simulation size. This is due to the way in which memory is allocated; the problem is not unique to GPU FDTD solutions and is also present in CPU-only FDTD solvers. The example below shows an extreme case, where the smallest dimension is 10% of the other dimensions; for less extreme cases the decrease in performance and maximum simulation size is smaller. Partitioning across multiple GPUs will change the effective simulation dimensions on each GPU, and hence the performance.

    Tips:

- Rotate the simulation so that Z is not the smallest dimension (see the sketch below).

- Avoid extreme differences in dimension; cubic simulations show the best performance.
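A trivial sketch of the first tip above: permute the axes before meshing so the smallest extent does not land on Z (purely illustrative; the solver does not reorient geometry for you, and sources and observation regions must be rotated consistently):

```python
def reorient(nx, ny, nz):
    """Return the extents reordered so the smallest is on X and the
    largest is on Z, avoiding the Z-smallest performance penalty."""
    x, y, z = sorted((nx, ny, nz))
    return x, y, z

print(reorient(500, 500, 50))  # (50, 500, 500): Z is no longer smallest
```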

[Figure: simulation orientation performance. Throughput (Mcells/s) vs. simulation size (Mcells) for Cubic, X Smallest, Y Smallest and Z Smallest orientations. With a < b, the dimensions (x, y, z) are: X smallest = (a, b, b); Y smallest = (b, a, b); Z smallest = (b, b, a). Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: 16 Lossy Dielectrics, 4 Layer PML.]


5) Number of Materials

    Overview:

The number of materials can have a large impact on performance, causing up to a 20% decrease. The type of material can also have an effect on performance. For simulations with a variety of both E and H materials, the performance drop is more severe.

    Tips:

- If possible, keep the number of materials below 1024.

- Make sure that all the materials are necessary; some applications add arbitrary complexity by continually varying the number of materials (see the deduplication sketch below).
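One common source of material-count bloat is creating a new material record per object even when the parameters repeat. A minimal deduplication sketch (materials are represented here as plain parameter tuples; the names are illustrative, not an Acceleware API):

```python
def dedupe_materials(materials):
    """Collapse materials with identical parameters into shared entries.

    materials: list of parameter tuples, e.g. (eps_r, sigma).
    Returns (unique_materials, index_per_input_material).
    """
    unique = {}
    indices = []
    for params in materials:
        # setdefault assigns the next id the first time params is seen
        indices.append(unique.setdefault(params, len(unique)))
    return list(unique), indices

mats = [(2.2, 0.0), (4.0, 0.01), (2.2, 0.0)]
print(dedupe_materials(mats))
# ([(2.2, 0.0), (4.0, 0.01)], [0, 1, 0]) -- two unique materials, not three
```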

[Figure: number-of-materials performance. Throughput (Mcells/s) vs. number of unique materials (1 to 32768) for E materials, H materials, and E and H materials. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: 30 Mcells, Cubic, 4 Layer PML.]


6) Dispersive Materials

    Overview:

Dispersive materials have a large impact on simulation performance and maximum simulation size. Both the order (number of poles) of the dispersive materials and the total number of materials present will decrease performance: higher-order dispersive materials show worse performance, and higher numbers of dispersive materials decrease performance further. This applies to all dispersive material types: Drude, Debye, Lorentz, Drude-Lorentz, etc. Dispersive materials also run slower on the CPU; the speed-up factor when using GPUs is roughly the same as for non-dispersive simulations.

Case 1: 1600 non-dispersive materials distributed evenly throughout the entire simulation space.

Case 2: 1 single-pole dispersive material occupying 40% of the total volume contiguously.

Case 3: 1 single-pole dispersive material distributed evenly throughout the entire volume; 40% of the total volume is made up of dispersive materials.

Case 4: 1600 multi-pole dispersive materials distributed contiguously throughout 40% of the total volume.

Case 5: 1600 multi-pole dispersive materials distributed evenly throughout the entire volume; 40% of the total volume is made up of dispersive materials.

    Tips:

    - Restrict the total volume of dispersive materials in any simulation.

- Use the minimum number of dispersive materials and volume to achieve the desired result.

[Figure: dispersive performance. Throughput (Mcells/s) vs. simulation size (Mcells) for Cases 1-5. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 4 Layer PML.]


7) Mesh Density

    Overview:

Increasing the mesh density does not always yield more accurate results; however, it will always increase run time. This is for two reasons: first, there are more cells to compute, and second, Δt must also decrease to maintain simulation stability, which increases the number of time steps required for a given number of periods. The chart below demonstrates the naive linear and actual effect of increasing mesh density on run time, starting from a 10 Mcell simulation.
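The two effects compound. Refining the mesh by a factor k per axis multiplies the cell count by k^3, and the stability condition shrinks Δt by roughly 1/k, so a fixed number of periods needs k times more steps. A worked sketch of the resulting scaling:

```python
def relative_runtime(k):
    """Runtime growth when a 3D mesh is refined by factor k per axis
    for a fixed physical simulation time: cells grow as k**3 and the
    time-step count grows as k (CFL), so total work grows as k**4."""
    return k ** 4

# Doubling the mesh density costs ~16x the runtime,
# not the 8x that cell count alone (naive linear) would suggest.
print(relative_runtime(2))  # 16
```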

    Tips:

- Only increase mesh density if your simulation accuracy requires it.

[Figure: time to complete 6 periods (h:mm:ss) vs. simulation size (Mcells), actual vs. naive linear. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


8) Windows Remote Log In

    Overview:

Remote desktop software can have a large impact on simulation performance: it can cause more than a 50% reduction in speed, produce errors, or prevent the simulation from running at all. This happens because the desktop is virtualized and, in some cases, access to the GPU is limited or nonexistent. The virtualized desktop also consumes GPU resources that are needed for computation.

    Tips:

- Use a KVM switch; it has no impact on performance.

- In general, do not use remote desktop tools.

- If absolutely necessary, use Ultra VNC, which still causes some performance decrease, as shown below.

[Figure: Ultra VNC performance. Throughput (Mcells/s) vs. simulation size (Mcells) with VNC off and VNC on. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


9) Multi-GPU Systems

    Overview:

Using multiple GPUs can have a dramatic effect on performance and total allowable simulation size. Each GPU added to a configuration contributes 80-90 percent of a single GPU's performance and adds approximately 95 percent of a single GPU's capacity to the total allowable simulation size. For example, a simulation running on a single C1060 may get a throughput of 650 Mcells/s; if that same simulation were run on a 4x C1060 (S1070) configuration, the performance would be 650 + 650 × 0.85 × 3 ≈ 2,300 Mcells/s.
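The example above, generalized into a small formula (a sketch; 0.85 is simply the midpoint of the 80-90 percent scaling range quoted above):

```python
def multi_gpu_throughput(single_gpu_mcells_s, n_gpus, scaling=0.85):
    """Estimated throughput when each GPU beyond the first contributes
    'scaling' times the single-GPU rate."""
    return single_gpu_mcells_s * (1 + scaling * (n_gpus - 1))

# 4x C1060 (S1070) from a 650 Mcells/s single-GPU baseline:
print(multi_gpu_throughput(650, 4))  # 2307.5, i.e. ~2,300 Mcells/s
```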

    Tips:

    - Small simulations in the ramp up range will experience a smaller scaling factor than

    simulations in the optimal range.

    - Multi-GPU systems are able to run multiple simultaneous simulations using GPU targeting.

    - Z smallest performance for multi-GPU systems will not be degraded to the same extent as

    for single-GPU systems.

[Figure: multi-GPU performance. Throughput (Mcells/s) vs. simulation size (Mcells) for Dual NVIDIA Tesla S1070, NVIDIA Tesla S1070, NVIDIA Quadro Plex 2200 D2 and NVIDIA Tesla C1060, with a zoomed view of the 0-100 Mcell range. Acceleware SDK: Sanda (9.3.1.11545); Driver: 6.14.11.9038; Observations: Off; Simulation: Cubic, 4 Layer PML, 16 Lossy Dielectrics.]

