
    ACCELEWARE FDTD PERFORMANCE GUIDE

    Nine easy ways to speed up your simulation

February 2010

    Logan Maxwell, Mike Weldon


    Copyright Notice

All material herein is Acceleware copyright and shall not be reproduced, copied, forwarded, published or shared in any manner without prior written authorization from Acceleware.

All rights reserved. Acceleware, the Acceleware logo and wordmark are registered trademarks and/or trademarks of Acceleware Corp. in the United States, Canada and other countries. All other trademarks are the property of their respective owners.


Table of Contents

Copyright Notice
Overview
The Fundamentals of FDTD Performance on GPUs
1) Perfectly Matched Layers (PML)
2) Reads and Read Regions (Observations, DFT, Convergence, etc.)
3) Screen Savers
4) Simulation Orientation
5) Number of Materials
6) Dispersive Materials
7) Mesh Density
8) Windows Remote Log In
9) Multi-GPU Systems


    Overview

    Introduction

    Fastest-possible FDTD simulation performance on GPU and multi-core hardware is a key objective

    for partners and end users alike. The hardware and Acceleware software libraries that make it run

    are obviously key determinants of the ultimate performance, but partners and end users can still

    have a large impact on the performance in ways that are not always obvious. This document

outlines several key simulation parameters that impact simulation performance. Each case includes a brief description of the parameter, a plot illustrating the performance impact, and tips on how to minimize any speed reduction. Improper use of these parameters, whether intended or not, can reduce simulation speed by 50% or more. Understanding the suggestions in this document will help you avoid unnecessary reductions in performance and get the most out of all your simulations.

    Intended Audience

Acceleware partners, end users, and anyone interested in running FDTD simulations optimally on GPUs and multi-core hardware. This document should be considered essential reading for engineers and scientists running FDTD simulations; it will help make sure they are always getting the most out of their simulation tools.


    The Fundamentals of FDTD Performance on GPUs

The chart below shows single-GPU performance only. Faster throughput can be achieved by sharing the computations across multiple GPUs; typical scaling when doing so is 80-90 percent.

Ramp Up: In this range the GPU is not using all of its compute resources and memory bandwidth efficiently. In addition, PML takes up a large portion of the total simulation size and acts to slow the total simulation throughput.

Knee: The knee is the point at which the performance levels off and the GPU is running optimally.

Optimal Range: This is the optimal range because the GPU has found a good balance between computation and communication. The goal of any GPU FDTD code is to maximize this region's breadth and magnitude.

GPU Memory Limit: This is the point at which the GPU runs out of memory and the CPU begins to solve the remaining calculations.

Soft Memory: In this area the CPU is solving the calculations that the GPU does not have memory for. As simulation size goes further into soft memory, performance approaches that of the CPU.

How to calculate throughput performance:

Throughput (Mcells/s) = Simulation Size (Mcells) × Number of Time Steps / Solve Time (s)

Note: Simulation Size does not include PML cells.
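The same calculation as a short Python sketch (function and variable names are illustrative, not part of the Acceleware API):

```python
def throughput_mcells_per_s(size_mcells, time_steps, solve_time_s):
    """Throughput in Mcells/s for a completed run.

    size_mcells:  simulation size in millions of cells, excluding PML cells.
    time_steps:   number of FDTD time steps executed.
    solve_time_s: total solve time in seconds.
    """
    return size_mcells * time_steps / solve_time_s

# Example: a 30 Mcell simulation running 2,000 time steps in 100 s
# sustains 30 * 2000 / 100 = 600 Mcells/s.
print(throughput_mcells_per_s(30, 2000, 100))  # 600.0
```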

[Figure: throughput (Mcells/s) vs. simulation size (Mcells) for GPU (10 Series), CPU (Nehalem) and CPU (Non-Nehalem), with the Ramp Up, Knee, Optimal Range, Memory Limit and Soft Memory regions marked. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


1) Perfectly Matched Layers (PML)

    Overview:

Adding PML (absorbing) boundary layers can reduce simulation performance by as much as 50%, which doubles the run time. The maximum simulation size the GPU is capable of running is also somewhat reduced. PML cells require more memory than non-PML cells, which reduces the maximum simulation size, and they are more expensive to compute, which reduces performance. More significantly, PML cells are not included when calculating capacity or speed. Small simulations are impacted more than large ones because PML cells represent a larger fraction of their total computational load.
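To see why small simulations suffer more, consider what fraction of the updated cells a PML shell represents. A minimal sketch, assuming PML is applied to all six boundaries (the function name is illustrative):

```python
def pml_cell_fraction(nx, ny, nz, layers):
    """Fraction of all updated cells that are PML cells when 'layers'
    PML cells are added on every face of an nx x ny x nz volume."""
    total = (nx + 2 * layers) * (ny + 2 * layers) * (nz + 2 * layers)
    return 1.0 - (nx * ny * nz) / total

# A small 50^3 simulation with 10 PML layers is ~64% PML cells;
# a large 500^3 simulation with the same PML is only ~11% PML cells.
print(pml_cell_fraction(50, 50, 50, 10))     # ~0.64
print(pml_cell_fraction(500, 500, 500, 10))  # ~0.11
```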

    Tips:

    - Minimize the number of layers of PML.

    - Understand how much PML your simulation requires and use no more than that.

    - Use maximum PML layers only when absolutely necessary.

[Figure: PML performance. Throughput (Mcells/s) vs. simulation size (Mcells) for 0, 10 and 20 PML layers. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics.]


2) Reads and Read Regions (Observations, DFT, Convergence, etc.)

    Overview:

Reading field data during a simulation can dramatically impact performance. Field data is read when observing simulation output, checking convergence, computing DFTs, etc. How much of the volume is read and how frequently it is read both impact simulation performance. The chart below shows performance for reads of different volumes, expressed as a percentage of the total volume; all six field components are read for each cell, and the number of time steps between reads is swept.
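A rough way to reason about read cost is to estimate how much data each read transfers off the GPU. A minimal sketch, assuming all six field components are stored as 4-byte single-precision floats (an assumption for illustration):

```python
def bytes_per_read(total_cells, volume_fraction):
    """Approximate bytes transferred per read: six field components
    (Ex, Ey, Ez, Hx, Hy, Hz) at 4 bytes each for the cells read."""
    return total_cells * volume_fraction * 6 * 4

# Reading 25% of a 30 Mcell simulation moves roughly 180 MB per read,
# so reading every time step quickly dominates the solve time.
print(bytes_per_read(30e6, 0.25) / 1e6, "MB")  # 180.0 MB
```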

    Tips:

- Keep the read volume to a minimum; only observe the region (volume) that is of direct interest.

- Read only as frequently as is necessary to achieve accurate power, DFT, SAR, optical generation, etc. results.

- For optical generation, far field, etc., only start reading after the simulation has converged.

[Figure: read performance. Throughput (Mcells/s) vs. time steps between reads, for 0%, 25%, 50%, 75% and 100% of the volume read. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Ex, Ey, Ez, Hx, Hy, Hz; Simulation: 30 Mcells, Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


3) Screen Savers

    Overview:

Screen savers, especially graphics-intensive 3D types, can decrease the performance of a simulation. The performance difference between no screen saver and a basic screen saver is negligible. Smaller simulations experience a greater percentage decrease in performance. Occasionally, significantly worse performance is observed; this is abnormal.

    Tips:

- Use a low-detail or blank screen saver, or none at all.

- Use the power management settings to turn off the monitor instead of using a screen saver.

- If you must use a screen saver, confirm your performance is not degraded by more than 10-20%. If it is worse, please contact Acceleware.

[Figure: screen saver performance. Throughput (Mcells/s) vs. simulation size (Mcells) with no screen saver, a blank screen saver and 3D Pipes. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


4) Simulation Orientation

    Overview:

Single-GPU simulations where Z is the smallest dimension by a significant margin will experience a decrease in simulation performance and maximum simulation size. This is due to the way in which memory is allocated; the problem is not unique to GPU FDTD solutions and is also present in CPU-only FDTD solvers. The example below shows an extreme case, where the smallest dimension is 10% of the other dimensions; for less extreme cases the decrease in performance and maximum simulation size is smaller. Partitioning across multiple GPUs will change the effective simulation dimensions on each GPU, and hence the performance.

    Tips:

- Rotate the simulation so that Z is not the smallest dimension (see the sketch below).

- Avoid extreme differences in dimension; cubic simulations show the best performance.
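A trivial sketch of the first tip above: permute the axes before meshing so the smallest extent does not land on Z (purely illustrative; the solver does not reorient geometry for you, and sources and observation regions must be rotated consistently):

```python
def reorient(nx, ny, nz):
    """Return the extents reordered so the smallest is on X and the
    largest is on Z, avoiding the Z-smallest performance penalty."""
    x, y, z = sorted((nx, ny, nz))
    return x, y, z

print(reorient(500, 500, 50))  # (50, 500, 500): Z is no longer smallest
```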

[Figure: simulation orientation performance. Throughput (Mcells/s) vs. simulation size (Mcells) for Cubic, X Smallest, Y Smallest and Z Smallest orientations. With a < b, the dimensions (x, y, z) are: X smallest = (a, b, b); Y smallest = (b, a, b); Z smallest = (b, b, a). Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: 16 Lossy Dielectrics, 4 Layer PML.]


5) Number of Materials

    Overview:

The number of materials can have a large impact on performance, causing up to a 20% decrease. The type of material can also have an effect on performance. For simulations with a variety of both E and H materials, the performance drop is more severe.

    Tips:

- If possible, keep the number of materials below 1024.

- Make sure that all the materials are necessary; some applications add arbitrary complexity by continually varying the number of materials (see the deduplication sketch below).
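One common source of material-count bloat is creating a new material record per object even when the parameters repeat. A minimal deduplication sketch (materials are represented here as plain parameter tuples; the names are illustrative, not an Acceleware API):

```python
def dedupe_materials(materials):
    """Collapse materials with identical parameters into shared entries.

    materials: list of parameter tuples, e.g. (eps_r, sigma).
    Returns (unique_materials, index_per_input_material).
    """
    unique = {}
    indices = []
    for params in materials:
        # setdefault assigns the next id the first time params is seen
        indices.append(unique.setdefault(params, len(unique)))
    return list(unique), indices

mats = [(2.2, 0.0), (4.0, 0.01), (2.2, 0.0)]
print(dedupe_materials(mats))
# ([(2.2, 0.0), (4.0, 0.01)], [0, 1, 0]) -- two unique materials, not three
```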

[Figure: number-of-materials performance. Throughput (Mcells/s) vs. number of unique materials (1 to 32768) for E materials, H materials, and E and H materials. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: 30 Mcells, Cubic, 4 Layer PML.]


6) Dispersive Materials

    Overview:

Dispersive materials have a large impact on simulation performance and maximum simulation size. Both the order (number of poles) of the dispersive materials and the total number of materials present will decrease performance: higher-order dispersive materials show worse performance, and higher numbers of dispersive materials decrease performance further. This applies to all dispersive material types: Drude, Debye, Lorentz, Drude-Lorentz, etc. Dispersive materials also run slower on the CPU; the speed-up factor when using GPUs is roughly the same as for non-dispersive simulations.

Case 1: 1600 non-dispersive materials distributed evenly throughout the entire simulation space.

Case 2: 1 single-pole dispersive material occupying 40% of the total volume contiguously.

Case 3: 1 single-pole dispersive material distributed evenly throughout the entire volume; 40% of the total volume is made up of dispersive materials.

Case 4: 1600 multi-pole dispersive materials distributed contiguously throughout 40% of the total volume.

Case 5: 1600 multi-pole dispersive materials distributed evenly throughout the entire volume; 40% of the total volume is made up of dispersive materials.

    Tips:

    - Restrict the total volume of dispersive materials in any simulation.

- Use the minimum number of dispersive materials and volume to achieve the desired result.

[Figure: dispersive performance. Throughput (Mcells/s) vs. simulation size (Mcells) for Cases 1-5. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 4 Layer PML.]


7) Mesh Density

    Overview:

Increasing the mesh density does not always yield more accurate results; however, it will always increase run time. This is for two reasons: first, there are more cells to compute, and second, Δt must also decrease to maintain simulation stability, which increases the number of time steps required for a given number of periods. The chart below demonstrates the naive linear and actual effect of increasing mesh density on run time, starting from a 10 Mcell simulation.
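The two effects compound. Refining the mesh by a factor k per axis multiplies the cell count by k^3, and the stability condition shrinks Δt by roughly 1/k, so a fixed number of periods needs k times more steps. A worked sketch of the resulting scaling:

```python
def relative_runtime(k):
    """Runtime growth when a 3D mesh is refined by factor k per axis
    for a fixed physical simulation time: cells grow as k**3 and the
    time-step count grows as k (CFL), so total work grows as k**4."""
    return k ** 4

# Doubling the mesh density costs ~16x the runtime,
# not the 8x that cell count alone (naive linear) would suggest.
print(relative_runtime(2))  # 16
```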

    Tips:

- Only increase mesh density if your simulation accuracy requires it.

[Figure: time to complete 6 periods (h:mm:ss) vs. simulation size (Mcells), actual vs. naive linear. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


8) Windows Remote Log In

    Overview:

Remote desktop software can have a large impact on simulation performance: it can cause more than a 50% reduction in speed, produce errors, or prevent the simulation from running at all. This happens because the desktop is virtualized and, in some cases, access to the GPU is limited or nonexistent. The virtualized desktop also consumes GPU resources that are needed for computation.

    Tips:

- Use a KVM switch; it has no impact on performance.

- In general, do not use remote desktop tools.

- If absolutely necessary, use Ultra VNC, which still causes some performance decrease, as shown below.

[Figure: Ultra VNC performance. Throughput (Mcells/s) vs. simulation size (Mcells) with VNC off and VNC on. Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.]


9) Multi-GPU Systems

    Overview:

Using multiple GPUs can have a dramatic effect on performance and total allowable simulation size. Each GPU added to a configuration contributes 80-90 percent of a single GPU's performance and adds approximately 95 percent of a single GPU's capacity to the total allowable simulation size. For example, a simulation running on a single C1060 may get a throughput of 650 Mcells/s; if that same simulation were run on a 4x C1060 (S1070) configuration, the performance would be 650 + 650 × 0.85 × 3 ≈ 2,300 Mcells/s.
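The example above, generalized into a small formula (a sketch; 0.85 is simply the midpoint of the 80-90 percent scaling range quoted above):

```python
def multi_gpu_throughput(single_gpu_mcells_s, n_gpus, scaling=0.85):
    """Estimated throughput when each GPU beyond the first contributes
    'scaling' times the single-GPU rate."""
    return single_gpu_mcells_s * (1 + scaling * (n_gpus - 1))

# 4x C1060 (S1070) from a 650 Mcells/s single-GPU baseline:
print(multi_gpu_throughput(650, 4))  # 2307.5, i.e. ~2,300 Mcells/s
```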

    Tips:

    - Small simulations in the ramp up range will experience a smaller scaling factor than

    simulations in the optimal range.

    - Multi-GPU systems are able to run multiple simultaneous simulations using GPU targeting.

    - Z smallest performance for multi-GPU systems will not be degraded to the same extent as

    for single-GPU systems.

[Figure: multi-GPU performance. Throughput (Mcells/s) vs. simulation size (Mcells) for Dual NVIDIA Tesla S1070, NVIDIA Tesla S1070, NVIDIA Quadro Plex 2200 D2 and NVIDIA Tesla C1060, with a zoomed view of the 0-100 Mcell range. Acceleware SDK: Sanda (9.3.1.11545); Driver: 6.14.11.9038; Observations: Off; Simulation: Cubic, 4 Layer PML, 16 Lossy Dielectrics.]

