Self-Adaptation of Time-Step-based Simulation Techniques on Heterogeneous HPC Systems
Matthias Korch
University of Bayreuth, Department of Computer Science
NESUS MC and WG Meeting
University of Bayreuth, 10 April 2017
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
3 Preliminary work at UBT
4 Summary and next steps
M. Korch @ NESUS Meeting, 10 April 2017 1/49
Outline
1 Introduction: Why self-adaptive software?
Current landscape of HPC architectures
Current challenges for HPC software
Self-adaptive software and automatic tuning
2 Overview of the SeASiTe project
3 Preliminary work at UBT
4 Summary and next steps
Variety of hardware in the current Top 500 (11/2016)
Life phases of SuperMUC (LRZ Munich)
HPC systems can grow during their lifetime.
+ Another cause of heterogeneity.
                    Phase 1                                       Phase 2
Installation Date   2011      2012          2013                  2015
Processor           Westmere  Sandy Bridge  Ivy Bridge            Haswell
Accelerator         –         –             Phi KNC               –
#Nodes              205       9216          32                    3072
#Cores              8,200     147,456       512 + 3,840           86,016
Memory [TByte]      52        288           2.56                  194
#Islands            1         18            1                     6
Nodes per Island    205       512           32                    512
Proc. per Node      4         2             2 + 2                 2
Cores per Proc.     10        8             8 + 60                14
Cores per Node      40 (80)   16 (32)       16 (32) + 120 (480)   28 (56)
Interconnect        IB QDR    IB FDR10      IB FDR10              IB FDR14
Microarchitecture of Intel Haswell and Skylake
Haswell (Source: c’t 14/2013, p. 116. © Heise Medien)
[Figure: Skylake microarchitecture block diagram (the Kaby Lake architecture has only minimal deviations).]
Skylake (Source: c’t 6/2017, p. 74. © Heise Medien)
Microarchitecture of AMD Bulldozer and Zen
Bulldozer (Source: c’t 25/2011, p. 161. © Heise Medien)
[Figure: Zen microarchitecture block diagram, one of four cores of the Compute Cluster Complex (CCX).]
Zen (Source: c’t 6/2017, p. 75. © Heise Medien)
Xeon Phi “Knights Landing”
[Figure 1: Knights Landing (KNL) block diagram: (a) the CPU, (b) an example tile, and (c) KNL with Omni-Path Fabric integrated on the CPU package. (CHA: caching/home agent; DDR MC: DDR memory controller; DMI: Direct Media Interface; EDC: MCDRAM controllers; MCDRAM: multichannel DRAM; PCIe: PCI Express; VPU: vector processing unit.)]
[Figure 2: KNL CPU die photo.]
Source: Sodani et al., IEEE Micro, March/April 2016, © IEEE
Sunway TaihuLight
10,649,600 cores, Rmax = 93,014 TFLOP/s
SW26010 (Source: Fu et al., Science China – Information Sciences 59, 072001 (2016), © Science China and Springer)
Time of use of CPUs in the Top 500: RISC era
[Chart: number of Top 500 systems per processor family, 1994–2014: Cray, Intel IA-64, MIPS, Power, Sparc, Alpha, PA-RISC, PowerPC, Intel i860.]
Time of use of CPUs in the Top 500: Recent Intel CPUs
[Chart: number of Top 500 systems per Intel processor generation, 2012–2017: Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell.]
Lifetime of Top 500 No. 1 systems
CM-5                    LANL                     06/1993–11/1998   5.5 years
Numerical Wind Tunnel   NAL Japan                11/1993–06/2002   8.5 years
Intel XP/S 140 Paragon  Sandia                   06/1994–11/1998   4.5 years
Hitachi SR2201          Uni Tokyo                06/1996–11/2000   4.5 years
CP-PACS                 Uni Tsukuba              11/1996–06/2003   6.5 years
ASCI Red                Sandia                   06/1997–11/2005   8.5 years
ASCI White              LLNL                     11/2000–06/2007   6.5 years
The Earth Simulator     Earth Simulator Center   06/2002–11/2013   11.5 years
BlueGene/L              LLNL                     11/2004–06/2012   7.5 years
Roadrunner              LANL                     06/2008–11/2012   4.5 years
Jaguar                  Oak Ridge                11/2009–06/2012   2.5 years
Tianhe-1A               National SCC Tianjin     since 11/2010     > 6 years
K Computer              RIKEN                    since 06/2011     > 5.5 years
Sequoia                 LLNL                     since 11/2011     > 5 years
Titan                   Oak Ridge                since 11/2012     > 4 years
Tianhe-2                National SCC Guangzhou   since 06/2013     > 3.5 years
Sunway TaihuLight       National SCC Wuxi        since 06/2016     > 0.5 years
Summary of HPC landscape
Number of cores has reached the order of 10^6.
Large variety of systems, components (CPUs, accelerators, interconnects, ...), and software (OS, compilers, libraries, ...).
Lifetime of systems and time of use of components is limited.
Heterogeneity due to:
Extension phases during the lifetime of an HPC system.
Use of accelerators.
Use of heterogeneous processors.
Complex memory hierarchies.
Heterogeneous network topologies.
Challenges for HPC software
Needs ultra-scalable algorithms. + Nearly embarrassingly parallel.
Needs to be tolerant of faults.
Needs to handle variety of platforms and inputs.
Needs to handle heterogeneity.
+ Needs platform-specific tuning.
+ Sometimes even needs input-specific tuning.
Expectations on self-adaptivity
Handle some of the challenges for HPC software:
Avoid labor-intensive manual performance tuning.
Thus, provide portability of performance across different current and new platforms.
Handle heterogeneous platforms which integrate several different architectures (e.g., CPUs + accelerators).
Provide sensitivity to input.
Examples of self-adaptive software: ATLAS
http://math-atlas.sourceforge.net
“Automatically Tuned Linear Algebra Software”
Uses empirical tuning at installation time for dense linear algebra to adapt parameters (e.g., block sizes) and source code:
Selection between hand-tuned code variants.
Source code generation (for matrix multiply (GEMM)).
Is Search Really Necessary to Generate High-Performance BLAS? [Yotov et al. 2005]
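The kind of installation-time empirical search ATLAS performs can be sketched in a few lines. This is an illustrative Python toy, not ATLAS's generated C kernels: `matmul_blocked` and the candidate sizes are hypothetical stand-ins. Each candidate block size is timed on the target machine and the fastest one is kept.

```python
import time

def matmul_blocked(A, B, n, bs):
    """Blocked n-by-n matrix multiply (lists of lists), block size bs."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a, row_b, row_c = A[i][k], B[k], C[i]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += a * row_b[j]
    return C

def pick_block_size(n, candidates):
    """Empirically time each candidate block size and keep the fastest,
    in the spirit of ATLAS's installation-time search."""
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    best_bs, best_t = None, float("inf")
    for bs in candidates:
        t0 = time.perf_counter()
        matmul_blocked(A, B, n, bs)
        t = time.perf_counter() - t0
        if t < best_t:
            best_bs, best_t = bs, t
    return best_bs

best = pick_block_size(64, [8, 16, 32, 64])
```

The winning block size depends on the cache hierarchy of the machine the search runs on, which is exactly why ATLAS reruns it per installation.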
Examples of self-adaptive software: Active Harmony
http://www.dyninst.org/harmony
[Fig. 2: (a) the overall online tuning workflow; (b) the application-level view of the auto-tuning workflow.]
Source: Tiwari and Hollingsworth, IPDPS’11
Runtime compilation and tuning framework for parallel programs [Hollingsworth, Tiwari et al., since 1998].
Several search strategies: parallel rank order, Nelder-Mead, exhaustive, random.
Uses CHiLL for code (loop) transformations.
Constraint language.
Code servers.
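The server/client interplay behind such online tuners can be sketched as follows. This is an illustrative toy, not the real Active Harmony API: `synthetic_runtime` stands in for an actual per-time-step measurement, and the exhaustive strategy is used only because it is the simplest of the listed search strategies.

```python
import itertools

class TuningServer:
    """Toy stand-in for a tuning server: walks the parameter space
    (exhaustive strategy) and remembers the best measured point."""
    def __init__(self, space):
        self.names = list(space)                  # parameter names
        self.points = itertools.product(*space.values())
        self.best_point, self.best_perf = None, float("inf")

    def propose(self):
        try:
            return dict(zip(self.names, next(self.points)))
        except StopIteration:
            return None                           # search space exhausted

    def report(self, point, perf):
        if perf < self.best_perf:
            self.best_point, self.best_perf = point, perf

def synthetic_runtime(point):
    # hypothetical measurement with its optimum at tile=32, unroll=4
    return abs(point["tile"] - 32) + abs(point["unroll"] - 4)

server = TuningServer({"tile": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]})
while (p := server.propose()) is not None:        # one probe per time step
    server.report(p, synthetic_runtime(p))
```

In the real system the "report" side is a parallel application that loads newly generated code variants and sends measured timings back to the central server.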
Examples of self-adaptive software: Periscope TF
http://periscope.in.tum.de
Close integration of performance analysis and tuning (makes use of Score-P).
Tuning plugin interface.
Several search strategies.
Several plugins implemented:
Compiler Flags Selection Plugin,
MPI Parameters Plugin,
DVFS Plugin (Energy Tuning),
Master Worker Plugin for tuning master-worker applications,
Parallel Pattern Plugin for tuning pipeline applications.
Pathway automatic performance engineering workflow system (Eclipse plugin).
What self-adaptivity cannot do
The space of possibilities of adaptation is predetermined.
Hence, self-adaptivity cannot
invent new algorithms,
port your program to new programming languages or environments,
maintain your software for you (modernize code, add features, fix bugs, ...).
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
Funding and goals
Planned approach
Contributors and expertise
3 Preliminary work at UBT
4 Summary and next steps
The SeASiTe project
Self-Adaptation of Time-Step-based SimulationTechniques on Heterogeneous HPC Systems
Funded by the German Federal Ministry of Education and Research (BMBF).
Program “IKT 2020 – Research for Innovations”.
Call “Foundation-oriented Research for HPC Software”.
3 universities, 6 PhD students, 3 years.
Initial research questions
1 How can we exploit the increasing computational performance of HPC systems?
2 How can we handle the increasing complexity and heterogeneity of HPC systems?
3 How can we help application developers make their software fit for long-term use?
4 How can we obtain portability of performance?
+ Simulation software should adapt itself, as far as possible autonomously, to the continuously changing heterogeneous HPC hardware.
Project goals
Systematic investigation of techniques for self-adaptation. Which techniques are suitable for time-step-based simulations on heterogeneous HPC systems?
Design of a tool set with which application developers can add self-adaptation capabilities to their applications. Implementation of a prototype.
Demonstrate the applicability of the developed techniques and tools for representative applications.
Planned approach
Inclusion of specific properties and the runtime behavior of the simulation algorithms in the self-adaptation: Subdivision of time-step-based algorithms into regular and irregular algorithms.
+ Consideration of different classes of algorithms.
Offline phase: Preselection of program variants on the basis of a performance model (ECM model). Different optimization criteria (runtime, energy) selectable.
Online phase: Probe selected program variants with the given input data during the first time steps.
Progression monitor to record significant changes of runtime or energy behavior during the simulation runs.
Inclusion of multiple self-adaptation techniques to meet the individual characteristics of different applications.
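The offline/online split above can be sketched in a few lines. This is a minimal, hypothetical illustration: `step_fn` stands in for a real per-time-step measurement, and the offline (model-based) preselection is assumed to have already reduced the candidate set to `variants`.

```python
def run_simulation(variants, n_steps, step_fn):
    """Online phase sketch: probe each preselected variant on the real
    input during the first time steps, then run the remaining steps
    with the fastest one. step_fn(variant) returns the measured cost
    of one time step executed with that variant."""
    timings = {v: step_fn(v) for v in variants}    # probing phase
    best = min(timings, key=timings.get)
    production = [best] * (n_steps - len(variants))  # production phase
    return best, list(timings) + production

# Hypothetical example: three code variants with fixed per-step costs.
variants = ["blocked", "fused", "vectorized"]
cost = {"blocked": 2.0, "fused": 1.2, "vectorized": 1.5}
best, schedule = run_simulation(variants, 10, cost.get)
```

A progression monitor would additionally watch the per-step cost during the production phase and trigger re-probing when it deviates significantly.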
Selected classes of applications
1 Stencil codes (regular algorithms):
Iteration over a grid.
Known access structure of the stencil, constant for all grid points (except at the boundary).
+ Probably well predictable.
2 Explicit ODE solvers (structured-irregular algorithms):
Nested loop structure, transformations possible.
Can be applied to different ODE systems.
Method coefficients, number of stages, problem size, ... should be runtime parameters. (Maybe even the ODE system itself.)
+ Input sensitivity.
3 Particle simulations (dynamic-irregular algorithms):
Approximation of distant particles for long-range interactions (tree codes or particle-mesh).
Cut-off radius for short-range interactions (linked cell).
+ Irregular, continuously changing access pattern.
+ Dynamic load balancing required.
M. Korch @ NESUS Meeting, 10 April 2017 24/49
Partners
1 Chemnitz University of Technology
Group “Practical Computer Science”
Gudula Rünger
2 Friedrich-Alexander University Erlangen-Nuremberg
Group “High-Performance Computing”
Gerhard Wellein
3 University of Bayreuth (coordinator)
Group “Parallel and Distributed Systems”
Thomas Rauber and Matthias Korch
4 MEGWARE GmbH (associated partner)
Jürgen Gretzschel
M. Korch @ NESUS Meeting, 10 April 2017 25/49
TU Chemnitz
Field of work:
Implementation of scientific computing applications on HPC systems
Focus: irregular algorithms and application-independent libraries (e.g., sorting, data redistribution, scheduling, and data exchange between simulation programs)
Role:
Contribute expertise in irregular algorithms and, in particular, particle simulations.
Contribute experience with the development of libraries and tools.
Contributed software:
Application codes of particle simulations
Libraries for efficient data redistribution and communication in particle simulations
M. Korch @ NESUS Meeting, 10 April 2017 26/49
FAU Erlangen-Nuremberg
Field of work:
Operation of HPC clusters, user support
Structured, model-based performance engineering (ECM model)
Role:
Contribute expertise in model-based optimization of stencil-like code structures.
Contribute experience with the collection of performance information.
Software:
LIKWID
Kerncraft
M. Korch @ NESUS Meeting, 10 April 2017 27/49
University of Bayreuth
Field of work:
Foundations, technology, and programming of parallel and distributed systems
Scalability of algorithms, in particular from scientific computing
Focus: parallel ODE methods
Self-adaptivity and autotuning
Role:
Contribute experience with self-adaptation and ODE methods.
Contributed software:
Library of ODE solvers and test problems with support for self-adaptivity
PhD students:
Johannes Seiferth, Markus Straubinger
M. Korch @ NESUS Meeting, 10 April 2017 28/49
MEGWARE
Field of work:
Construction and installation of networked computer systems
Since 2000, focus on HPC systems, in particular Linux clusters.
Systems built by MEGWARE regularly reach the Top500.
Role:
Contribute experience gathered from their large number of customers.
Dissemination of project results to customers and business partners to encourage their feedback.
Provide early knowledge about new hardware trends.
Discussion of possible hardware extensions to support self-adaptivity.
Possibly testing of software developed in the project with selected industrial applications.
M. Korch @ NESUS Meeting, 10 April 2017 29/49
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
3 Preliminary work at UBT
Parallel numerical solution of ODE IVPs
Overview of work on optimization techniques
Self-adaptivity and autotuning
4 Summary and next steps
M. Korch @ NESUS Meeting, 10 April 2017 30/49
ODE Initial Value Problems (IVPs)
Simulation of the temporal development of the state y(t) of some system.
Known initial value y0 represents the system state at time t0.
Right-hand-side function f(t, y(t)) computes the derivative y′(t).
Problem: determine the state at time te.
Numerical solution: approximations ηκ advanced by time steps.
Simple method (Euler): follow the derivative (ηκ+1 = ηκ + h · f(tκ, ηκ)).
n equations/components, where n can be large.
[Figure: two panels. “Time steps”: the numerical solution advancing from (t0, y0) via approximations η1, η2, η3, . . . to ηe at te. “Components”: the component functions of y over t0, t1, t2, t3, . . . , te.]
Definition (ODE IVP)
y : R → R^n, y(t) = ?, y(t0) = y0,
f : R × R^n → R^n, y′(t) = f(t, y(t)), t ∈ [t0, te] ⊂ R
M. Korch @ NESUS Meeting, 10 April 2017 31/49
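The explicit Euler scheme named on the slide can be sketched in a few lines of Python (an illustrative sketch, not project code; the function name `euler` and the fixed step size are assumptions made for the example):

```python
import math

def euler(f, t0, y0, te, h):
    """Advance the approximation eta from t0 to te with fixed step size h
    using the explicit Euler method: eta_{k+1} = eta_k + h * f(t_k, eta_k)."""
    t, eta = t0, list(y0)
    while t < te - 1e-12:
        h_eff = min(h, te - t)          # shorten the last step to hit te exactly
        d = f(t, eta)                   # derivative y'(t) from the right-hand side
        eta = [eta_j + h_eff * d_j for eta_j, d_j in zip(eta, d)]
        t += h_eff
    return eta

# Example IVP: y' = -y, y(0) = 1, whose exact solution is y(t) = exp(-t).
approx = euler(lambda t, y: [-y[0]], 0.0, [1.0], 1.0, 1e-4)
```

For this test problem the result at te = 1 is close to exp(−1), with the first-order error in h that motivates the higher-order methods on the following slides.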
Runge-Kutta methods
Runge 1895, Heun 1900, Kutta 1901:
v1 = f(tκ, ηκ)
v2 = f(tκ + c2 hκ, ηκ + hκ a21 v1)
v3 = f(tκ + c3 hκ, ηκ + hκ (a31 v1 + a32 v2))
...
vs = f(tκ + cs hκ, ηκ + hκ (as1 v1 + · · · + as,s−1 vs−1))
ηκ+1 = ηκ + hκ (b1 v1 + · · · + bs vs)
+ Higher order than explicit Euler.
M. Korch @ NESUS Meeting, 10 April 2017 32/49
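A generic explicit Runge-Kutta step following these formulas might look as follows (an illustrative Python sketch; the coefficient arrays `A`, `b`, `c` follow the usual Butcher-tableau convention and are not project code):

```python
def rk_step(f, t, eta, h, A, b, c):
    """One step of an explicit s-stage Runge-Kutta method:
    v_l = f(t + c_l h, eta + h * sum_{i<l} a_li v_i),
    eta_new = eta + h * sum_l b_l v_l."""
    s = len(b)
    v = []
    for l in range(s):
        # Argument vector of stage l uses only the previously computed stages.
        arg = [eta_j + h * sum(A[l][i] * v[i][j] for i in range(l))
               for j, eta_j in enumerate(eta)]
        v.append(f(t + c[l] * h, arg))
    return [eta_j + h * sum(b[l] * v[l][j] for l in range(s))
            for j, eta_j in enumerate(eta)]

# Classical 4th-order method (RK4) as an example coefficient set.
A = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
b = [1/6, 1/3, 1/3, 1/6]
c = [0, 0.5, 0.5, 1]

# Integrate y' = -y over [0, 1] in 10 steps of h = 0.1.
eta, t = [1.0], 0.0
for _ in range(10):
    eta = rk_step(lambda t_, y_: [-y_[0]], t, eta, 0.1, A, b, c)
    t += 0.1
```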
Implicit Runge-Kutta methods
Cauchy 1824 (implicit Euler method):
ηκ+1 = ηκ + hκ · f(tκ+1, ηκ+1)
Butcher 1964:
vl = f(tκ + cl hκ, ηκ + hκ ∑_{i=1}^{s} ali vi),  l = 1, . . . , s
+ Solution of a nonlinear system required, but better stability; suitable for stiff ODE systems.
M. Korch @ NESUS Meeting, 10 April 2017 33/49
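To illustrate why implicit methods require solving a nonlinear system: the implicit Euler update defines its result only implicitly. A simple fixed-point iteration works when h times the Lipschitz constant of f is small enough (a Python sketch with assumed names; production stiff solvers use Newton's method instead, since fixed-point iteration would defeat the purpose for genuinely stiff systems):

```python
def implicit_euler_step(f, t, eta, h, iters=60):
    """One implicit Euler step eta_new = eta + h * f(t + h, eta_new),
    solved by fixed-point iteration. Converges only while h * L(f) < 1;
    stiff solvers therefore use Newton's method instead (assumption of
    this sketch: the fixed-point iteration is contractive)."""
    x = list(eta)                        # initial guess: previous approximation
    for _ in range(iters):
        d = f(t + h, x)
        x = [eta_j + h * d_j for eta_j, d_j in zip(eta, d)]
    return x

# y' = -y, one step of h = 0.1 from eta = [1.0]:
# the iteration solves x = 1 - 0.1 x, i.e. x = 1/1.1.
eta1 = implicit_euler_step(lambda t_, y_: [-y_[0]], 0.0, [1.0], 0.1)
```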
Further ODE methods
One-step methods:
Extrapolation methods
Iterated RK methods
. . .
Multi-step methods:
Adams methods
BDF methods
. . .
General linear methods:
Peer methods
. . .
Waveform relaxation methods
. . .
M. Korch @ NESUS Meeting, 10 April 2017 34/49
Sources of parallelism
Parallelism across the method:
Try to find large independent tasks in the computational structure of one time step (“task parallelism”).
Example: independent stages
+ Usually only small-scale parallelism, e.g., peer methods [Schmitt and Weiner] use up to 8 processors.
Parallelism across the system:
Distribute the equations (components) of the ODE system among processors (“data parallelism”).
+ Potential for parallelism depends on the ODE system.
Parallelism across time steps:
Try to overlap some adjacent time steps.
More parallelism: Picard iteration. + Often slow convergence.
Promising approach: Parareal methods.
M. Korch @ NESUS Meeting, 10 April 2017 35/49
Previous research on optimization techniques
Target platforms: CPU-based supercomputers and university cluster systems (Cray T3E, CLiC, JUMP, JUROPA, HLRB II, SGI UV, SuperMUC, . . . ); some experience with GPUs.
Focus on loop transformations to improve locality.
Tried to exploit the special structure of a large class of ODE systems (limited access distance).
Pipeline-like loop structure enables cache reuse across stages and often also a reduction of storage space.
Methods considered:
Embedded RK methods
Iterated RK methods
Extrapolation methods
Adams–Bashforth methods
Peer methods
M. Korch @ NESUS Meeting, 10 April 2017 36/49
Self-adaptivity
Selected class of methods: iterated RK methods
Yl(0) = yκ, l = 1, . . . , s (predictor step)
k = 1, . . . , m: (corrector steps)
Yl(k) = yκ + hκ ∑_{i=1}^{s} ali Fi(k−1), l = 1, . . . , s (stages)
with Fi(k−1) = f(tκ + ci hκ, Yi(k−1))
yκ+1 = yκ + hκ ∑_{l=1}^{s} bl Fl(m) (new approximation vector)
ŷκ+1 = yκ + hκ ∑_{l=1}^{s} bl Fl(m−1) (embedded solution)
M. Korch @ NESUS Meeting, 10 April 2017 37/49
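The predictor-corrector structure above can be sketched directly (illustrative Python, not the project's Pthreads code; the 1-stage implicit-midpoint tableau used in the example is an assumption chosen for brevity):

```python
def irk_step(f, t, y, h, A, b, c, m):
    """One step of an explicitly iterated RK method: predictor Y_l^(0) = y,
    then m corrector sweeps Y_l^(k) = y + h * sum_i a_li F_i^(k-1),
    finally y_new = y + h * sum_l b_l F_l^(m)."""
    s = len(b)
    Y = [list(y) for _ in range(s)]                       # predictor step
    for _ in range(m):                                    # corrector steps
        F = [f(t + c[i] * h, Y[i]) for i in range(s)]     # F_i^(k-1)
        Y = [[y_j + h * sum(A[l][i] * F[i][j] for i in range(s))
              for j, y_j in enumerate(y)] for l in range(s)]
    F = [f(t + c[l] * h, Y[l]) for l in range(s)]         # F_l^(m)
    return [y_j + h * sum(b[l] * F[l][j] for l in range(s))
            for j, y_j in enumerate(y)]

# 1-stage implicit midpoint rule as the corrector basis (example tableau).
A, b, c = [[0.5]], [1.0], [0.5]
# y' = -y, one step of h = 0.1: the corrector converges to Y = y/1.05,
# so the new approximation is 1 - 0.1/1.05.
eta = irk_step(lambda t_, y_: [-y_[0]], 0.0, [1.0], 0.1, A, b, c, m=20)
```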
Self-adaptivity
Implementation variants
Iteration space:
k = 1, . . . , m: corrector steps
l = 1, . . . , s: iteration over the vectors Yl(k) (stages)
i = 1, . . . , s: iteration over the summands of ∑_{i=1}^{s} ali Fi(k−1)
j = 1, . . . , n: iteration over the system dimension
Pre-existing Pthreads implementation variants:
8 general variants:
Spatial locality, temporal locality for read/write accesses, loop tiling.
Synchronization of all threads by barrier operations.
8 variants for f with limited access distance:
Reduced memory consumption, pipelining of the corrector steps, loop tiling, data distribution.
Synchronization only with neighbors using locks.
NUMA taken into account.
M. Korch @ NESUS Meeting, 10 April 2017 38/49
Self-adaptivity
Challenges
Basic problem:
Performance of the implementation variants depends on the IVP to be solved (in particular, on the access pattern and computational structure of f as well as the system dimension).
+ Cannot decide at compile time which variant will be the fastest.
+ Have to make this decision at runtime (online tuning).
Challenges in doing this:
Large search space:
Large number of possible implementation variants.
Large number of possible parameter values (e.g., for loop tiling).
Large runtime differences of parallel variants due to different scaling.
+ Have to keep the runtime lost by searching for the best configuration as small as possible.
M. Korch @ NESUS Meeting, 10 April 2017 39/49
Self-adaptivity
Approach developed
1 Offline profiling
Collect information about the runtimes of some input-independent program parts.
2 Preselection phase
Filter out implementation variants which are not applicable to the IVP or other runtime parameters.
3 Online profiling phase
Collect information about the runtimes of some input-dependent program parts.
4 Autotuning computation phase
Compute the first time steps with different implementation variants and parameters to find the fastest configuration.
5 Fast computation phase
Compute the remaining time steps with the best configuration found.
M. Korch @ NESUS Meeting, 10 April 2017 40/49
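The autotuning computation phase (phase 4) can be pictured as timing each candidate while it performs real time steps and keeping the fastest one (a strongly simplified Python sketch; the names, the fixed number of tuning steps, and the argument passing are assumptions, and the real implementation also searches parameter values such as tile sizes):

```python
import time

def autotune(variants, step_args, n_tuning_steps=3):
    """Autotuning computation phase (sketch): run the first time steps with
    each candidate variant, time them, and return the fastest variant.
    The tuning steps already advance the simulation, so the search cost is
    only the runtime difference to the (unknown) best variant."""
    best, best_time = None, float("inf")
    for variant in variants:
        start = time.perf_counter()
        for _ in range(n_tuning_steps):
            variant(*step_args)                 # one full time step
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = variant, elapsed
    return best
```

The remaining time steps (phase 5) would then simply call the returned variant in a loop.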
Self-adaptivity
Implementation for the solution method selected
1 Offline profiling
Runtime of barrier operations
2 Preselection phase
Specialized implementation variants, number of equations per thread
3 Online profiling phase
Memory copy operations inside the step control which are only used in some implementation variants
4 Autotuning computation phase
Implementation variants are ordered according to the number of barrier operations. + Exclude slow variants.
Empirical search on a set of block sizes for loop tiling preselected by a working set model.
5 Fast computation phase
M. Korch @ NESUS Meeting, 10 April 2017 41/49
Self-adaptivity
Tile size selection for loop tiling
Basis for the model: manually derived functions of the problem size n, the access distance d, and the number of threads p that describe the number of vector elements accessed within loops (working sets).
Block sizes B1, B2, . . . , BN are determined for relevant working sets and all cache levels.
Assumption for shared caches: every thread gets the same share.
The empirical search samples the preselected block sizes in increasing order (Bi < Bj for 1 ≤ i < j ≤ N) until there is no more improvement in runtime.
If the smallest block size (B1) was best, divide by 2 until there is no more improvement in runtime (bisection method).
M. Korch @ NESUS Meeting, 10 April 2017 42/49
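The search strategy might be sketched as follows (illustrative Python; `runtime` stands for an empirical time measurement per block size and is a placeholder for the sake of the example):

```python
def select_tile_size(runtime, candidates):
    """Empirical tile-size search (sketch): sample the model-preselected
    block sizes B1 < B2 < ... in increasing order and stop as soon as the
    runtime no longer improves; if B1 itself was best, keep halving it
    until halving stops paying off (bisection method)."""
    candidates = sorted(candidates)
    best_b, best_t = candidates[0], runtime(candidates[0])
    for b in candidates[1:]:
        t = runtime(b)
        if t >= best_t:                  # no more improvement: stop sampling
            break
        best_b, best_t = b, t
    if best_b == candidates[0]:          # bisection below the smallest candidate
        b = best_b // 2
        while b >= 1:
            t = runtime(b)
            if t >= best_t:
                break
            best_b, best_t = b, t
            b //= 2
    return best_b
```

With a runtime function whose optimum lies below the smallest preselected candidate, the bisection branch takes over; otherwise the increasing sweep stops at the first local minimum among the candidates.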
Self-adaptivity
Influence of the tile size, working spaces
[Figure: runtime per step and component vs. tile size for the PipeDb2mt variant (AMD Opteron DP 246, BRUSS2D, n = 4.5 · 10^6, Lobatto III C(8), sequential); markers show the tile sizes predicted by the working sets WS2 and WS3 for the L1 cache and WS1, WS2, and WS3 for the L2 cache.]
M. Korch @ NESUS Meeting, 10 April 2017 43/49
Self-adaptivity
Success of the tile size selection, influence of vectorization
BRUSS2D, n = 8 · 10^6, Lobatto III C(8), Intel Core i7-4770 (Haswell), 4 threads, ICC 14.0.1
[Figure: runtime degradation (%) vs. tile size (10^1 to 10^6) for PipeDb2mt, vectorized and not vectorized; the selected tile size lies near the optimum in both plots.]

                 Best block size       Worst block size
Vectorized       128 → 0.817 s         1 048 576 → 2.360 s
Not vectorized   256 → 0.977 s         1 048 576 → 2.393 s
M. Korch @ NESUS Meeting, 10 April 2017 44/49
Self-adaptivity
Success of auto-tuning
STRING problem, n = 8 · 10^6, SGI UV
M. Korch @ NESUS Meeting, 10 April 2017 45/49
Self-adaptivity
Success of auto-tuning

n       p    Implement.   Tile Size  TPS       #Steps  Time for  #Impl.  #Steps for  Speedup
             Selected     Selected   Sel. (s)  for AT  AT (s)    Eval.   5% Ovhd.    Worst/Sel.
BRUSS2D on 4-socket AMD Opteron 6172 system (“Leo”)
5·10^5  12   ppDb1mt      568        0.071     44      4.8       16      467         3.3
5·10^5  48   ppDb1mt      280        0.019     46      1.0       16      146         17.5
8·10^6  12   PipeDb1mtyl  728        1.250     53      111.2     16      720         5.5
8·10^6  48   PipeDb1mtyl  360        0.320     51      24.9      16      537         4.2
STRING on 4-socket AMD Opteron 6172 system (“Leo”)
5·10^5  12   ppDb1mt      608        0.031     59      4.1       16      1450        6.9
5·10^5  48   ppDb1mtyl    360        0.008     45      0.5       16      409         3.4
8·10^6  12   ppDb1mtyl    608        0.512     54      82.1      16      2126        13.3
8·10^6  48   ppDb1mtyl    560        0.127     50      18.7      16      1950        9.7
BRUSS2D on SGI UV
5·10^5  80   PipeDb2mt    1224       0.007     31      0.4       12      560         4.4
5·10^5  480  PipeDb1mt    208        0.006     20      1.3       9       4308        4.7
8·10^6  80   PipeDb1mtyl  1224       0.108     61      23.2      16      3077        16.3
8·10^6  480  PipeDb2mt    1840       0.028     43      5.0       12      2720        6.2
STRING on SGI UV
5·10^5  80   ppDb1mt      280        0.004     46      0.4       16      974         6.0
5·10^5  480  ppDb1mt      456        0.003     29      1.2       8       6720        9.3
8·10^6  80   ppDb1mtyl    456        0.064     45      5.2       16      710         8.3
8·10^6  480  PipeDb2mt    64         0.016     48      2.7       16      2455        10.7
M. Korch @ NESUS Meeting, 10 April 2017 46/49
Related Projects at UBT
Optimization Techniques for Explicit Methods for the GPU-Accelerated Solution of Initial Value Problems of Ordinary Differential Equations (OTEGO): Tim Werner (supported by DFG)
Implicit ODE methods: Natalia Kalinnik
Energy efficiency: Matthias Stachowski
Load balancing and task programming models: Andreas Prell
M. Korch @ NESUS Meeting, 10 April 2017 47/49
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
3 Preliminary work at UBT
4 Summary and next steps
M. Korch @ NESUS Meeting, 10 April 2017 48/49
Summary and outlook
Summary
Large variety of HPC hardware, continuously changing. Increasing complexity and heterogeneity.
+ Self-adaptivity intends to make programs adaptable to new hardware and new inputs, thus addressing some of the emergent challenges of HPC.
Next steps
Generalize the approach developed for time-step-based simulations.
Investigate self-adaptation techniques.
Try to do as much tuning as possible in the offline phase.
Develop sophisticated strategies to efficiently select the fastest program variant with corresponding parameters.
Design the tool set.
M. Korch @ NESUS Meeting, 10 April 2017 49/49