Self-Adaptation of Time-Step-based Simulation Techniques on Heterogeneous HPC Systems
Matthias Korch
University of Bayreuth, Department of Computer Science
NESUS MC and WG Meeting
University of Bayreuth, 10 April 2017
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
3 Preliminary work at UBT
4 Summary and next steps
M. Korch @ NESUS Meeting, 10 April 2017 1/49
Outline
1 Introduction: Why self-adaptive software?
Current landscape of HPC architectures
Current challenges for HPC software
Self-adaptive software and automatic tuning
2 Overview of the SeASiTe project
3 Preliminary work at UBT
4 Summary and next steps
Variety of hardware in the current Top 500 (11/2016)
Life phases of SuperMUC (LRZ Munich)
HPC systems can grow during their lifetime.
+ Another cause of heterogeneity.
                    Phase 1                                       Phase 2
Installation Date   2011      2012          2013                  2015
Processor           Westmere  Sandy Bridge  Ivy Bridge            Haswell
Accelerator         –         –             Phi KNC               –
#Nodes              205       9216          32                    3072
#Cores              8,200     147,456       512 + 3,840           86,016
Memory [TByte]      52        288           2.56                  194
#Islands            1         18            1                     6
Nodes per Island    205       512           32                    512
Proc. per Node      4         2             2 + 2                 2
Cores per Proc.     10        8             8 + 60                14
Cores per Node      40 (80)   16 (32)       16 (32) + 120 (480)   28 (56)
Interconnect        IB QDR    IB FDR10      IB FDR10              IB FDR14
Microarchitecture of Intel Haswell and Skylake
Haswell (Source: c’t 14/2013, p. 116. © Heise Medien)
[Figure: Skylake microarchitecture block diagram (the Kaby Lake architecture has only minimal deviations).]
Skylake (Source: c’t 6/2017, p. 74. © Heise Medien)
Microarchitecture of AMD Bulldozer and Zen
Bulldozer (Source: c’t 25/2011, p. 161. © Heise Medien)
[Figure: Zen microarchitecture block diagram, one of four cores of the Compute Cluster Complex (CCX).]
Zen (Source: c’t 6/2017, p. 75. © Heise Medien)
Xeon Phi “Knights Landing”
[Figure 1: Knights Landing (KNL) block diagram: (a) the CPU, (b) an example tile, and (c) KNL with Omni-Path Fabric integrated on the CPU package. (CHA: caching/home agent; DDR MC: DDR memory controller; DMI: Direct Media Interface; EDC: MCDRAM controllers; MCDRAM: multichannel DRAM; PCIe: PCI Express; VPU: vector processing unit.)]
[Figure 2: KNL CPU die photo.]
Source: Sodani et al., IEEE Micro, March/April 2016, © IEEE
Sunway TaihuLight
10,649,600 cores, Rmax = 93,014 TFLOP/s
SW26010 (Source: Fu et al., Science China – Information Sciences 59, 072001 (2016), © Science China and Springer)
Time of use of CPUs in the Top 500: RISC era
[Chart: number of Top 500 systems per processor family, 1994–2014: Cray, Intel IA-64, MIPS, Power, Sparc, Alpha, PA-RISC, PowerPC, Intel i860.]
Time of use of CPUs in the Top 500: Recent Intel CPUs
[Chart: number of Top 500 systems per Intel processor generation, 2012–2017: Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell.]
Lifetime of Top 500 No. 1 systems
CM-5                    LANL                     06/1993–11/1998   5.5 years
Numerical Wind Tunnel   NAL Japan                11/1993–06/2002   8.5 years
Intel XP/S 140 Paragon  Sandia                   06/1994–11/1998   4.5 years
Hitachi SR2201          Uni Tokyo                06/1996–11/2000   4.5 years
CP-PACS                 Uni Tsukuba              11/1996–06/2003   6.5 years
ASCI Red                Sandia                   06/1997–11/2005   8.5 years
ASCI White              LLNL                     11/2000–06/2007   6.5 years
The Earth Simulator     Earth Simulator Center   06/2002–11/2013   11.5 years
BlueGene/L              LLNL                     11/2004–06/2012   7.5 years
Roadrunner              LANL                     06/2008–11/2012   4.5 years
Jaguar                  Oak Ridge                11/2009–06/2012   2.5 years
Tianhe-1A               National SCC Tianjin     since 11/2010     > 6 years
K Computer              RIKEN                    since 06/2011     > 5.5 years
Sequoia                 LLNL                     since 11/2011     > 5 years
Titan                   Oak Ridge                since 11/2012     > 4 years
Tianhe-2                National SCC Guangzhou   since 06/2013     > 3.5 years
Sunway TaihuLight       National SCC Wuxi        since 06/2016     > 0.5 years
Summary of HPC landscape
Number of cores has reached the order of 10^6.
Large variety of systems, components (CPUs, accelerators, interconnects, ...), and software (OS, compilers, libraries, ...).
Lifetime of systems and time of use of components is limited.
Heterogeneity due to:
Extension phases during the lifetime of an HPC system.
Use of accelerators.
Use of heterogeneous processors.
Complex memory hierarchies.
Heterogeneous network topologies.
Challenges for HPC software
Needs ultra-scalable algorithms. + Nearly embarrassingly parallel.
Needs to be tolerant of faults.
Needs to handle variety of platforms and inputs.
Needs to handle heterogeneity.
+ Needs platform-specific tuning.
+ Sometimes even needs input-specific tuning.
Expectations on self-adaptivity
Handle some of the challenges for HPC software:
Avoid labor-intensive manual performance tuning.
Thus, provide portability of performance across different current and new platforms.
Handle heterogeneous platforms which integrate several different architectures (e.g., CPUs + accelerators).
Provide sensitivity to input.
Examples of self-adaptive software: ATLAS
http://math-atlas.sourceforge.net
“Automatically Tuned Linear Algebra Software”
Uses empirical tuning at installation time for dense linear algebra to adapt parameters (e.g., block sizes) and source code:
Selection between hand-tuned code variants.
Source code generation (for matrix multiply (GEMM)).
Is Search Really Necessary to Generate High-Performance BLAS? [Yotov et al. 2005]
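The kind of installation-time empirical search ATLAS performs can be sketched in a few lines. This is an illustrative Python toy, not ATLAS's generated C kernels: `matmul_blocked` and the candidate sizes are hypothetical stand-ins. Each candidate block size is timed on the target machine and the fastest one is kept.

```python
import time

def matmul_blocked(A, B, n, bs):
    """Blocked n-by-n matrix multiply (lists of lists), block size bs."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a, row_b, row_c = A[i][k], B[k], C[i]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += a * row_b[j]
    return C

def pick_block_size(n, candidates):
    """Empirically time each candidate block size and keep the fastest,
    in the spirit of ATLAS's installation-time search."""
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    best_bs, best_t = None, float("inf")
    for bs in candidates:
        t0 = time.perf_counter()
        matmul_blocked(A, B, n, bs)
        t = time.perf_counter() - t0
        if t < best_t:
            best_bs, best_t = bs, t
    return best_bs

best = pick_block_size(64, [8, 16, 32, 64])
```

The winning block size depends on the cache hierarchy of the machine the search runs on, which is exactly why ATLAS reruns it per installation.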
Examples of self-adaptive software: Active Harmony
http://www.dyninst.org/harmony
[Fig. 2: (a) the overall online tuning workflow; (b) the application-level view of the auto-tuning workflow.]
Source: Tiwari and Hollingsworth, IPDPS’11
Runtime compilation and tuning framework for parallel programs [Hollingsworth, Tiwari et al., since 1998].
Several search strategies: parallel rank order, Nelder-Mead, exhaustive, random.
Uses CHiLL for code (loop) transformations.
Constraint language.
Code servers.
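The server/client interplay behind such online tuners can be sketched as follows. This is an illustrative toy, not the real Active Harmony API: `synthetic_runtime` stands in for an actual per-time-step measurement, and the exhaustive strategy is used only because it is the simplest of the listed search strategies.

```python
import itertools

class TuningServer:
    """Toy stand-in for a tuning server: walks the parameter space
    (exhaustive strategy) and remembers the best measured point."""
    def __init__(self, space):
        self.names = list(space)                  # parameter names
        self.points = itertools.product(*space.values())
        self.best_point, self.best_perf = None, float("inf")

    def propose(self):
        try:
            return dict(zip(self.names, next(self.points)))
        except StopIteration:
            return None                           # search space exhausted

    def report(self, point, perf):
        if perf < self.best_perf:
            self.best_point, self.best_perf = point, perf

def synthetic_runtime(point):
    # hypothetical measurement with its optimum at tile=32, unroll=4
    return abs(point["tile"] - 32) + abs(point["unroll"] - 4)

server = TuningServer({"tile": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]})
while (p := server.propose()) is not None:        # one probe per time step
    server.report(p, synthetic_runtime(p))
```

In the real system the "report" side is a parallel application that loads newly generated code variants and sends measured timings back to the central server.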
Examples of self-adaptive software: Periscope TF
http://periscope.in.tum.de
Close integration of performance analysis and tuning (makes use of Score-P).
Tuning plugin interface.
Several search strategies.
Several plugins implemented:
Compiler Flags Selection Plugin,
MPI Parameters Plugin,
DVFS Plugin (Energy Tuning),
Master Worker Plugin for tuning master-worker applications,
Parallel Pattern Plugin for tuning pipeline applications.
Pathway automatic performance engineering workflow system (Eclipse plugin).
What self-adaptivity cannot do
The space of possibilities of adaptation is predetermined.
Hence, self-adaptivity cannot
invent new algorithms,
port your program to new programming languages or environments,
maintain your software for you (modernize code, add features, fix bugs, ...).
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
Funding and goals
Planned approach
Contributors and expertise
3 Preliminary work at UBT
4 Summary and next steps
The SeASiTe project
Self-Adaptation of Time-Step-based SimulationTechniques on Heterogeneous HPC Systems
Funded by the German Federal Ministry of Education and Research (BMBF).
Program “IKT 2020 – Research for Innovations”.
Call “Foundation-oriented Research for HPC Software”.
3 universities, 6 PhD students, 3 years.
Initial research questions
1 How can we exploit the increasing computational performance of HPC systems?
2 How can we handle the increasing complexity and heterogeneity of HPC systems?
3 How can we help application developers make their software fit for long-term use?
4 How can we obtain portability of performance?
+ Simulation software should adapt itself, as far as possible autonomously, to the continuously changing heterogeneous HPC hardware.
Project goals
Systematic investigation of techniques for self-adaptation. Which techniques are suitable for time-step-based simulations on heterogeneous HPC systems?
Design of a tool set with which application developers can add self-adaptation capabilities to their applications. Implementation of a prototype.
Demonstrate the applicability of the developed techniques and tools for representative applications.
Planned approach
Inclusion of specific properties and the runtime behavior of the simulation algorithms in the self-adaptation: Subdivision of time-step-based algorithms into regular and irregular algorithms.
+ Consideration of different classes of algorithms.
Offline phase: Preselection of program variants on the basis of a performance model (ECM model). Different optimization criteria (runtime, energy) selectable.
Online phase: Probe selected program variants with the given input data during the first time steps.
Progression monitor to record significant changes of runtime or energy behavior during the simulation runs.
Inclusion of multiple self-adaptation techniques to meet the individual characteristics of different applications.
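The offline/online split above can be sketched in a few lines. This is a minimal, hypothetical illustration: `step_fn` stands in for a real per-time-step measurement, and the offline (model-based) preselection is assumed to have already reduced the candidate set to `variants`.

```python
def run_simulation(variants, n_steps, step_fn):
    """Online phase sketch: probe each preselected variant on the real
    input during the first time steps, then run the remaining steps
    with the fastest one. step_fn(variant) returns the measured cost
    of one time step executed with that variant."""
    timings = {v: step_fn(v) for v in variants}    # probing phase
    best = min(timings, key=timings.get)
    production = [best] * (n_steps - len(variants))  # production phase
    return best, list(timings) + production

# Hypothetical example: three code variants with fixed per-step costs.
variants = ["blocked", "fused", "vectorized"]
cost = {"blocked": 2.0, "fused": 1.2, "vectorized": 1.5}
best, schedule = run_simulation(variants, 10, cost.get)
```

A progression monitor would additionally watch the per-step cost during the production phase and trigger re-probing when it deviates significantly.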
Selected classes of applications
1 Stencil codes (regular algorithms):
Iteration over a grid.
Known access structure of the stencil, constant for all grid points (except at the boundary).
+ Probably well predictable.
2 Explicit ODE solvers (structured-irregular algorithms):
Nested loop structure, transformations possible.
Can be applied to different ODE systems.
Method coefficients, number of stages, problem size, ... should be runtime parameters. (Maybe even the ODE system itself.)
+ Input sensitivity.
3 Particle simulations (dynamic-irregular algorithms):
Approximation of distant particles for long-range interactions (tree codes or particle-mesh).
Cut-off radius for short-range interactions (linked cell).
+ Irregular, continuously changing access pattern.
+ Dynamic load balancing required.
M. Korch @ NESUS Meeting, 10 April 2017 24/49
Partners
1 Chemnitz University of Technology
Group “Practical Computer Science”
Gudula Rünger
2 Friedrich-Alexander University Erlangen-Nuremberg
Group “High-Performance Computing”
Gerhard Wellein
3 University of Bayreuth (coordinator)
Group “Parallel and Distributed Systems”
Thomas Rauber and Matthias Korch
4 MEGWARE GmbH (associated partner)
Jürgen Gretzschel
M. Korch @ NESUS Meeting, 10 April 2017 25/49
TU Chemnitz
Field of work:
Implementation of scientific computing applications on HPC systems
Focus: irregular algorithms and application-independent libraries (e.g., sorting, data redistribution, scheduling, and data exchange between simulation programs)
Role:
Contribute expertise in irregular algorithms and, in particular, particle simulations.
Contribute experience with the development of libraries and tools.
Contributed software:
Application codes of particle simulations
Libraries for efficient data redistribution and communication in particle simulations
M. Korch @ NESUS Meeting, 10 April 2017 26/49
FAU Erlangen-Nuremberg
Field of work:
Operation of HPC clusters, user support
Structured, model-based performance engineering (ECM model)
Role:
Contribute expertise in model-based optimization of stencil-like code structures.
Contribute experience with the collection of performance information.
Software:
LIKWID
Kerncraft
M. Korch @ NESUS Meeting, 10 April 2017 27/49
University of Bayreuth
Field of work:
Foundations, technology, and programming of parallel and distributed systems
Scalability of algorithms, in particular from scientific computing
Focus: parallel ODE methods
Self-adaptivity and autotuning
Role:
Contribute experience with self-adaptation and ODE methods.
Contributed software:
Library of ODE solvers and test problems with support for self-adaptivity
PhD students:
Johannes Seiferth, Markus Straubinger
M. Korch @ NESUS Meeting, 10 April 2017 28/49
MEGWARE
Field of work:
Construction and installation of networked computer systems
Since 2000, focus on HPC systems, in particular Linux clusters.
Systems built by MEGWARE regularly reach the Top500.
Role:
Contribute experience gathered from their large number of customers.
Dissemination of project results to customers and business partners to encourage their feedback.
Provide early knowledge about new hardware trends.
Discussion of possible hardware extensions to support self-adaptivity.
Possibly testing of software developed in the project with selected industrial applications.
M. Korch @ NESUS Meeting, 10 April 2017 29/49
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
3 Preliminary work at UBT
Parallel numerical solution of ODE IVPs
Overview of work on optimization techniques
Self-adaptivity and autotuning
4 Summary and next steps
M. Korch @ NESUS Meeting, 10 April 2017 30/49
ODE Initial Value Problems (IVPs)
Simulation of the temporal development of the state y(t) of some system.
Known initial value y0 represents the system state at time t0.
Right-hand-side function f(t, y(t)) computes the derivative y′(t).
Problem: determine the state at time te.
Numerical solution: approximations ηκ advanced by time steps.
Simple method (Euler): follow the derivative (ηκ+1 = ηκ + h · f(tκ, ηκ)).
n equations/components, where n can be large.
[Figure: two panels. “Time steps”: the numerical solution advancing from (t0, y0) via approximations η1, η2, η3, . . . to ηe at te. “Components”: the component functions of y over t0, t1, t2, t3, . . . , te.]
Definition (ODE IVP)
y : R → R^n, y(t) = ?, y(t0) = y0,
f : R × R^n → R^n, y′(t) = f(t, y(t)), t ∈ [t0, te] ⊂ R
M. Korch @ NESUS Meeting, 10 April 2017 31/49
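The explicit Euler scheme named on the slide can be sketched in a few lines of Python (an illustrative sketch, not project code; the function name `euler` and the fixed step size are assumptions made for the example):

```python
import math

def euler(f, t0, y0, te, h):
    """Advance the approximation eta from t0 to te with fixed step size h
    using the explicit Euler method: eta_{k+1} = eta_k + h * f(t_k, eta_k)."""
    t, eta = t0, list(y0)
    while t < te - 1e-12:
        h_eff = min(h, te - t)          # shorten the last step to hit te exactly
        d = f(t, eta)                   # derivative y'(t) from the right-hand side
        eta = [eta_j + h_eff * d_j for eta_j, d_j in zip(eta, d)]
        t += h_eff
    return eta

# Example IVP: y' = -y, y(0) = 1, whose exact solution is y(t) = exp(-t).
approx = euler(lambda t, y: [-y[0]], 0.0, [1.0], 1.0, 1e-4)
```

For this test problem the result at te = 1 is close to exp(−1), with the first-order error in h that motivates the higher-order methods on the following slides.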
Runge-Kutta methods
Runge 1895, Heun 1900, Kutta 1901:
v1 = f(tκ, ηκ)
v2 = f(tκ + c2 hκ, ηκ + hκ a21 v1)
v3 = f(tκ + c3 hκ, ηκ + hκ (a31 v1 + a32 v2))
...
vs = f(tκ + cs hκ, ηκ + hκ (as1 v1 + · · · + as,s−1 vs−1))
ηκ+1 = ηκ + hκ (b1 v1 + · · · + bs vs)
+ Higher order than explicit Euler.
M. Korch @ NESUS Meeting, 10 April 2017 32/49
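A generic explicit Runge-Kutta step following these formulas might look as follows (an illustrative Python sketch; the coefficient arrays `A`, `b`, `c` follow the usual Butcher-tableau convention and are not project code):

```python
def rk_step(f, t, eta, h, A, b, c):
    """One step of an explicit s-stage Runge-Kutta method:
    v_l = f(t + c_l h, eta + h * sum_{i<l} a_li v_i),
    eta_new = eta + h * sum_l b_l v_l."""
    s = len(b)
    v = []
    for l in range(s):
        # Argument vector of stage l uses only the previously computed stages.
        arg = [eta_j + h * sum(A[l][i] * v[i][j] for i in range(l))
               for j, eta_j in enumerate(eta)]
        v.append(f(t + c[l] * h, arg))
    return [eta_j + h * sum(b[l] * v[l][j] for l in range(s))
            for j, eta_j in enumerate(eta)]

# Classical 4th-order method (RK4) as an example coefficient set.
A = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
b = [1/6, 1/3, 1/3, 1/6]
c = [0, 0.5, 0.5, 1]

# Integrate y' = -y over [0, 1] in 10 steps of h = 0.1.
eta, t = [1.0], 0.0
for _ in range(10):
    eta = rk_step(lambda t_, y_: [-y_[0]], t, eta, 0.1, A, b, c)
    t += 0.1
```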
Implicit Runge-Kutta methods
Cauchy 1824 (implicit Euler method):
ηκ+1 = ηκ + hκ · f(tκ+1, ηκ+1)
Butcher 1964:
vl = f(tκ + cl hκ, ηκ + hκ ∑_{i=1}^{s} ali vi),  l = 1, . . . , s
+ Solution of a nonlinear system required, but better stability; suitable for stiff ODE systems.
M. Korch @ NESUS Meeting, 10 April 2017 33/49
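To illustrate why implicit methods require solving a nonlinear system: the implicit Euler update defines its result only implicitly. A simple fixed-point iteration works when h times the Lipschitz constant of f is small enough (a Python sketch with assumed names; production stiff solvers use Newton's method instead, since fixed-point iteration would defeat the purpose for genuinely stiff systems):

```python
def implicit_euler_step(f, t, eta, h, iters=60):
    """One implicit Euler step eta_new = eta + h * f(t + h, eta_new),
    solved by fixed-point iteration. Converges only while h * L(f) < 1;
    stiff solvers therefore use Newton's method instead (assumption of
    this sketch: the fixed-point iteration is contractive)."""
    x = list(eta)                        # initial guess: previous approximation
    for _ in range(iters):
        d = f(t + h, x)
        x = [eta_j + h * d_j for eta_j, d_j in zip(eta, d)]
    return x

# y' = -y, one step of h = 0.1 from eta = [1.0]:
# the iteration solves x = 1 - 0.1 x, i.e. x = 1/1.1.
eta1 = implicit_euler_step(lambda t_, y_: [-y_[0]], 0.0, [1.0], 0.1)
```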
Further ODE methods
One-step methods:
Extrapolation methods
Iterated RK methods
. . .
Multi-step methods:
Adams methods
BDF methods
. . .
General linear methods:
Peer methods
. . .
Waveform relaxation methods
. . .
M. Korch @ NESUS Meeting, 10 April 2017 34/49
Sources of parallelism
Parallelism across the method:
Try to find large independent tasks in the computational structure of one time step (“task parallelism”).
Example: independent stages
+ Usually only small-scale parallelism, e.g., peer methods [Schmitt and Weiner] use up to 8 processors.
Parallelism across the system:
Distribute the equations (components) of the ODE system among processors (“data parallelism”).
+ Potential for parallelism depends on the ODE system.
Parallelism across time steps:
Try to overlap some adjacent time steps.
More parallelism: Picard iteration. + Often slow convergence.
Promising approach: Parareal methods.
M. Korch @ NESUS Meeting, 10 April 2017 35/49
Previous research on optimization techniques
Target platforms: CPU-based supercomputers and university cluster systems (Cray T3E, CLiC, JUMP, JUROPA, HLRB II, SGI UV, SuperMUC, . . . ); some experience with GPUs.
Focus on loop transformations to improve locality.
Tried to exploit the special structure of a large class of ODE systems (limited access distance).
Pipeline-like loop structure enables cache reuse across stages and often also a reduction of storage space.
Methods considered:
Embedded RK methods
Iterated RK methods
Extrapolation methods
Adams–Bashforth methods
Peer methods
M. Korch @ NESUS Meeting, 10 April 2017 36/49
Self-adaptivity
Selected class of methods: iterated RK methods
Yl(0) = yκ, l = 1, . . . , s (predictor step)
k = 1, . . . , m: (corrector steps)
Yl(k) = yκ + hκ ∑_{i=1}^{s} ali Fi(k−1), l = 1, . . . , s (stages)
with Fi(k−1) = f(tκ + ci hκ, Yi(k−1))
yκ+1 = yκ + hκ ∑_{l=1}^{s} bl Fl(m) (new approximation vector)
ŷκ+1 = yκ + hκ ∑_{l=1}^{s} bl Fl(m−1) (embedded solution)
M. Korch @ NESUS Meeting, 10 April 2017 37/49
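The predictor-corrector structure above can be sketched directly (illustrative Python, not the project's Pthreads code; the 1-stage implicit-midpoint tableau used in the example is an assumption chosen for brevity):

```python
def irk_step(f, t, y, h, A, b, c, m):
    """One step of an explicitly iterated RK method: predictor Y_l^(0) = y,
    then m corrector sweeps Y_l^(k) = y + h * sum_i a_li F_i^(k-1),
    finally y_new = y + h * sum_l b_l F_l^(m)."""
    s = len(b)
    Y = [list(y) for _ in range(s)]                       # predictor step
    for _ in range(m):                                    # corrector steps
        F = [f(t + c[i] * h, Y[i]) for i in range(s)]     # F_i^(k-1)
        Y = [[y_j + h * sum(A[l][i] * F[i][j] for i in range(s))
              for j, y_j in enumerate(y)] for l in range(s)]
    F = [f(t + c[l] * h, Y[l]) for l in range(s)]         # F_l^(m)
    return [y_j + h * sum(b[l] * F[l][j] for l in range(s))
            for j, y_j in enumerate(y)]

# 1-stage implicit midpoint rule as the corrector basis (example tableau).
A, b, c = [[0.5]], [1.0], [0.5]
# y' = -y, one step of h = 0.1: the corrector converges to Y = y/1.05,
# so the new approximation is 1 - 0.1/1.05.
eta = irk_step(lambda t_, y_: [-y_[0]], 0.0, [1.0], 0.1, A, b, c, m=20)
```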
Self-adaptivity
Implementation variants
Iteration space:
k = 1, . . . , m: corrector steps
l = 1, . . . , s: iteration over the vectors Yl(k) (stages)
i = 1, . . . , s: iteration over the summands of ∑_{i=1}^{s} ali Fi(k−1)
j = 1, . . . , n: iteration over the system dimension
Pre-existing Pthreads implementation variants:
8 general variants:
Spatial locality, temporal locality for read/write accesses, loop tiling.
Synchronization of all threads by barrier operations.
8 variants for f with limited access distance:
Reduced memory consumption, pipelining of the corrector steps, loop tiling, data distribution.
Synchronization only with neighbors using locks.
NUMA taken into account.
M. Korch @ NESUS Meeting, 10 April 2017 38/49
Self-adaptivity
Challenges
Basic problem:
Performance of the implementation variants depends on the IVP to be solved (in particular, on the access pattern and computational structure of f as well as the system dimension).
+ Cannot decide at compile time which variant will be the fastest.
+ Have to make this decision at runtime (online tuning).
Challenges in doing this:
Large search space:
Large number of possible implementation variants.
Large number of possible parameter values (e.g., for loop tiling).
Large runtime differences of parallel variants due to different scaling.
+ Have to keep the runtime lost by searching for the best configuration as small as possible.
M. Korch @ NESUS Meeting, 10 April 2017 39/49
Self-adaptivity
Approach developed
1 Offline profiling
Collect information about the runtimes of some input-independent program parts.
2 Preselection phase
Filter out implementation variants which are not applicable to the IVP or other runtime parameters.
3 Online profiling phase
Collect information about the runtimes of some input-dependent program parts.
4 Autotuning computation phase
Compute the first time steps with different implementation variants and parameters to find the fastest configuration.
5 Fast computation phase
Compute the remaining time steps with the best configuration found.
M. Korch @ NESUS Meeting, 10 April 2017 40/49
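The autotuning computation phase (phase 4) can be pictured as timing each candidate while it performs real time steps and keeping the fastest one (a strongly simplified Python sketch; the names, the fixed number of tuning steps, and the argument passing are assumptions, and the real implementation also searches parameter values such as tile sizes):

```python
import time

def autotune(variants, step_args, n_tuning_steps=3):
    """Autotuning computation phase (sketch): run the first time steps with
    each candidate variant, time them, and return the fastest variant.
    The tuning steps already advance the simulation, so the search cost is
    only the runtime difference to the (unknown) best variant."""
    best, best_time = None, float("inf")
    for variant in variants:
        start = time.perf_counter()
        for _ in range(n_tuning_steps):
            variant(*step_args)                 # one full time step
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = variant, elapsed
    return best
```

The remaining time steps (phase 5) would then simply call the returned variant in a loop.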
Self-adaptivity
Implementation for the solution method selected
1 Offline profiling
Runtime of barrier operations
2 Preselection phase
Specialized implementation variants, number of equations per thread
3 Online profiling phase
Memory copy operations inside the step control which are only used in some implementation variants
4 Autotuning computation phase
Implementation variants are ordered according to the number of barrier operations. + Exclude slow variants.
Empirical search on a set of block sizes for loop tiling preselected by a working set model.
5 Fast computation phase
M. Korch @ NESUS Meeting, 10 April 2017 41/49
Self-adaptivity
Tile size selection for loop tiling
Basis for the model: manually derived functions of the problem size n, the access distance d, and the number of threads p that describe the number of vector elements accessed within loops (working sets).
Block sizes B1, B2, . . . , BN are determined for relevant working sets and all cache levels.
Assumption for shared caches: every thread gets the same share.
The empirical search samples the preselected block sizes in increasing order (Bi < Bj for 1 ≤ i < j ≤ N) until there is no more improvement in runtime.
If the smallest block size (B1) was best, divide by 2 until there is no more improvement in runtime (bisection method).
M. Korch @ NESUS Meeting, 10 April 2017 42/49
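The search strategy might be sketched as follows (illustrative Python; `runtime` stands for an empirical time measurement per block size and is a placeholder for the sake of the example):

```python
def select_tile_size(runtime, candidates):
    """Empirical tile-size search (sketch): sample the model-preselected
    block sizes B1 < B2 < ... in increasing order and stop as soon as the
    runtime no longer improves; if B1 itself was best, keep halving it
    until halving stops paying off (bisection method)."""
    candidates = sorted(candidates)
    best_b, best_t = candidates[0], runtime(candidates[0])
    for b in candidates[1:]:
        t = runtime(b)
        if t >= best_t:                  # no more improvement: stop sampling
            break
        best_b, best_t = b, t
    if best_b == candidates[0]:          # bisection below the smallest candidate
        b = best_b // 2
        while b >= 1:
            t = runtime(b)
            if t >= best_t:
                break
            best_b, best_t = b, t
            b //= 2
    return best_b
```

With a runtime function whose optimum lies below the smallest preselected candidate, the bisection branch takes over; otherwise the increasing sweep stops at the first local minimum among the candidates.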
Self-adaptivity
Influence of the tile size, working spaces
[Figure: runtime per step and component vs. tile size for the PipeDb2mt variant (AMD Opteron DP 246, BRUSS2D, n = 4.5 · 10^6, Lobatto III C(8), sequential); markers show the tile sizes predicted by the working sets WS2 and WS3 for the L1 cache and WS1, WS2, and WS3 for the L2 cache.]
M. Korch @ NESUS Meeting, 10 April 2017 43/49
Self-adaptivity
Success of the tile size selection, influence of vectorization
BRUSS2D, n = 8 · 10^6, Lobatto III C(8), Intel Core i7-4770 (Haswell), 4 threads, ICC 14.0.1
[Figure: runtime degradation (%) vs. tile size (10^1 to 10^6) for PipeDb2mt, vectorized and not vectorized; the selected tile size lies near the optimum in both plots.]

                 Best block size       Worst block size
Vectorized       128 → 0.817 s         1 048 576 → 2.360 s
Not vectorized   256 → 0.977 s         1 048 576 → 2.393 s
M. Korch @ NESUS Meeting, 10 April 2017 44/49
Self-adaptivity
Success of auto-tuning
STRING problem, n = 8 · 10^6, SGI UV
M. Korch @ NESUS Meeting, 10 April 2017 45/49
Self-adaptivity
Success of auto-tuning

n       p    Implement.   Tile Size  TPS       #Steps  Time for  #Impl.  #Steps for  Speedup
             Selected     Selected   Sel. (s)  for AT  AT (s)    Eval.   5% Ovhd.    Worst/Sel.
BRUSS2D on 4-socket AMD Opteron 6172 system (“Leo”)
5·10^5  12   ppDb1mt      568        0.071     44      4.8       16      467         3.3
5·10^5  48   ppDb1mt      280        0.019     46      1.0       16      146         17.5
8·10^6  12   PipeDb1mtyl  728        1.250     53      111.2     16      720         5.5
8·10^6  48   PipeDb1mtyl  360        0.320     51      24.9      16      537         4.2
STRING on 4-socket AMD Opteron 6172 system (“Leo”)
5·10^5  12   ppDb1mt      608        0.031     59      4.1       16      1450        6.9
5·10^5  48   ppDb1mtyl    360        0.008     45      0.5       16      409         3.4
8·10^6  12   ppDb1mtyl    608        0.512     54      82.1      16      2126        13.3
8·10^6  48   ppDb1mtyl    560        0.127     50      18.7      16      1950        9.7
BRUSS2D on SGI UV
5·10^5  80   PipeDb2mt    1224       0.007     31      0.4       12      560         4.4
5·10^5  480  PipeDb1mt    208        0.006     20      1.3       9       4308        4.7
8·10^6  80   PipeDb1mtyl  1224       0.108     61      23.2      16      3077        16.3
8·10^6  480  PipeDb2mt    1840       0.028     43      5.0       12      2720        6.2
STRING on SGI UV
5·10^5  80   ppDb1mt      280        0.004     46      0.4       16      974         6.0
5·10^5  480  ppDb1mt      456        0.003     29      1.2       8       6720        9.3
8·10^6  80   ppDb1mtyl    456        0.064     45      5.2       16      710         8.3
8·10^6  480  PipeDb2mt    64         0.016     48      2.7       16      2455        10.7
M. Korch @ NESUS Meeting, 10 April 2017 46/49
Related Projects at UBT
Optimization Techniques for Explicit Methods for the GPU-Accelerated Solution of Initial Value Problems of Ordinary Differential Equations (OTEGO): Tim Werner (supported by DFG)
Implicit ODE methods: Natalia Kalinnik
Energy efficiency: Matthias Stachowski
Load balancing and task programming models: Andreas Prell
M. Korch @ NESUS Meeting, 10 April 2017 47/49
Outline
1 Introduction: Why self-adaptive software?
2 Overview of the SeASiTe project
3 Preliminary work at UBT
4 Summary and next steps
M. Korch @ NESUS Meeting, 10 April 2017 48/49
Summary and outlook
Summary
Large variety of HPC hardware, continuously changing. Increasing complexity and heterogeneity.
+ Self-adaptivity intends to make programs adaptable to new hardware and new inputs, thus addressing some of the emergent challenges of HPC.
Next steps
Generalize the approach developed for time-step-based simulations.
Investigate self-adaptation techniques.
Try to do as much tuning as possible in the offline phase.
Develop sophisticated strategies to efficiently select the fastest program variant with corresponding parameters.
Design the tool set.
M. Korch @ NESUS Meeting, 10 April 2017 49/49