2018/11/16
NEC SX-Aurora TSUBASA migration workshop
CAU Kiel
Dr. Jens-Olaf Beismann
Senior Benchmarking Analyst
NEC Deutschland GmbH
Agenda
10:00 – 10:30 SX-Aurora TSUBASA: technology overview
10:30 – 11:30 Migration to SX-Aurora TSUBASA: compilers, libraries, …
11:30 – 12:00 RZ Kiel: overview
Lunch break
13:00 – 17:00 Hands-on session: porting, run-time, performance
SX Aurora TSUBASA
Technology overview
4 © NEC Deutschland GmbH 2018
• Dedicated vector processor
• High memory bandwidth
• Commodity processors
• De facto standard x86/Linux environment
Brand-new Vector Supercomputer
• 1.22 TB/s per processor, 150 GB/s per core
• Fortran/C/C++ programming, OpenMP
• Automatic vectorization/parallelization
• High sustained performance on x86/Linux environment
TSUBASA means “wing” in Japanese
Strategy
[Figure: the user works with applications, libraries and tools in a Linux OS environment that runs on the Vector Engine and x86 peripherals. Labels: Linux open environment; Linux asset; High performance; VE high performance on x86/Linux]
Architecture
Aurora Architecture
• x86 node + Vector Engine (VE)
• VE capability is provided on x86/Linux environment
[Figure: Aurora Architecture. Labels: x86 server, Vector Engine, VE OS, x86 node, Linux OS]
SX-Aurora TSUBASA: Desktop Tower, Rack Mount Servers, Supercomputer
Inherited and Changed
[Figure: Previous SX compared with SX-Aurora TSUBASA]
• Previous SX: Application on Super-UX; core processor (SPU, VPU), memory, storage
• SX-Aurora TSUBASA: Application on Linux and VE OS; VE (Vector Engine) with core processor (SPU, VPU) and memory; VH (Vector Host) with x86 and storage
GPGPU and VE
GPGPU Architecture
[Figure: x86 (with memory) and GPGPU (with memory) connected via PCIe. The AP runs on the x86 under the OS and calls CUDA functions on the GPGPU. Labels: exec, Data Transmission, Start Processing, End Processing, Result Transmission, exit]
Frequent PCIe transmission
Disadvantages:
• PCIe bottleneck
• Small memory
• Programming difficulty
Aurora Architecture
[Figure: x86 (with memory) and VE (with memory) connected via PCIe. Labels: exec, OS Function (I/O, etc.), exit]
Whole AP is executed on VE
Advantages:
• Avoiding PCIe bottleneck
• Larger memory
• Standard language
Processor
VE1.0 Spec.
cores/CPU           8
core performance    ~307 GF (DP), ~614 GF (SP)
CPU performance     ~2.45 TF (DP), ~4.91 TF (SP)
cache capacity      16 MB shared
memory bandwidth    1.22 TB/s
memory capacity     24, 48 GB
[Figure: 8 cores (307 GF each, 2.45 TF in total at 1.6 GHz), software controllable cache (16 MB) and 6 x HBM2 memory at 1.22 TB/s. Further bandwidth labels: 3 TB/s, 0.4 TB/s]
Product
Processor
■ Developed by NEC
■ World’s highest memory bandwidth
Card
Products of SX-Aurora TSUBASA
1 VE, 2 VE, 4 VE, 8 VE
A100 Tower, A300 Server, A500 DLC Supercomputer
Core Architecture
[Figure: one core with an SPU (Scalar Processing Unit) and a vector unit in which the group VFMA0, VFMA1, VFMA2, ALU0, ALU1, DIV is repeated 32 times]
Bandwidth labels: 1.22 TB/s / processor (ave. 150 GB/s / core), 400 GB/s / core
Single core peak performance: 268.8 GF = 32 Flops/cycle x 2 (FMA) x 3 x 1.4 GHz
CAU Kiel : 17.2 TF
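The peak-performance formula above can be checked with a few lines of Python. This is only a sketch using the numbers quoted on the slides. Reading the CAU Kiel figure of 17.2 TF as a number of VEs at 1.4 GHz is an inference from the arithmetic, not something the slides state.

```python
# Peak-performance arithmetic with the values quoted on the slides.
FLOPS_PER_CYCLE = 32   # "32 Flops/cycle" in the slide formula
FMA_FACTOR = 2         # a fused multiply-add counts as two floating-point operations
FMA_UNITS = 3          # VFMA0, VFMA1, VFMA2
CORES_PER_VE = 8       # "cores/CPU 8"


def core_peak_gflops(clock_ghz):
    """Peak double-precision GFLOPS of one core at the given clock."""
    return FLOPS_PER_CYCLE * FMA_FACTOR * FMA_UNITS * clock_ghz


def ve_peak_gflops(clock_ghz):
    """Peak double-precision GFLOPS of one Vector Engine (8 cores)."""
    return CORES_PER_VE * core_peak_gflops(clock_ghz)


core_14 = core_peak_gflops(1.4)   # slide: 268.8 GF
core_16 = core_peak_gflops(1.6)   # slide: ~307 GF
ve_16 = ve_peak_gflops(1.6)       # slide: ~2.45 TF

# Assumption: the 17.2 TF quoted for CAU Kiel is made up of VEs at 1.4 GHz.
ve_count_kiel = 17.2e3 / ve_peak_gflops(1.4)

# Average memory bandwidth per core from 1.22 TB/s per processor.
bw_per_core_gbs = 1220 / CORES_PER_VE

print(f"core @1.4 GHz: {core_14:.1f} GF, core @1.6 GHz: {core_16:.1f} GF")
print(f"VE @1.6 GHz: {ve_16 / 1000:.2f} TF, VEs for 17.2 TF: {ve_count_kiel:.2f}")
print(f"average bandwidth per core: {bw_per_core_gbs:.1f} GB/s")
```

The per-core bandwidth comes out slightly above the rounded "150 GB/s / core" on the slide.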
Characteristics
[Figure: positioning of Xeon®, GPGPU and Vector Engine by language specification (standard to special) and specification level (standard to high spec.)]
[Figure: memory bandwidth / processor for Xeon®, GPGPU and Vector Engine]
Fundamental Benchmarks
• STREAM: VE has the highest sustained memory bandwidth / node
• HPL: VE provides competitive FLOPS capability
[Figures: HPL / node and STREAM / node]
• VE provides HPL sustained performance in the same range as SKL/KNL
• VE provides the highest memory bandwidth
Performance/Price
• High price competitiveness
  - The highest STREAM sustained performance / price
  - Competitive HPL sustained performance / price
[Figures: HPL / price and STREAM / price]
• VE provides HPL sustained performance/price in the same range as Intel products
• VE provides the highest memory bandwidth/price
HPCG
• VE provides high HPCG performance per node and per price
• HPL and STREAM are the bookends of the benchmark spectrum, and HPCG stands between them
[Figures: HPCG / node and HPCG / price as performance ratio of Aurora to SKL (each annotated 3x); performance characteristics ranging from performance bound to memory bandwidth bound]
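One common way to picture the range between "performance bound" and "memory bandwidth bound" is a simple roofline estimate. The sketch below is not taken from the slides. It uses the VE 1.0 peak and bandwidth from the processor slide, and the arithmetic intensities are made-up placeholders, not measured values for STREAM, HPCG or HPL.

```python
# Roofline-style estimate: attainable = min(peak, intensity * bandwidth).
PEAK_GFLOPS = 2450.0     # VE 1.0 double-precision peak (~2.45 TF)
BANDWIDTH_GBS = 1220.0   # VE 1.0 memory bandwidth (1.22 TB/s)


def attainable_gflops(flops_per_byte, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Upper bound on GFLOPS for a kernel with the given arithmetic intensity."""
    return min(peak, flops_per_byte * bandwidth)


# Intensity at which a kernel stops being limited by memory bandwidth.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

# Hypothetical intensities chosen only to show the two regimes.
examples = [("bandwidth-bound kernel", 0.1),
            ("intermediate kernel", 0.25),
            ("compute-bound kernel", 10.0)]

for name, intensity in examples:
    bound = attainable_gflops(intensity)
    regime = "memory bandwidth bound" if intensity < ridge_point else "performance bound"
    print(f"{name}: {intensity} flops/byte -> at most {bound:.0f} GF ({regime})")
```

Kernels with low arithmetic intensity benefit directly from the high memory bandwidth, which is the regime the STREAM and HPCG results address.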
SX Aurora TSUBASA
User environment
Usability
Programming Environment
Vector Cross Compiler
• automatic vectorization
• automatic parallelization
• Fortran: F2003, F2008 (partially)
• C: C11
• C++: C++14
• OpenMP: OpenMP 4.5
• MPI: MPI 3.1
$ vi sample.c
$ ncc sample.c
Execution Environment
$ a.out
[Figure: execution of a.out, with VE and VH]
Compilers
Cross compilers :
nfort
ncc
nc++
Tools :
nld, nar, nranlib, …
MPI wrappers :
mpinfort
mpincc
mpinc++
Programming Environment
• NEC supports the latest language standards along with GNU compatibility
▌C/C++
• ISO/IEC 9899:2011 (aka C11)
• ISO/IEC 14882:2014 (aka C++14)
▌Fortran
• ISO/IEC 1539-1:2004 (aka Fortran 2003)
• ISO/IEC 1539-1:2010 (aka Fortran 2008)
▌OpenMP
• Version 4.5
▌Libraries
• libc
• MPI Version 3.1 (fully tuned for Aurora architecture)
• Numeric libraries (BLAS, FFT, LAPACK, etc.)
▌Tools
• GNU Profiler (gprof)
• GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP)
• FtraceViewer / PROGINF
Options
Previous SX option    SX-Aurora TSUBASA option
-Caopt                -O4
-Chopt                -O3
-Cvopt                -O2
-Cvsafe               -O1
-Cnoopt               -O0
-Omove                -fmove-loop-invariants-unsafe
-Onomovediv           -fmove-loop-invariants
-Onomove              -fno-move-loop-invariants
-Popenmp              -fopenmp
-pi,auto              -finline-functions   # no cross-file inlining
…
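When migrating build scripts, the option pairs listed above can be applied mechanically. The following is a minimal sketch of such a lookup. The function name translate_flags and the sample flag string are hypothetical, the table only covers the pairs shown on the slide, and unknown flags are passed through unchanged.

```python
# Old-to-new compiler option pairs exactly as listed on the slide.
OPTION_MAP = {
    "-Caopt": "-O4",
    "-Chopt": "-O3",
    "-Cvopt": "-O2",
    "-Cvsafe": "-O1",
    "-Cnoopt": "-O0",
    "-Omove": "-fmove-loop-invariants-unsafe",
    "-Onomovediv": "-fmove-loop-invariants",
    "-Onomove": "-fno-move-loop-invariants",
    "-Popenmp": "-fopenmp",
    "-pi,auto": "-finline-functions",
}


def translate_flags(flags):
    """Replace known old options in a whitespace-separated flag string."""
    return " ".join(OPTION_MAP.get(flag, flag) for flag in flags.split())


# Hypothetical flag string from an old makefile.
print(translate_flags("-Chopt -Popenmp -I./include"))
```

Options that have no entry in the table still need to be checked by hand against the compiler documentation.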
Options
Compiler diagnostics / listings
-fdiag-vector=0|1|2      # more or less detailed vectorization diagnostics
-fdiag-parallel=0|1|2
-fdiag-inline=0|1|2
-report-all              # get both diagnostics and formatted listing in .L file
Default type size
-fdefault-integer=4|8
-fdefault-real=4|8
-fdefault-double=8|16
Cache usage
-mretain-[all|list-vector|none]
Directives
!CDIR …          !NEC$ …
nodep            ivdep
expand=n         unroll(n)
move             move_unsafe
nomovediv        move
nomove           nomove
outerunroll=n    outerloop_unroll(n)
unroll=n         unroll(n)
…
• Directive conversion tool: nfdirconv
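To make the directive mapping concrete, here is a toy converter for single directive lines. It is not how nfdirconv works internally. It only handles the pairs listed on the slide, assumes free-form lines that contain nothing but one !CDIR directive, and the function name convert_directive is hypothetical.

```python
import re

# Directive pairs from the slide: old !CDIR name -> new !NEC$ name.
SIMPLE = {
    "nodep": "ivdep",
    "move": "move_unsafe",
    "nomovediv": "move",
    "nomove": "nomove",
}
# Directives that carry a count: old "name=n" becomes new "name(n)".
WITH_COUNT = {
    "expand": "unroll",
    "unroll": "unroll",
    "outerunroll": "outerloop_unroll",
}

CDIR_LINE = re.compile(r"^(\s*)!CDIR\s+(\w+)(?:=(\d+))?\s*$", re.IGNORECASE)


def convert_directive(line):
    """Convert one '!CDIR ...' line to '!NEC$ ...'; return other lines unchanged."""
    match = CDIR_LINE.match(line)
    if not match:
        return line
    indent, name, count = match.groups()
    name = name.lower()
    if count is None and name in SIMPLE:
        return f"{indent}!NEC$ {SIMPLE[name]}"
    if count is not None and name in WITH_COUNT:
        return f"{indent}!NEC$ {WITH_COUNT[name]}({count})"
    return line  # unknown directive: leave it for manual review


for old in ["!CDIR NODEP", "  !CDIR OUTERUNROLL=4", "!CDIR nomovediv", "x = y + 1"]:
    print(f"{old!r} -> {convert_directive(old)!r}")
```

For real sources, the conversion tool named above is the appropriate route; the sketch only illustrates what the table expresses.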
Libraries
▌NEC Library provides a wide variety of functions
• NEC Library is fully tuned for Aurora architecture
                                    NEC Lib   MKL
BLAS                                ✓         ✓
LAPACK                              ✓         ✓
ScaLAPACK                           ✓         ✓
FFT                                 ✓         ✓
Random number generators            ✓         ✓
Direct sparse solvers               ✓         ✓
Iterative sparse solvers            ✓         ✓
Functions for Statistics            ✓         ✓
Spline functions                    ✓         ✓
Special functions                   ✓
Approximation and Interpolation     ✓
Numerical Differentials/Integrals   ✓
Roots of Equations                  ✓
Time series analysis                ✓
Sorting and ranking                 ✓
UNIX system function interface
To use extensions like GETARG, FLUSH, ABORT, … subroutines,
compile with
-use F90_UNIX[,F90_UNIX_ENV,…]
See Fortran Users’ Guide 8.2 for details
Endianness
SX-Aurora TSUBASA is little-endian! (Former SX systems were big-endian.)
export VE_FORT_UFMTENDIAN=10,11 (ALL)
Sets the unit numbers of unformatted files that are to be treated as big-endian. When the value is ALL, the setting applies to all unit numbers. Two or more unit numbers can be given, separated by commas.
GNU Fortran extension: convert specifier
open(10, file='test.dat', form='unformatted', &
     convert='big_endian')
• Non-standard Fortran, but supported by nfort
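Why old unformatted files need this treatment can be seen from the byte layout of a single value. The sketch below uses Python's struct module to show one double-precision number in both byte orders. It only illustrates byte order and does not model the record structure of Fortran unformatted files.

```python
import struct

value = 1.0

# The same 8-byte double in both byte orders.
big = struct.pack(">d", value)      # big-endian, as on former SX systems
little = struct.pack("<d", value)   # little-endian, as on SX-Aurora TSUBASA and x86

print("big-endian bytes:   ", big.hex())
print("little-endian bytes:", little.hex())

# Reading big-endian bytes with the wrong byte order gives a meaningless value.
misread = struct.unpack("<d", big)[0]
correct = struct.unpack(">d", big)[0]

print("misread as little-endian:", misread)
print("read as big-endian:      ", correct)
```

The environment variable and the convert specifier shown above let the Fortran runtime do this byte swapping for whole files.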
Correctness
Run-time errors…?
Compile with -traceback
export VE_TRACEBACK=FULL|ALL
reduce optimization
-fcheck=bounds (all)
Stack/heap initialization
-minit-stack=zero|nan
export VE_INIT_HEAP=ZERO|NAN
-mno-vector-fma
export VE_FPE_ENABLE=(DIV,FOF,FUF,INV,INE)
(debugger)
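One reason to try -mno-vector-fma when results differ from a reference run is that a fused multiply-add rounds once, while a separate multiply and add round twice. The sketch below emulates the single rounding with exact rational arithmetic. The helper fma_exact and the input values are illustrative only and say nothing about how the VE hardware is implemented.

```python
from fractions import Fraction


def fma_exact(a, b, c):
    """Emulate a fused multiply-add: compute a*b+c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so that the product a*b is not exactly representable.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = a * b + c         # product is rounded before the addition
fused = fma_exact(a, b, c)   # exact product, single rounding at the end

print("multiply then add:", separate)
print("fused multiply-add:", fused)
```

Both results are valid floating-point answers, so a small difference between an FMA and a non-FMA build is not necessarily a bug.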
Debugger
Eclipse Parallel Tools Platform (PTP) VE plugin provides a GUI debugging environment
[Screenshot: debugger views for process sets, functions, standard output, standard error output, source code, variables and stack trace]
PROGINF performance information
Compile with “-proginf”
export VE_PROGINF=YES|DETAIL
******** Program Information ********
Real Time (sec) : 8783.021690
User Time (sec) : 8753.527852
Vector Time (sec) : 4959.702777
Inst. Count : 8018493848355
V. Inst. Count : 1081598389267
V. Element Count : 221178430822266
V. Load Element Count : 51426036073697
FLOP Count : 140842692526140
MOPS : 29663.826745
MOPS (Real) : 29444.837925
MFLOPS : 16155.280253
MFLOPS (Real) : 16036.016282
A. V. Length : 204.492197
V. Op. Ratio (%) : 97.317633
L1 Cache Miss (sec) : 801.540473
CPU Port Conf. (sec) : 2.435367
V. Arith. Exec. (sec) : 2355.046188
V. Load Exec. (sec) : 2346.286974
VLD LLC Hit Element Ratio (%) : 79.566662
Power Throttling (sec) : 0.000000
Thermal Throttling (sec) : 0.000000
Memory Size Used (MB) : 10956.000000
Start Time (date) : Tue Nov 6 13:05:18 2018 CET
End Time (date) : Tue Nov 6 15:31:41 2018 CET
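A PROGINF listing like the one above is easy to post-process. The sketch below parses a few lines copied from that listing and derives two ratios. That "A. V. Length" equals "V. Element Count" divided by "V. Inst. Count" is inferred from the numbers shown, and the vector time share is simply a derived ratio, not a PROGINF field.

```python
# A few lines copied from the PROGINF listing above.
PROGINF_SAMPLE = """\
Real Time (sec) : 8783.021690
User Time (sec) : 8753.527852
Vector Time (sec) : 4959.702777
V. Inst. Count : 1081598389267
V. Element Count : 221178430822266
A. V. Length : 204.492197
Start Time (date) : Tue Nov 6 13:05:18 2018 CET
"""


def parse_proginf(text):
    """Return a dict of PROGINF fields; numeric values become floats."""
    values = {}
    for line in text.splitlines():
        key, sep, raw = line.partition(":")   # split at the first colon only
        if not sep:
            continue
        try:
            values[key.strip()] = float(raw)
        except ValueError:
            values[key.strip()] = raw.strip()  # e.g. dates stay as text
    return values


info = parse_proginf(PROGINF_SAMPLE)

# Average vector length recomputed from the raw counters.
avg_vlen = info["V. Element Count"] / info["V. Inst. Count"]

# Share of user time spent in vector execution (derived, not a PROGINF field).
vector_share = info["Vector Time (sec)"] / info["User Time (sec)"]

print(f"average vector length: {avg_vlen:.3f}")
print(f"vector time share: {vector_share:.1%}")
```

An average vector length of around 204 and a vector operation ratio above 97 % indicate that this run is well vectorized.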
configure
autoconf: https://github.com/SX-Aurora/autoconf-helper
configure command:
./configure CC=ncc CXX=nc++ FC=nfort F90=nfort \
AR=nar LD=nld AS=nas --host=ve-nec-linux
CMake toolchain (example): https://github.com/SX-Aurora/CMake-toolchain-file
Documentation
Available at
www.rz.uni-kiel.de/de/angebote/hiperf/nec-sx-aurora-tsubasa
▌Official Documentation: http://www.nec.com/en/global/prod/hpc/aurora/document
▌Official Software: http://www.nec.com/en/global/prod/hpc/aurora/ve-software
▌Open Source Software: https://github.com/SX-Aurora
Aurora Forum Website
Aurora Forum community website
Visit https://www.hpc.nec and join our quest
Better communication through BBS, let’s discuss openly!
[Figure: forum topics: Roadmap request, Porting, Evaluation Report, Tuning]
About posting – Please like good posts
• Post text and arbitrary files (Word, PPT, PDF, pictures, movies, software, etc.; there is no limitation on the file type).
Let’s develop better Aurora together!!