Outline
• Parallel system description: p775, p460 and dx360M4 hardware and software.
• Compiler options and libraries used.
• WRF tunable parameters for scaling runs: nproc_x, nproc_y, numtiles, nio_groups, nio_tasks_per_group.
• WRF I/O: serial and parallel netCDF libraries for data I/O.
• WRF runtime parameters for scaling runs: SMT, MPI task grids, OpenMP threads, quilting tasks.
• WRF performance components for scaling runs: computation, communication, I/O, load imbalance.
• WRF scaling run results on p775, p460 and dx360M4.
• Conclusions.
IBM POWER7 p775 system (P7-IH)
[Rack diagram: p775 Management and Network Rack (7042-CR6 HMC consoles, EMS p750 servers (8-way, 32 GB) with 6-HDD drawers, 48-port 1 GbE switches with 2 x 10 GbE uplinks, keyboard/display trays, PDUs) alongside Compute Frames A, B and C (Racks 001-003).]
Similar to the old ECMWF system, water cooled.
• Rack: 3 supernodes; supernode: 4 drawers; drawer: 32 POWER7 chips; chip: 8 POWER7 cores.
• 32 cores/node @ 3.86 GHz, 256 GB/node, 8 GB/core, 1024 cores/supernode, 3072 cores/rack.
• L1: 32 KB instruction/32 KB data; L2: 256 KB/core; L3: 4 MB/core; dual memory channel.
• Diskless nodes, Torrent hub chip, GigE adapters on service nodes.
• HFI fibre network, 4 x storage (GPFS) nodes, 3 x disk enclosures.
• AIX 7.1, XLF 14, VAC 12, GPFS, IBM Parallel Environment, LoadLeveler.
• MPI profile library, Floating Point Monitor, vector and scalar MASS libraries.
IBM POWER7 p460 PureFlex system (Firebird)
CMA system partition, air cooled, rear door heat exchangers.
• Rack: 4 chassis; chassis: 7 nodes; node: 4 POWER7 chips; chip: 8 POWER7 cores.
• 32 cores/node @ 3.55 GHz, 128 GB/node, 4 GB/core, 224 cores/chassis, 896 cores/rack.
• L1: 32 KB instruction/32 KB data; L2: 256 KB/core; L3: 4 MB/core; single memory channel.
• 2 HDs per node, 2 dual-port QDR InfiniBand adapters, GigE.
• Dual-rail fat tree InfiniBand network, 10 x p740 storage (GPFS) nodes, 8 x DSC3700 devices.
• AIX 7.1, XLF 14, VAC 12, GPFS, IBM Parallel Environment, LoadLeveler.
• MPI profile library, vector and scalar MASS libraries.
[Rack diagram: 10 compute racks of 4 chassis / 28 nodes each, Storage Rack 1 (10 x p740 storage nodes), Storage Rack 2 (8 x DSC3700), and an InfiniBand rack with 4 QDR switches and management.]
IBM iDataPlex dx360 M4 system (Sandy Bridge)
NCEP WCOSS system partition, air cooled, rear door heat exchangers.
• Rack: 72 nodes; node: 2 sockets; socket: 8 Intel Xeon E5-2670 (Sandy Bridge) cores.
• 16 cores/node @ 2.6 GHz, 32 GB/node, 2 GB/core, 1152 cores/rack.
• L1: 32 KB instruction/32 KB data; L2: 256 KB/core; L3: 20 MB shared per 8 cores.
• 1 HD per node, Mellanox ConnectX FDR adapter, 10 GigE ports.
• Mellanox 4X FDR full-bisection fat tree InfiniBand network, 10 x x3650 M4 storage nodes, 20 x DSC3700.
• RHEL 6.2, Intel Fortran 13, Intel C 11, GPFS, IBM Parallel Environment, Platform LSF.
• MPI profile library.
[Rack diagram: 8 compute racks of 72 nodes each, an FDR14 InfiniBand rack, an I/O rack with 10 storage nodes, and an I/O rack with DSC3700 enclosures.]
Tested Systems Summary
System | Core frequency (GHz) | Memory (GB/core) | Cores tested | Vector | SMT/HT settings | IC fabric | OS | Storage | Compilers | Parallel environment | Queueing system | Libraries
p775 | 3.86 | 8 (all DIMMs occupied) | 6144 | VSX, only for intrinsics | SMT4 (SMT2 to 2048 cores) | HFI | AIX 7.1 | GPFS, 4 NSD, 3 disk enclosures | IBM XL compilers (AIX v14) | IBM PE for AIX | LoadLeveler | MPI profile, MASS, Hardware Performance Monitor
p460 | 3.55 | 4 (all DIMMs occupied) | 8192 | VSX, only for intrinsics | SMT4 (SMT2 to 512 cores) | QDR dual-rail InfiniBand fat tree | AIX 7.1 | GPFS, 10 NSD, 8 DSC3700 | IBM XL compilers (v14) | IBM PE for AIX | LoadLeveler | MPI profile, MASS
dx360M4 | 2.6 | 2 (all DIMMs occupied) | 6144 | AVX, everywhere | HT (not used) | FDR single-rail InfiniBand fat tree | RHEL 6.2 | GPFS, 10 NSD, 20 DSC3700 | Intel Studio 13 compilers | IBM PE for Linux | LSF | MPI profile
WRF v3.3 Compiler Options
Identical WRF code was used on all systems.

p775:
FCOPTIM = -O3 -qhot -qarch=pwr7 -qtune=pwr7
FCBASEOPTS = -qsmp=omp -qcache=auto -qfloat=rsqrt
Libraries: netcdf, pnetcdf, massp7_simd, massvp7, mpihpm

p460:
FCOPTIM = -O3 -qhot -qarch=pwr7 -qtune=pwr7
FCBASEOPTS = -qsmp=omp -qcache=auto -qfloat=rsqrt
Libraries: netcdf, pnetcdf, massp7_simd, massvp7, mpitrace

dx360M4:
FCOPTIM = -O3 -xAVX -fp-model fast=2 -ip
FCBASEOPTS = -ip -fno-alias -w -ftz -no-prec-div -no-prec-sqrt -align all -openmp
Libraries: netcdf, pnetcdf, mpitrace
WRF tunables for scaling runs

WRF run specifics as defined in namelist.input:
• 5 km horizontal resolution, 6 s time step, 12-hour forecast.
• 2200 x 1200 x 28 grid points.
• One output file per forecast hour.
• Four boundary reads, one every three forecast hours.
The same WRF tunables were used on every system, based on the selections that yielded optimal performance on the p775 (see the namelist sketch after this list):
• nproc_x: logical MPI task partition in the x-direction.
• nproc_y: logical MPI task partition in the y-direction.
• numtiles: number of tiles available to OpenMP.
• nproc_x x nproc_y = number of MPI tasks; critical when comparing communication characteristics.
SMT was used only on the IBM POWER systems:
• SMT was employed by running two OpenMP threads per MPI task.
• SMT2 was used for runs with less than 2048 cores on p775 (best performance).
• SMT2 was used for runs with less than 512 cores on p460 (best performance).
• Hyper-threading was not beneficial on the dx360M4 and was not used.
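For illustration, the decomposition and tiling tunables above are set in the &domains block of namelist.input. A minimal sketch, assuming WRF v3.3's usual placement of these variables and taking the 256-core configuration from the run-parameter table later in this talk (252 compute tasks as 6 x 42, four tiles per task):

  &domains
   nproc_x  = 6,
   nproc_y  = 42,
   numtiles = 4,
  /

nproc_x x nproc_y must equal the number of compute MPI tasks (252 here); the remaining tasks of the job are taken by the I/O quilting servers configured in &namelist_quilt (see the WRF I/O slide).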
WRF I/O
I/O reading and writing of WRF variables:
• 13 I/O write steps, each writing a 7.5 GB file.
• 1 read for the initial conditions (6.74 GB file).
• 4 reads for the boundary conditions (1.47 GB each).
• Data ingests cannot be done asynchronously.
Parallel netCDF (MPI-IO) was used for data ingest in all scaling runs:
• Read option 11 for initial and boundary data.
netCDF quilting was used to write the data files (see the namelist sketch after this list):
• The same I/O tasks and groups were assigned on all three systems for each scaling run.
• The last I/O step is done synchronously, since WRF computation terminates there.
• Quilting I/O times in the WRF timer reports cover only the I/O synchronization time.
• The I/O itself is done by the quilting tasks on the I/O subsystem while the compute tasks keep computing.
Parallel netCDF quilting was not used:
• It could further improve the I/O write steps, especially the last one.
• The early WRF version had problems with IBM Parallel Environment and parallel netCDF.
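As a concrete illustration, these I/O choices map onto namelist.input roughly as follows; a minimal sketch assuming WRF's usual io_form codes (2 = serial netCDF, 11 = parallel netCDF) and the 1 x 4 quilting configuration used for the 256-core runs:

  &time_control
   io_form_input    = 11,
   io_form_boundary = 11,
   io_form_history  = 2,
  /
  &namelist_quilt
   nio_groups          = 1,
   nio_tasks_per_group = 4,
  /

With io_form_history = 2 the quilting servers still write ordinary serial netCDF history files, which is consistent with parallel netCDF quilting not being used.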
WRF uniform variables / tunables
[Figure: two example layouts of 16 compute MPI tasks, nproc_x = 4 / nproc_y = 4 versus nproc_x = 2 / nproc_y = 8, each with one I/O group of two quilting tasks (nio_groups = 1, nio_tasks_per_group = 2, tasks 16 and 17).]
Variables kept unchanged on all systems for the same number of physical cores:
• nproc_x, nproc_y, nio_groups, nio_tasks_per_group, numtiles.
Performance on the POWER systems was always better when:
• nproc_x < nproc_y (thin rectangular MPI task decomposition).
• POWER processors have large non-shared L3 caches, so thin decompositions probably offer better cache utilization.
• Out-of-stride data copying from application to system buffers for MPI communication is minimized with thin decompositions.
• The -O3 -qhot compiler options introduce vector MASS library calls, which deliver better throughput for long vectors (see the worked example below).
Performance on the dx360M4 was better when:
• numtiles > 1, even for runs with a single OpenMP thread.
• numtiles > 1 acts as a cache-blocking mechanism (like NPROMA).
p775 and p460 are favored over the dx360M4:
• For the tested scenarios, when nproc_x < nproc_y.
• numtiles > 2 did not have any further effect on performance.
Choice of nproc_x, nproc_y, numtiles:
• Was based on the best performance on the p775.
• numtiles = 4 was kept as an advantage for the dx360M4.
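As a rough illustration of why the thin decompositions help, using the 2200 x 1200 grid and the 252-task configuration from the tables that follow: a 6 x 42 decomposition gives each MPI task a subdomain of roughly 2200/6 ≈ 367 points in x by 1200/42 ≈ 29 points in y, so the innermost west-east loops stay long and stride-1, which favors both the vector MASS calls and cache reuse; a nearly square 18 x 14 split of the same 252 tasks would cut the x extent to about 122 points.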
WRF runtime parameters
Task affinity (binding) was used in all test runs.
• SMT/HT: ON means one extra OpenMP thread per physical core (green background in the original table); OFF otherwise (yellow background).
• Variables: OpenMP threads, numtiles, MPI tasks, nproc_x, nproc_y, nio_groups, nio_tasks_per_group.
• Physical/logical cores = OpenMP_threads x (nproc_x x nproc_y + nio_groups x nio_tasks_per_group); a worked example follows the table below.
p775 nodes | OpenMP threads | p460 nodes | OpenMP threads | dx360 nodes | OpenMP threads | numtiles | Number of cores | MPI tasks | nproc_x x nproc_y | nio_groups x nio_tasks_per_group
4 | 2 | 4 | 2 | 8 | 1 | 4 | 128 | 124 | 4 x 31 | 1 x 4
8 | 2 | 8 | 2 | 16 | 1 | 4 | 256 | 252 | 6 x 42 | 1 x 4
16 | 2 | 16 | 2 | 32 | 1 | 4 | 512 | 504 | 8 x 63 | 1 x 8
32 | 4 | 32 | 2 | 64 | 2 | 4 | 1024 | 506 | 11 x 46 | 1 x 6
48 | 4 | 48 | 2 | 96 | 2 | 4 | 1536 | 760 | 19 x 40 | 1 x 8
64 | 4 | 64 | 2 | 128 | 2 | 4 | 2048 | 1020 | 20 x 51 | 1 x 4
96 | 4 | 96 | 4 | 192 | 4 | 4 | 3072 | 760 | 19 x 40 | 1 x 8
128 | 4 | 128 | 4 | 256 | 4 | 4 | 4096 | 1020 | 17 x 60 | 1 x 4
192 | 4 | 192 | 4 | 384 | 4 | 4 | 6144 | 1530 | 18 x 85 | 1 x 6
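A worked example of the core formula above, using the 1536-core row: on the dx360M4 (no hyper-threading, two OpenMP threads per MPI task) the total is 2 x (19 x 40 + 1 x 8) = 2 x 768 = 1536 physical cores, i.e. 96 nodes of 16 cores; on the p775 (SMT2, four OpenMP threads per MPI task) the same row gives 4 x 768 = 3072 logical CPUs on 48 nodes x 32 cores = 1536 physical cores.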
WRF Run statistics
Runs on the p775 were done with the MPIHPM library:
• Collect hardware performance monitor (HPM) data (small overhead).
• Estimate sustained GFLOP rates on all systems; the p460 and dx360M4 rates are obtained by scaling the p775 rate by run-time ratios (see the GFLOP rate formulas later).
• System peak rate = number of cores x 8 x core frequency (a worked example follows this list).
• Collect MPI communication statistics.
Runs on the p460 and dx360M4 were done with the MPITRACE library:
• Collect MPI communication statistics.
MPI communication data from the trace libraries help estimate:
• Communication: the minimum communication time among all MPI tasks involved.
• Load imbalance: the median communication time minus the minimum communication time.
Accumulating the internal WRF timers helps estimate:
• Read I/O time: initial file read time + boundary read times.
• Write I/O time: quilting write synchronization time + the last I/O write step.
• Last I/O write step: ~(total elapsed time - total time from the internal timers).
• Total computation: pure computation + communication + load imbalance.
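As an example of the peak-rate formula, at 6144 cores: p775 peak = 6144 x 8 x 3.86 GHz ≈ 189.7 TFLOPS, p460 peak = 6144 x 8 x 3.55 GHz ≈ 174.5 TFLOPS, and dx360M4 peak = 6144 x 8 x 2.6 GHz ≈ 127.8 TFLOPS. The sustained rates on the GFLOP rates slide are reported as a percentage of these peak values.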
WRF Scaling Results
[Chart: p775 total elapsed time (seconds) versus number of p775 cores (128 to 6144), stacked into computation, communication, load imbalance, initialization+termination, read I/O and write I/O.]
WRF Scaling Results
[Chart: p460 total elapsed time (seconds) versus number of p460 cores (128 to 6144), stacked into computation, communication, load imbalance, initialization+termination, read I/O and write I/O.]
WRF Scaling Results
[Chart: dx360M4 total elapsed time (seconds) versus number of dx360M4 cores (128 to 6144), stacked into computation, communication, load imbalance, initialization+termination, read I/O and write I/O.]
WRF Run statistics
[Charts: total elapsed time, total communication time, total pure computation time and total load imbalance time (seconds) versus number of cores (128 to 6144), comparing p775, p460 and dx360M4.]
WRF Run statistics
[Charts: average compute time per time step, average read time per read step and average write time per write step (seconds) versus number of cores (128 to 6144), comparing p775, p460 and dx360M4.]
WRF GFLOP Rates
[Charts: percent of peak performance sustained and sustained GFLOPS versus number of cores (128 to 6144), comparing p775, p460 and dx360M4.]
Peak GFLOPS = number of cores x 8 x core frequency (GHz).
p775 sustained GFLOPS = (10^-9 / p775_run_time) x (PM_VSU_1FLOP + 2 x PM_VSU_2FLOP + 4 x PM_VSU_4FLOP + 8 x PM_VSU_8FLOP)
p460 sustained GFLOPS = p775 sustained GFLOPS x (p775_run_time / p460_run_time)
dx360M4 sustained GFLOPS = p775 sustained GFLOPS x (p775_run_time / dx360M4_run_time)
Conclusions
WRF scales and performs well on all tested systems.
Quilting I/O with netCDF works very well on all systems.
Parallel netCDF for data ingest improves data read times.
WRF is a popular single-precision code.
It runs very well on the dx360M4 system:
• -xAVX works very well.
• The Intel compilers do a great job of producing optimal, fast binaries.
• numtiles acts as a cache-blocking parameter for additional performance.
• Hyper-threading gives no benefit to overall performance.
• Near-neighbor communication is handled effectively by FDR InfiniBand.
It runs OK on the p775 and p460 systems:
• VSX does not work well; the code crashes if compiled with -qsimd.
• The IBM XL compilers do OK with -O3 -qhot (vector MASS library).
• Thin rectangular decompositions work well (caching and vector MASS).
• SMT works well on the p775, due to the available memory-to-core bandwidth.
• The p775 interconnect is overkill for WRF's near-neighbor communication, while it is a good match for the p460.
Performance odds were stacked against the dx360M4:
• Runs with 6144 cores and different nproc_x / nproc_y choices yield even better dx360M4 performance.