Activities of IPCC at the Information Technology Center, The University of Tokyo
(based on the Showcase Presentation on May 5, 2018)
Kengo Nakajima
Lead, Supercomputing Research Division, Information Technology Center, The University of Tokyo, Japan & JCAHPC
Deputy Director, RIKEN R-CCS
Intel Parallel Computing Centers Asia Summit 2018, May 10-11, 2018, Chengdu, China
• Overview of IPCC at ITC/U.Tokyo
  – IPCC since 2014
• Achievements in FY.2017
• Future Perspective
2
Information Technology Center, The University of Tokyo (IPCC since 2014)
– Kengo Nakajima, PI
– Toshihiro Hanawa
– Takashi Shimokawabe
– Akihiro Ida
– Masatoshi Kawai (-> RIKEN)
– Eishi Arima
– Tetsuya Hoshino
– Yohei Miki
– Kenjiro Taura
– Masashi Imano
Earthquake Research Institute, The University of Tokyo
– Takashi Furumura– Tsuyoshi Ichimura– Kohei Fujita
FAU, Germany– Gerhard Wellein
Atmosphere & Ocean Research Institute, The University of Tokyo
– Masaki Satoh– Hiroyasu Hasumi
RIKEN Advanced Institute for Computational Science (AICS)
– Hisashi Yashiro
Japan Agency for Marine-Earth Science & Technology (JAMSTEC)
– Masao Kurogi
Research Organization for Information Science & Technology (RIST)
– Takashi Arakawa
JAMA (Japan Automobile Manufacturers Association)
Supercomputers in ITC/U.Tokyo: 2 big systems, 6-yr. cycle
4
[Figure: system deployment timeline, FY2011-FY2025]
– Yayoi: Hitachi SR16000/M1 (IBM Power-7), 54.9 TFLOPS, 11.2 TB
– T2K Tokyo: 140 TFLOPS, 31.3 TB
– Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx), 1.13 PFLOPS, 150 TB
– Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
– Reedbush (HPE, Broadwell + Pascal): 1.93 PFLOPS (Integrated Supercomputer System for Data Analyses & Scientific Simulations)
– Reedbush-L (HPE): 1.43 PFLOPS (Supercomputer System with Accelerators for Long-Term Executions)
– Oakforest-PACS (Fujitsu, Intel KNL): 25 PFLOPS, 919.3 TB (JCAHPC: Tsukuba, Tokyo)
– Oakbridge-II (Intel/AMD/P9, CPU only): 5+ PFLOPS
– BDEC System (Big Data & Extreme Computing): 50+ PFLOPS (?)
Now operating 2 (or 4) systems!!
• Oakleaf-FX (Fujitsu PRIMEHPC FX10)
  – 1.13 PF, commercial version of K, Apr.2012 – Mar.2018
• Oakbridge-FX (Fujitsu PRIMEHPC FX10)
  – 136.2 TF, for long-time use (up to 168 hr), Apr.2014 – Mar.2018
• Reedbush (HPE, Intel BDW + NVIDIA P100 (Pascal))
  – Integrated Supercomputer System for Data Analyses & Scientific Simulations
  – Jul.2016 – Jun.2020
  – Our first GPU system, DDN IME (Burst Buffer)
  – Reedbush-U: CPU only, 420 nodes, 508 TF (Jul.2016)
  – Reedbush-H: 120 nodes, 2 GPUs/node: 1.42 PF (Mar.2017)
  – Reedbush-L: 64 nodes, 4 GPUs/node: 1.43 PF (Oct.2017)
• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL))
  – JCAHPC (U.Tsukuba & U.Tokyo)
  – 25 PF, #9 in 50th TOP500 (Nov.2017) (#2 in Japan)
  – Omni-Path Architecture, DDN IME (Burst Buffer)
5
JPY (=Watt)/(Peak) GFLOPS rate: smaller is better (more efficient)

  System                                                      JPY/GFLOPS
  Oakleaf/Oakbridge-FX (Fujitsu PRIMEHPC FX10)                 125
  Reedbush-U (HPE, Intel BDW)                                  62.0
  Reedbush-H (HPE, Intel BDW + NVIDIA P100)                    17.1
  Oakforest-PACS (Fujitsu, Intel Xeon Phi/Knights Landing)     16.5
Oakforest-PACS: OFP
• Full operation started on December 1, 2016
• 8,208 Intel Xeon Phi (KNL) nodes, 25 PF peak performance
  – Fujitsu
• TOP500 #9 (#2 in Japan), HPCG #6 (#2) (Nov. 2017)
• JCAHPC: Joint Center for Advanced High Performance Computing
  – University of Tsukuba
  – University of Tokyo
• The system is installed at the Kashiwa-no-Ha (Leaf of Oak) Campus of U.Tokyo, which is between Tokyo and Tsukuba
  – http://jcahpc.jp
7
Software of Oakforest-PACS
• OS: Red Hat Enterprise Linux (login nodes); CentOS or McKernel (compute nodes, dynamically switchable)
  – McKernel: OS for many-core CPUs developed by RIKEN AICS
  – Ultra-lightweight OS compared with Linux; no background noise to user programs
  – Expected to be deployed on the post-K computer
• Compiler: GCC, Intel Compiler, XcalableMP
  – XcalableMP: parallel programming language developed by RIKEN AICS and the University of Tsukuba
  – Easy to develop high-performance parallel applications by adding directives to original code written in C or Fortran (see the sketch below)
• Applications: open-source software
  – ppOpen-HPC, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, and so on
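A minimal sketch of this directive-based style, assuming the standard XcalableMP directive set (nodes/template/distribute/align/loop); the program and array names are hypothetical and not taken from the OFP documentation:

      ! Minimal XcalableMP (Fortran) sketch: the serial loop below is
      ! distributed across nodes only by adding directives.
      program xmp_sketch
      implicit none
      integer, parameter :: n = 1000
      real(8) :: a(n), b(n)
      integer :: i
!$xmp nodes p(*)
!$xmp template t(n)
!$xmp distribute t(block) onto p
!$xmp align a(i) with t(i)
!$xmp align b(i) with t(i)
!$xmp loop on t(i)
      do i = 1, n
        b(i) = dble(i)
        a(i) = 2.0d0*b(i) + 1.0d0   ! each node computes only its own block
      enddo
      end program xmp_sketch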
Applications on OFP
• ARTED – Ab-initio Electron Dynamics
• Lattice QCD – Quantum Chromodynamics
• NICAM & COCO – Atmosphere & Ocean Coupling
• GAMERA/GHYDRA – Earthquake Simulations
• Seism3D – Seismic Wave Propagation
ppOpen-HPC: Overview (Primary Target of IPCC Activities)
• Application framework with automatic tuning (AT); "pp": post-peta-scale
• Five-year project (FY.2011-2015, since April 2011)
  – P.I.: Kengo Nakajima (ITC, The University of Tokyo)
  – Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST/CREST (Supervisor: Prof. Mitsuhisa Sato, RIKEN AICS)
• Target: Oakforest-PACS (original schedule: FY.2015); could be extended to various types of platforms
• Team with 7 institutes, >50 people (5 PDs) from various fields: co-design
• Open-source software: http://ppopenhpc.cc.u-tokyo.ac.jp/, English documents, MIT License
11
12
Featured Developments
• ppOpen-AT: AT language for loop optimization
• HACApK library for H-matrix computation in ppOpen-APPL/BEM (OpenMP/MPI hybrid version)
  – First open-source H-matrix library with OpenMP/MPI hybrid parallelization
• ppOpen-MATH/MP (coupler for multiphysics simulations, loose coupling of FEM & FDM)
• Sparse linear solvers
13
ESSEX-II: 2nd Phase of ppOpen-HPC, Japan-Germany Collaboration (FY.2016-2018)
http://blogs.fau.de/essex/
• JST-CREST (Japan) and DFG-SPPEXA (Germany)
• ESSEX: Equipping Sparse Solvers for Exascale (FY.2013-15)
• ESSEX + Tsukuba (Prof. Sakurai) + U.Tokyo (ppOpen-HPC) = ESSEX-II
  – Leading PI: Prof. Gerhard Wellein (U. Erlangen)
• Mission of our group: preconditioned iterative solvers for eigenvalue problems in quantum science
  – DLR (German Aerospace Research Center)
  – U. Wuppertal, U. Greifswald, Germany
14
15
Activities in FY.2016/2017 (1/4)
• Linear solvers/FEM matrix assembly in ppOpen-HPC
  – Done in FY.2016/2017; further optimization needed
  – Results shown for FEM matrix assembly
• 3D seismic simulations
  – Seism3D: FDM
    • Not yet done
  – GOJIRA/GAMERA/GHYDRA: FEM
    • Done in FY.2016/2017, excellent performance
    • HPC Asia 2018 Best Paper, IPDPS 2018
    • Not accepted for SC17 Gordon Bell
• ppOpen-MATH/MP (coupler)
  – Done in FY.2016/2017
  – El Niño simulations on the K computer
• Atmosphere-ocean coupling (NICAM-COCO)
  – Done: serial optimization in FY.2016, parallel optimization in FY.2017
  – Good scalability
16
Activities in FY.2016/2017 (2/4)
• Multigrid solver (ppOpen-MATH/MG)
  – Done; further optimization needed
• HACApK for H-matrices
  – 1st open-source OpenMP/MPI hybrid implementation
  – Done; further optimization needed
  – HPC Asia 2018, IPDPS 2018, ICCS 2018
• Other activities
  – Chebyshev filter diagonalization
    • Joint work with G. Wellein (FAU, Germany) et al. under the ESSEX-II project
    • Comparisons of OFP and Piz Daint
    • ISC-HPC 2018 Hans Meuer Award Finalist
  – Dynamic loop scheduling
    • Significant performance improvement (20+%)
    • Problems in strong scaling (MPI: point-to-point/collective communication)
    • ICPP 2017 (P2S2)
17
Activities in FY.2016/2017 (3/4)
• Other activities (cont.)
  – Algebraic multigrid method using near-kernel vectors
    • 3D solid mechanics applications
    • Preliminary evaluation on OFP (flat MPI)
    • SC17 Research Poster
  – ChainerMN
    • Deep learning framework for distributed parallel environments
    • Developed by PFN, Japan
    • Implementation on OFP
    • SIAM PP18 Poster
  – OpenFOAM
    • Widely used open-source CFD application
    • Pipelined algorithms
    • Development of an OpenMP/MPI hybrid version
  – Tutorials
    • OpenFOAM
    • Introduction to KNL
18
Publications in FY.2017 (1/2)
• K. Nakajima, T. Hanawa, Communication-Computation Overlapping with Dynamic Loop Scheduling for Preconditioned Parallel Iterative Solvers on Multicore/Manycore Clusters, IEEE Proceedings of the 10th International Workshop on Parallel Programming Models & Systems Software for High-End Computing (P2S2 2017), in conjunction with the 46th International Conference on Parallel Processing (ICPP 2017), Bristol, UK, 2017
• N. Nomura, K. Nakajima, A. Fujii, Robust SA-AMG Solver by Extraction of Near-Kernel Vectors, ACM Proceedings of SC17 (The International Conference for High Performance Computing, Networking, Storage and Analysis) (Research Poster), Denver, CO, USA, 2017
• M. Kawai, A. Ida, K. Nakajima, Hierarchical Parallelization of Multicoloring Algorithms for Block IC Preconditioners, IEEE Proceedings of the 19th International Conference on High Performance Computing and Communications (HPCC 2017), 138-145, Bangkok, Thailand, 2017
• K. Fujita, K. Katsushima, T. Ichimura, M. Horikoshi, K. Nakajima, M. Hori, L. Maddegedara, Wave Propagation Simulation of Complex Multi-Material Problems with Fast Low-Order Unstructured Finite-Element Meshing and Analysis, ACM Proceedings of HPC Asia 2018, Tokyo, Japan, 2018 (Best Paper Award)
• A. Ida, H. Nakashima, M. Kawai, Parallel Hierarchical Matrices with Block Low-rank Representation on Distributed Memory Computer Systems, ACM Proceedings of HPC Asia 2018, Tokyo, Japan, 2018
19
Publications in FY.2017 (2/2)
• M. Horikoshi, L. Meadows, T. Elken, P. Sivakumar, E. Mascarenhas, J. Erwin, D. Durnov, A. Sannikov, T. Hanawa, T. Boku, Scaling Collectives on Large Clusters using Intel(R) Architecture Processors and Fabric, Intel eXtreme Performance Users Group Workshop, in conjunction with HPC Asia 2018, Tokyo, Japan, 2018
• K. Tamura, T. Hanawa, Performance Evaluation of Large-Scale Deep Learning Framework ChainerMN on Oakforest-PACS, SIAM Conference on Parallel Processing for Scientific Computing (PP18), Tokyo, 2018
• T. Hoshino, A. Ida, T. Hanawa, K. Nakajima, Design of Parallel BEM Analyses Framework for SIMD Processors, Proceedings of the 2018 International Conference on Computational Science (ICCS 2018), 2018 (in press)
• T. Ichimura, K. Fujita, M. Horikoshi, L. Meadows, K. Nakajima et al., A Fast Scalable Implicit Solver with Concentrated Computation for Nonlinear Time-evolution Problems on Low-order Unstructured Finite Elements, Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), 2018 (in press)
• A. Ida, Lattice H-Matrices on Distributed-Memory Systems, IPDPS 2018, 2018 (in press)
• M. Kreutzer, D. Ernst, A.R. Bishop, H. Fehske, G. Hager, K. Nakajima, G. Wellein, Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs, Proceedings of ISC High Performance 2018 (Hans Meuer Award Finalist), 2018 (in press)
20
Major Invited Talks in FY.2017
• K. Nakajima, A. Ida, M. Kawai, T. Katagiri, Development of Large-Scale Scientific & Engineering Applications on Post-Peta/Exascale Systems using ppOpen-HPC, APSCIT 2017 Annual Meeting (Asia Pacific Society for Computing and Information Technology), Sapporo, Hokkaido, Japan, July 29, 2017
• K. Nakajima, Application Development Framework for Manycore Architectures on Post-Peta/Exascale Systems, ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, in conjunction with SC17 (The International Conference for High Performance Computing, Networking, Storage and Analysis), Denver, CO, USA, November 13, 2017
• K. Nakajima, Application Development Framework for Manycore Architectures - from Exascale to Post-Moore Era -, International Workshop for High Performance Numerical Algorithms and Applications (HPNAA), Sanya, China, January 8-12, 2018
• K. Nakajima, Computational Science and Engineering in Exascale and Post-Moore Era, JAMSTEC-DKRZ Workshop: Open Workshop: User Services at Earth Science HPC Centers, Yokohama, Japan, March 19, 2018
• Overview of IPCC at ITC/U.Tokyo
• Achievements in FY.2017
• Future Perspective
21
• NICAM-COCO
  – Yashiro, Kurogi, Arakawa, Nakajima
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
22
Atmosphere-Ocean Coupling on OFP by NICAM/COCO/ppOpen-MATH/MP
• A high-resolution global atmosphere-ocean coupled simulation by NICAM and COCO (ocean simulation) through ppOpen-MATH/MP has been achieved on the K computer.
  – ppOpen-MATH/MP is coupling software for models employing various discretization methods.
• An O(km)-mesh NICAM-COCO coupled simulation is planned on the Oakforest-PACS system.
  – A big challenge for optimization of the codes on the new Intel Xeon Phi processor
  – New insights for the understanding of global climate dynamics
[c/o M. Satoh (AORI/U.Tokyo) @ SC16]
Dataflow of ppOpen-MATH/MP
[M. Matsumoto et al. 2015]
25
Works in FY.2016
• NICAM
  – Original code (40K+ lines): MPI + automatic parallelization by the K compiler
  – OpenMP statements applied since November 2016
  – Problems
    • Fork-join overhead, load imbalance
    • MPI initialization overhead
    • Numerical instability
• COCO
  – OpenMP + MPI
  – Some of the operations were not parallelized
  – Generally, good improvement
• Coupling
  – 56 km NICAM + 0.25-deg COCO for 1 day: done (not stable)
  – 3.5 km NICAM + 0.10-deg COCO using 7,230 nodes of OFP (for 1 month in 3.5 days of computation)
26
Optimization of NICAM
56-km: 32 nodes, 640 processes; 28-km: 128 nodes, 2,560 processes (supported by Intel engineers)
[Figure: elapsed time (0-10 sec.) for code versions Feb.07-56km, Feb.07-28km, Feb.22-56km, Feb.10-28km]
27
Time measurement of 0.25° COCO (1440x1200x62, 60 nodes); columns: OLD, Feb. 9, Latest; each shown as 64 MPI (flat) / 8OMPx8MPI (sec.)
Code on Feb. 8, 2017: generally, codes with fewer threads are faster.
Main loop (960 steps):
  main:                        117.208 / 146.55   | 106.399 / 126.507 | 97.118 / 121.405
  tracer:                      55.145 / 64.34     | 53.311 / 57.369   | 50.301 / 59.403
  brclinic:                    12.56 / 19.074     | 12.552 / 17.571   | 12.449 / 18.026
  brtropic:                    13.666 / 19.895    | 10.9 / 17.549     | 11.591 / 17.267
  surface flux:                8.537 / 8.076      | 9.629 / 10.804    | 4.84 / 4.744
  vel diag. + vertical diff.:  14.833 / 17.129    | 14.529 / 14.469   | 11.749 / 13.34
  sea ice:                     4.152 / 6.772      | 4.097 / 6.847     | 4.547 / 6.76
  history out:                 2.781 / 0.088      | 0.085 / 0.073     | 0.088 / 0.075
  other:                       5.535 / 11.175     | 1.297 / 1.825     | 1.555 / 1.788
Initial setup (I/O):           53.431 / 28.309    | 45.365 / 25.282   | 43.879 / 24.479
Terminal treatment (I/O):      99.434 / 78.926    | 98.156 / 73.943   | 98.363 / 77.291
From MPI_INIT to MPI_FINALIZE: 270.125 / 253.796  | 249.928 / 225.736 | 239.37 / 223.177
Elapse time:                   305 / 260          | 284 / 232         | 274 / 229
Notes:
• By merging the code received from Horikoshi-san, COCO became faster.
• A virtual MPI rank was introduced.
• The OpenMP directives were refined.
28
Time measurement of 0.1° COCO; columns: 0.25° COCO (1440x1200x62), 60 nodes [64 MPI (flat) / 8OMPx8MPI]; 0.1° COCO (3600x3000x62), 375 nodes [64 MPI (flat) / 4OMPx16MPI] and 1125 nodes [8OMPx8MPI / 4OMPx16MPI] (sec.)
Main loop (960 steps):
  main:                        97.118 / 121.405  | 117.522 / 114.275  | 131.839 / 77.346
  tracer:                      50.301 / 59.403   | 50.444 / 49.215    | 60.758 / 20.603
  brclinic:                    12.449 / 18.026   | 13.087 / 14.614    | 18.257 / 4.950
  brtropic:                    11.591 / 17.267   | 12.431 / 17.975    | 20.176 / 16.996
  surface flux:                4.84 / 4.744      | 21.189 / 11.026    | 9.268 / 19.333
  vel diag. + vertical diff.:  11.749 / 13.34    | 12.230 / 11.852    | 13.643 / 6.804
  sea ice:                     4.547 / 6.76      | 5.088 / 7.087      | 7.592 / 6.264
  history out:                 0.088 / 0.075     | 0.235 / 0.148      | 0.172 / 0.323
  other:                       1.555 / 1.788     | 2.818 / 2.358      | 1.975 / 2.073
Initial setup (I/O):           43.879 / 24.479   | 202.445 / 177.477  | 170.517 / 190.236
Terminal treatment (I/O):      98.363 / 77.291   | 797.995 / 439.936  | 438.548 / 1053.745
From MPI_INIT to MPI_FINALIZE: 239.37 / 223.177  | 1117.971 / 731.690 | 740.907 / 1321.533
Elapse time:                   274 / 229         | 1365 / 751         | 752 / 1401
Notes:
• Hybrid is faster than flat MPI.
• It is only 1.48 times faster when 3 times as many nodes are used.
• For a 1-month calculation, it takes 0.246 wall-clock hours (14.7 minutes) using 375 nodes.
• Considering that NICAM takes 72 wall-clock hours, 1.28 (= 375 x 0.246 / 72) may be the ideal node number for COCO (though memory is insufficient) in the NICOCO simulation.
29
NICAM-COCO Coupling with Higher Resolution
• Results in FY.2017
  – Strong scaling: 224 km/56 km atmosphere + 1.0-deg ocean
  – One-year simulation for 14 km (NICAM) + 0.1 deg (COCO)
  – One-hour simulation for 3.5 km (NICAM) + 0.1 deg (COCO)
• Optimization & performance evaluation of the MPMD climate application (NICAM (atmosphere) + COCO (ocean)) on OFP with ppOpen-MATH/MP
El Niño Simulations (1/3), U.Tokyo, RIKEN
The mechanism of the abrupt termination of the 1997/1998 super El Niño has been revealed by atmosphere-ocean coupling simulations for the entire Earth using ppOpen-HPC on the K computer.
31
El Niño Simulations (2/3), U.Tokyo, RIKEN
The mechanism of the abrupt termination of the 1997/1998 super El Niño has been revealed by atmosphere-ocean coupling simulations for the entire Earth using ppOpen-HPC on the K computer.
32
El Niño Simulations (3/3), U.Tokyo, RIKEN
• A Madden-Julian Oscillation (MJO) event remotely accelerates ocean upwelling to abruptly terminate the 1997/1998 super El Niño
• NICAM (atmosphere)-COCO (ocean) coupling by ppOpen-MATH/MP
• Data assimilation + simulations
  – Integration of simulations + data analyses
• NICAM: 14 km + COCO: 0.25-1.00 deg on K
• Next target: 3.5 km + 0.10 deg (5B meshes) on OFP
(ppOpen-MATH/MP)
NICAM Strong Scaling in FY.2017
• Good scalability if the grid size is sufficiently large
[Figure: NICAM strong scaling. Y-axis: simulation years per wall-clock day (0.001-100, log scale); X-axis: number of MPI processes (1-100,000); series: 224 km and 56 km grids vs. ideal scaling. Data labels: 0.4, 1.1, 2.7, 4.1 and 0.01, 0.03, 0.12, 0.40, 1.01, 1.36 SYPD.]
Challenges in FY.2018: NICAM-COCO Coupling with Higher Resolutions
• 3.5 km (NICAM) + 0.1 deg (COCO) (5B meshes)
• I/O coupling
• NICAM/COCO/IO on the same CPU for efficient use of computational resources
  – Most of the computation is for the atmosphere (NICAM)
• NICAM-COCO
• Earthquake Simulations
  – Ichimura, Fujita, Furumura, Nakajima
  – HPC Asia 2018 (Best Paper), IPDPS 2018
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
36
Code by Prof. T. Ichimura & K. Fujita (ERI/U.Tokyo) for Earthquake Simulations
• Finite element method
  – Tetrahedral elements (2nd order)
  – Nonlinear/linear, dynamic/static solid mechanics
  – Mixed precision, EBE-based multigrid
• SC14 & SC15: Gordon Bell Finalist
• SC16: Best Poster Award; Best Paper Award (WACCPD 2016) (GPU)
• SC17: Best Poster Award; WACCPD 2017: Best Poster Award
• HPC Asia 2018: Best Paper Award
• IPDPS 2018 Paper
37
Earthquake Wave Propagation Analysis
[Figure: a) earthquake wave propagation (0 to -7 km depth); b) city response simulation of central Tokyo (Shinjuku, Ikebukuro, Shibuya, Shinbashi, Ueno, Tokyo Station); c) resident evacuation: two million agents evacuating to the nearest safe site, before and after the earthquake.]
Background for Designing the New Algorithm
• Wider SIMD in recent architectures
  – 128 bit: 2 in double precision (2 in single precision) on the K computer
  – 512 bit: 8 in double precision (16 in single precision) on OFP [post-K computer is also 512 bit]
• Smaller B/F ratio (memory transfer capability)
  – (5.3 PB/s) / (10.5 PFLOPS) = 0.5 for the K computer
  – (4 PB/s) / (25 PFLOPS) = 0.16 for Oakforest-PACS
• Random access and data transfer in sparse matrix-vector product kernels become the bottleneck
• How to use this wider SIMD in finite-element methods while circumventing memory transfer bottlenecks?
39
Design of New Algorithm: GAMERA
• Introduce a time-parallel concentrated-computation algorithm for solving Ku_i = f_i (i = 1, 2, ..., nt)
  – Predict solutions of future time steps by solving several time steps together
  – Reduces solver iterations by increasing compute cost per iteration
  – Total FLOP count is not changed much; however, a more efficient kernel can be used
40
[Figure: the kernel in the standard solver is a sparse matrix-vector product (SpMV) on a single vector u_i for the current time step; the kernel in the new solver is a generalized SpMV (GSpMV) applied to the block of vectors u_i, u_{i+1}, u_{i+2}, u_{i+3} (current and future time steps), which are contiguous in memory and therefore SIMD-efficient, with less random access.]
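A minimal Fortran sketch of the idea (CRS storage and all names are assumptions, not the GAMERA code): the same sparse matrix is applied to nv = 4 time-step vectors stored contiguously per unknown, so the innermost loop becomes contiguous and SIMD-friendly instead of purely random access.

      ! Standard SpMV: one vector, random access to u(item(k)).
      subroutine spmv(n, index, item, A, u, f)
        implicit none
        integer, intent(in)  :: n, index(0:n), item(:)
        real(8), intent(in)  :: A(:), u(n)
        real(8), intent(out) :: f(n)
        integer :: i, k
        do i = 1, n
          f(i) = 0.d0
          do k = index(i-1)+1, index(i)
            f(i) = f(i) + A(k)*u(item(k))      ! random access per nonzero
          enddo
        enddo
      end subroutine spmv

      ! GSpMV: nv time-step vectors per unknown, stored contiguously.
      subroutine gspmv(n, nv, index, item, A, u, f)
        implicit none
        integer, intent(in)  :: n, nv, index(0:n), item(:)
        real(8), intent(in)  :: A(:), u(nv,n)
        real(8), intent(out) :: f(nv,n)
        integer :: i, k, v
        do i = 1, n
          f(:,i) = 0.d0
          do k = index(i-1)+1, index(i)
!$omp simd
            do v = 1, nv
              f(v,i) = f(v,i) + A(k)*u(v,item(k))  ! contiguous, SIMD-efficient
            enddo
          enddo
        enddo
      end subroutine gspmv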
41
Standard solver algorithm:
1: set K
2: for each time step i = 1, ..., nt
3:   guess u_i using a standard predictor
4:   set f_i using the previous solution
5:   solve K u_i = f_i from this initial guess (iterative solver with the SpMV kernel)
6: end

Developed algorithm (in the case of 4 vectors):
1: set K and its blocked form for 4 vectors
2: for every 4 time steps
3:   set f_i, ..., f_{i+3} using the previous solutions
4:   guess u_i, ..., u_{i+3} using a standard predictor
5:   while (not converged) do
6:     guess u_{i+1}, ..., u_{i+3} using u_i
7:     refine the solutions u_i, ..., u_{i+3} from these initial guesses (iterative solver with the GSpMV kernel)
8:   end while
9: end
Performance of the New Algorithm (HPC Asia 2018 Best Paper)
• Tested on a linear wave propagation problem solved using a conjugate gradient solver with 3x3 block-diagonal preconditioning (double precision)
  – Number of iterations reduced from 9,068 to 3,541 by the developed method
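For reference, a minimal sketch of applying such a 3x3 block-diagonal preconditioner (names and storage layout are hypothetical, not the authors' code): each node has 3 DOF, and Minv stores the inverted 3x3 diagonal block of that node.

      ! z = M^{-1} r with a 3x3 block-diagonal (block Jacobi) preconditioner
      subroutine block_diag_precond(nnode, Minv, r, z)
        implicit none
        integer, intent(in)  :: nnode
        real(8), intent(in)  :: Minv(3,3,nnode), r(3,nnode)
        real(8), intent(out) :: z(3,nnode)
        integer :: i
!$omp parallel do
        do i = 1, nnode
          z(1,i) = Minv(1,1,i)*r(1,i) + Minv(1,2,i)*r(2,i) + Minv(1,3,i)*r(3,i)
          z(2,i) = Minv(2,1,i)*r(1,i) + Minv(2,2,i)*r(2,i) + Minv(2,3,i)*r(3,i)
          z(3,i) = Minv(3,1,i)*r(1,i) + Minv(3,2,i)*r(2,i) + Minv(3,3,i)*r(3,i)
        enddo
      end subroutine block_diag_precond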
Measured systems and total elapsed time:

  System                Configuration                              Peak DP FLOPS   Peak memory bandwidth                Without time-parallel   With time-parallel   Speedup
  Skylake-SP Xeon Gold  20 cores x 2 sockets x 2 nodes @ 2.4 GHz   3072 GFLOPS     512 GB/s                             326.2 s                 229.4 s              1.42x
  Haswell Xeon E5       12 cores x 2 sockets x 4 nodes @ 2.6 GHz   3993.6 GFLOPS   544 GB/s                             314.5 s                 224.8 s              1.40x
  Oakforest-PACS        68 cores x 2 nodes @ 1.4 GHz               6092.8 GFLOPS   980 GB/s (MCDRAM), 230 GB/s (DDR4)   233.0 s                 138.0 s              1.69x
  K computer            8 cores x 8 nodes @ 2.0 GHz                1024 GFLOPS     512 GB/s                             365.6 s                 244.3 s              1.50x
Weak Scaling on Oakforest-PACS (HPC Asia 2018 Best Paper)
• Scalability improved due to a reduced number of communication messages
  – 91.3% weak-scaling efficiency from 256 to 8,192 nodes
  – 1.55x speedup with the new algorithm on the full system
[Figure: total elapsed time vs. number of compute nodes (256, 512, 1024, 2048, 4096, 8192). With time-parallel: 153.3 s at 256 nodes up to 168.0 s at 8,192 nodes; without time-parallel: 229.2 s at 256 nodes up to 262.0 s at 8,192 nodes; intermediate values 151.6-162.2 s (with) and 230.8-255.3 s (without).]
Strong Scaling on Oakforest-PACS
• 1.5x to 1.6x speedup for all node counts tested
44
[Figure: strong scaling on Oakforest-PACS, 128-2,048 compute nodes; total elapsed time (16-512 s, log scale) with and without the time-parallel algorithm.]
OFP is better than K for weak scaling, but ...
45
[Figure: weak-scaling comparison. Left (OFP): total elapsed time for 256-8,192 nodes, as in the previous figure. Right (K computer): total elapsed time for 1,024-32,768 nodes; labeled values: 251.2 s [18.2%] (69.9%), 356.1 s [7.6%] (70.6%), 311.5 s [14.5%] (55.7%), 454.9 s [6.0%] (55.5%); 4.19 PF and 3.07 PF are noted on the chart.]
• NICAM-COCO
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
  – GMG: Geometric Multigrid
    • Nakajima
  – SA-AMG: Smoothed Aggregation Algebraic Multigrid
• H-Matrix
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
Optimization of Parallel Communication: CGA (Coarse Grid Aggregation) -> hCGA
[Figure: multigrid hierarchy from Level=1 (fine) through Level=m (coarse; mesh # per MPI process = 1); coarse-grid solver on a single core (further multigrid); original CGA vs. hCGA.]
Groundwater flow simulation with up to 4,096 nodes on Fujitsu FX10 (GMG-CG), up to 17,179,869,184 meshes (64^3 meshes/core)
48
[Figure: hCGA weak and strong scaling on FX10. One panel shows elapsed time (sec., 5.0-15.0) vs. core count (100-100,000) for CGA vs. hCGA, with an x1.61 improvement; the other shows parallel performance (%, 0-120) vs. core count (1,024-65,536) for Flat MPI cases C3 and C4, with an x6.27 improvement.]
Preliminary Results on OFP with 8,192 nodes, up to 34B DOF
Performance of the coarse-grid solver is not good.
49
[Figure: elapsed time (sec., 2.5-15.0) vs. node count (10-10,000) for Flat MPI: FX10, Flat MPI: FX10-hCGA, HB 4x16: OFP, HB 4x16: OFP-hCGA; and elapsed time (sec., 2.0-8.0) vs. node count (100-10,000) for HB 4x16: OFP, HB 4x16: OFP-hCGA, HB 8x8: OFP, HB 16x4: OFP.]
Preliminary Results on OFP with 8,192 nodes, up to 34B DOF
Performance of the coarse-grid solver is not good.
50
[Figure: elapsed time (sec., 2.0-8.0) vs. node count (100-10,000) on OFP for HB 4x16, HB 4x16-hCGA, HB 8x8, HB 16x4; and elapsed time (sec., 5.0-15.0) vs. core count (100-100,000) on FX10 for Flat MPI: C3, HB 4x4: C3, HB 8x2: C3, HB 16x1: C3.]
AM-hCGA: Adaptive Multilevel hCGA
• hCGA has only 2 layers; more hierarchical levels are needed for larger process counts
[Figure: three-layer hierarchy (1st, 2nd, 3rd layers).]
• NICAM-COCO
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
  – Ida, Hoshino
  – HPC Asia 2018, IPDPS 2018, ICCS 2018
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
52
Approximation for dense matrices: ℋ-matrices reduce the O(N^2) cost to O(N log N).
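The saving comes from approximating admissible off-diagonal blocks by low-rank products; a standard sketch of the estimate (general ℋ-matrix reasoning, not taken from the HACApK sources):

  A_{ts} \approx U_t V_s^{T}, \qquad U_t \in \mathbb{R}^{m \times k},\; V_s \in \mathbb{R}^{n \times k},\; k \ll \min(m,n)

so storing and multiplying a block costs k(m+n) instead of mn; summed over the block cluster tree this gives O(N log N) memory and matrix-vector work in place of O(N^2).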
53
Enhancement of ℋACApK: Library for Simulations using BEM
• Programming model: MPI + OpenMP
• Available functions: ℋ-matrix construction, ℋ-matrix-vector multiplication, iterative linear solver
• Top 3 achievements in 2017
  – Porting to clusters equipped with KNL and GPUs
  – New framework design for SIMD vectorization
  – Improvement of ℋ-matrices for MPP
• Example analyses
• Future work (enhancement issues)
  – Complex ℋ-matrix arithmetic on clusters equipped with KNL and GPUs: matrix-matrix multiplication, LU factorisation, matrix inversion
  – Eigenvalue problems (application to deep learning)
Library Design of ℋACApK
• The user program (Main) calls the ℋACApK API (parallel ℋ-matrix generation, parallel ℋ-matrix linear solver); the ℋACApK API calls a user-defined function that computes the (i,j) element of the coefficient matrix.
• The user-defined function depends on the target integral equation and is a black box to the library. Does the SIMD directive below work well?

      real(8) function user_func(i,j,st_bemv)
        integer :: i,j
        type(hacapkInput) :: st_bemv
        ! calculate the (i,j) element of the coefficient matrix
      end function user_func

      subroutine hmat_generation(st_bemv)
        type(hacapkInput) :: st_bemv
!$omp parallel do
        do submat = 1, numSubmat
          do j = stl, enl
!$omp simd
            do i = str, enr
              a(i,j) = user_func(i,j,st_bemv)
            end do
          end do
        end do
      end subroutine hmat_generation
54
New Framework Design for SIMD Vectorization
• Two interfaces for vectorization: data access (set_args) and computation (vectorize_func). The framework automatically inserts the 'declare simd' directive (auto-transformation); users do not have to consider SIMD.
• The data-access loop is executed sequentially; the computation loop is obviously vectorizable.

      real(8) function vectorize_func(arg1,arg2,…)
!$omp declare simd(vectorize_func) &
!$omp simdlen(SIMDLENGTH) &
!$omp linear(ref(arg1, arg2, …))
        real(8), intent(in) :: arg1,arg2,…
        ! User-defined calculation
      end function vectorize_func

      subroutine set_args(i,j,st_bemv,arg1,arg2,...)
        integer, intent(in) :: i, j
        type(hacapkInput) :: st_bemv
        real(8), intent(out) :: arg1,arg2,...
        ! Data copy from structure to scalar arguments
      end subroutine set_args

      subroutine hmat_generation(st_bemv)
        type(hacapkInput) :: st_bemv
        real(8),dimension(SIMDLENGTH) :: ans
        real(8),dimension(SIMDLENGTH) :: arg1,arg2,...
        do j = stl, enl
          do i = str, enr, SIMDLENGTH
            ! data access: this loop is sequentially executed
            ii = 1
            do jj = i, min(i+SIMDLENGTH-1, enr)
              call set_args(jj,j,st_bemv,arg1(ii),arg2(ii),…)
              ii = ii+1
            end do
            ! computation: this loop is obviously vectorizable
!$omp simd
            do ii = 1, SIMDLENGTH
              ans(ii) = vectorize_func(arg1(ii),arg2(ii),…)
            end do
            ii = 1
            do jj = i, min(i+SIMDLENGTH-1, enr)
              a(jj,j) = ans(ii)
              ii = ii+1
            end do
          end do
        end do
      end subroutine hmat_generation

T. Hoshino et al., "Design of Parallel BEM Analyses Framework for SIMD Processors" (ICCS 2018)
55
Numerical Evaluations
• Test model of electrostatic field analysis: two cases, a perfectly conducting sphere and a dielectric sphere, each at 1 V and 0.25 m above the ground plane.
• The governing boundary integral equations involve surface integral operators over the domain Ω (including a double-layer potential D[u]); the user-defined functions depend on these integral equations.
[Figure: elapsed time (0-16 sec) for coefficient ℋ-matrix generation on BDW and KNL, original vs. new SIMD design, for the perfectly conducting and dielectric cases (the user-defined functions include branch divergence). Speedups: 2.2x (perfect, BDW), 1.9x (dielectric, BDW), 4.3x (perfect, KNL), 4.1x (dielectric, KNL).]

Evaluation environments:
• BDW: Intel Xeon E5-2695 v4, 18 cores
• KNL: Intel Xeon Phi 7250, 68 cores
• Compiler: Intel Compiler 18.0.1, -qopenmp -O3 -ipo -align array64byte, -xAVX2 (BDW) / -xMIC-AVX512 (KNL)

Performance comparison:
• BDW: approximately 2x speedup
• KNL: over 4x speedup
• For dense matrix generation, the new design achieved up to a 6.6x speedup
56
• NICAM-COCO
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
• Dynamic Loop Scheduling
  – Nakajima
  – ICPP 2017 (P2S2)
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
57
Communication-Computation Overlapping (CC-Overlapping): Static

      call MPI_Isend
      call MPI_Irecv
      do i= 1, Ninn
        (calculations on pure internal meshes)
      enddo
      call MPI_Waitall
      do i= Ninn+1, Nall
        (calculations on boundary meshes)
      enddo

• Good for stencil computations; not so effective for SpMV
• Mesh categories: pure internal meshes, internal meshes on boundaries (boundary meshes), external (HALO) meshes
Communication-Computation Overlapping + Dynamic Loop Scheduling: Dynamic CC-Overlapping

      call MPI_Isend
      call MPI_Irecv
      call MPI_Waitall
      do i= 1, Ninn
        (calculations on pure internal meshes)
      enddo
      do i= Ninn+1, Nall
        (calculations on boundary meshes)
      enddo

• Communication is handled by the master thread; the internal-mesh loop uses dynamic scheduling, the boundary-mesh loop uses static scheduling
• Mesh categories: pure internal meshes, internal meshes on boundaries (boundary meshes), external (HALO) meshes
60
Dynamic Loop Scheduling
• schedule(dynamic) and "!$omp master ... !$omp end master"

!$omp parallel private (neib,j,k,i,X1,X2,X3,WVAL1,WVAL2,WVAL3)
!$omp&         private (istart,inum,ii,ierr)
!$omp master                          Communication is done by the master thread (#0)
!C
!C-- Send & Recv.
      (…)
      call MPI_WAITALL (2*NEIBPETOT, req1, sta1, ierr)
!$omp end master
!C                                    The master thread can join computing of internal
!C-- Pure Internal Nodes              nodes after the completion of communication
!$omp do schedule (dynamic,200)       Chunk size = 200
      do j= 1, Ninn
      (…)
      enddo
!C
!C-- Boundary Nodes                   Computing for boundary nodes is done by all threads
!$omp do                              default: !$omp do schedule (static)
      do j= Ninn+1, N
      (…)
      enddo
!$omp end parallel
Ina, T., Asahi, Y., Idomura, Y., Development of optimization of stencil calculation on Tera-flops many-core architecture, IPSJ SIG Technical Reports 2015-HPC-152-10, 2015 (in Japanese)
61
  Code Name                  KNL                        BDW                        FX10
  Architecture               Intel Xeon Phi 7250        Intel Xeon E5-2695 v4      SPARC64 IXfx
                             (Knights Landing)          (Broadwell-EP)
  Frequency (GHz)            1.40                       2.10                       1.848
  Core # (Max Thread #)      68 (272)                   18 (18)                    16 (16)
  Peak Performance (GFLOPS)  3,046.4                    604.8                      236.5
  Memory (GB)                MCDRAM: 16, DDR4: 96       128                        32
  Memory Bandwidth
  (GB/sec, Stream Triad)     MCDRAM: 490, DDR4: 80.1    65.5                       64.7
  Out-of-Order               Y                          Y                          N
  System                     Oakforest-PACS             Reedbush-U                 Oakleaf-FX
62
  Code Name                       KNL                        BDW                        FX10
  Architecture                    Intel Xeon Phi 7250        Intel Xeon E5-2695 v4      SPARC64 IXfx
                                  (Knights Landing)          (Broadwell-EP)
  Frequency (GHz)                 1.40                       2.10                       1.848
  Core # (Max Thread #)           68 (272)                   18 (18)                    16 (16)
  Peak Performance (GFLOPS)/core  44.8                       33.6                       14.8
  Memory Bandwidth
  (GB/sec, Stream Triad)/core     MCDRAM: 7.21, DDR4: 1.24   3.64                       4.04
  Out-of-Order                    Y                          Y                          N
  Network                         Omni-Path Architecture     Mellanox EDR InfiniBand    Tofu 6D Torus
63
GeoFEM/Cube
• Parallel FEM code (& benchmarks)
• 3D static elastic linear (solid mechanics)
• Performance of parallel preconditioned iterative solvers
  – 3D tri-linear hexahedral elements
  – Block-diagonal LU + CG
  – Fortran90 + MPI + OpenMP
  – Distributed data structure
  – MPI, OpenMP, OpenMP/MPI hybrid
  – Block CRS format (see the sketch below)
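A minimal sketch of a matrix-vector product in such a 3x3 block CRS format (array names and layout are assumptions, not the GeoFEM source): D holds the diagonal 3x3 blocks, AL the off-diagonal 3x3 blocks in CRS order, indexL the row pointers, and itemL the column (node) indices.

      subroutine bcrs_matvec(n, indexL, itemL, AL, D, x, y)
        implicit none
        integer, intent(in)  :: n, indexL(0:n), itemL(:)
        real(8), intent(in)  :: AL(:,:,:), D(3,3,n), x(3,n)
        real(8), intent(out) :: y(3,n)
        integer :: i, k, jc
!$omp parallel do private(k,jc)
        do i = 1, n
          ! diagonal 3x3 block of node i
          y(:,i) = matmul(D(:,:,i), x(:,i))
          ! off-diagonal 3x3 blocks of row i
          do k = indexL(i-1)+1, indexL(i)
            jc = itemL(k)
            y(:,i) = y(:,i) + matmul(AL(:,:,k), x(:,jc))
          enddo
        enddo
      end subroutine bcrs_matvec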
[Figure: cubic FEM model with (Nx-1) x (Ny-1) x (Nz-1) tri-linear hexahedral elements (Nx x Ny x Nz nodes); boundary conditions Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin; uniform distributed force in the z-direction @ z=Zmax. Each element has 8 local nodes at (±1, ±1, ±1) in (ξ, η, ζ), and each node pair (i,j) contributes a 3x3 block a_ij to the element matrix (e = 1, ..., 8).]
Preliminary Results: Best Cases (3,840 cores, 368,640,000 DOF)
Improvement of CG from the original cases
[Figure: speed-up (%) over the original cases (-10% to +25%) for FX10: HB 16x1, BDW: HB 8x2, and KNL: HB 64x2 (2T).]
Preliminary Results: Original Cases (3,840 cores, 368,640,000 DOF)
Communication overhead by collective/point-to-point communications
[Figure: breakdown (0-100%) of Rest / Send-Recv / Allreduce; labeled communication fractions: 7.5% (FX10: HB 16x1), 4.0% (BDW: HB 8x2), 14.6% (KNL: HB 64x2 (2T)).]
66
  Features   Effect of Dynamic Scheduling   Optimum Chunk Size   Notes
  FX10       Medium                         100                  Memory throughput
  BDW        Small                          500+                 Low communication overhead; small number of threads
  KNL        Large                          300-500              Effects are significant for HB 64x2 and 128x1, where the loss of performance due to communications on the master thread is rather small
67
Target Problem
• 256^3 FEM nodes, 50,331,648 DOF
• Strong scaling
  – FX10: 2-1,024 nodes (32-16,384 cores)
  – BDW: 2-512 sockets, 1-256 nodes (32-8,192 cores); Reedbush-U has only 420 nodes of BDW
  – KNL: 4-256 nodes (256-16,384 cores)
• Parallel performance is plotted as speed-up vs. core count relative to the ideal line
  – 100%: on the ideal line
  – < 100%: below the ideal line
  – > 100%: above the ideal line
Strong Scaling: Parallel Performance (%), best case for each HB MxN, 50,331,648 DOF
Computation time of flat MPI at the minimum core count = 100%
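One consistent reading of this metric (my formulation; the slide states only the normalization): with C_0 the minimum core count and T_flat(C_0) the flat-MPI computation time there,

  \mathrm{Performance}(C) = \frac{C_0 \, T_{\mathrm{flat}}(C_0)}{C \, T(C)} \times 100\,\%

so a run on the ideal speed-up line gives 100%, runs below the line give less than 100%, and super-ideal runs exceed 100%.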
68
[Figure: strong-scaling parallel performance (%) vs. total core count.
FX10: 32-16,384 cores; Flat MPI: Original, HB 2x8: Csz=500, HB 4x4: Csz=100, HB 8x2: Csz=100, HB 16x1: Csz=500.
BDW: 32-8,192 cores; Flat MPI: Static, HB 2x8: Csz=500, HB 4x4: Csz=500, HB 8x2: Csz=100, HB 16x1: Csz=500.
KNL: 256-16,384 cores; Flat MPI: Original, HB 2x32: Csz=500, HB 4x16: Csz=100, HB 8x8: Csz=100, HB 16x4: Csz=500, HB 32x2: Csz=500, HB 64x1: Csz=500.]
BDW (Reedbush): InfiniBand EDR; KNL (OFP): Omni-Path Architecture. Problems with the performance of MPI on OPA.
Strong Scaling: Parallel Performance (%), effect of dynamic loop scheduling, 50,331,648 DOF
Computation time of flat MPI at the minimum core count = 100%
[Figure: parallel performance (%) vs. total core count for FX10: HB 16x1 (32-16,384 cores), BDW: HB 8x2 (32-8,192 cores), and KNL: HB 64x1 (1T) (256-16,384 cores), comparing Original, Static, Csz=100, and Csz=500.]
Effect of Dynamic Loop Scheduling with more than 8,192 cores
• FX10: 20%-40%
• BDW: 6%-10%
• KNL with HB 8x8 (1T): 20%-30%
• KNL with HB 64x1 (1T): 40%-50%
Summary
• Excellent performance improvement by CC-overlapping + dynamic loop scheduling
• OFP
  – Problems in strong scaling: general issues
  – Performance of MPI_Allreduce for many MPI processes
  – Improvement of Intel MPI
    • Collaboration with Intel engineers
    • Multiple endpoints
70
Preliminary Results of Multiple Endpoints
• Good for larger-scale problems
• Not good for strong scaling
71
• Overview of IPCC at ITC/U.Tokyo
• Achievements in FY.2017
• Future Perspective
72
Near-Future Plan of ITC/U.Tokyo
• Development of new research areas is needed
  – Mostly CSE (Computational Science & Engineering) so far
  – Data analysis, deep learning, artificial intelligence
    • Genome analysis, medical image recognition: already started with U.Tokyo Hospital and the Research Organization for Genome Medical Science
    • Real-time data analysis for the GOSAT satellite (2008-2014) on T2K
• Integration of CSE & data science
  – Users in atmosphere/ocean science, earthquake science/engineering, fluid/structure simulations, and materials science utilize observed/experimental data for validation of simulations
  – Development of methods & infrastructure for efficient utilization of "Big Data"
  – Data-driven approach
• BDEC (Big Data & Extreme Computing) with 50+ PF after Fall 2020
  – Reedbush is the prototype of the BDEC system
73
Plans in 2018/2019 (1/3)
• Further optimization of sparse linear solvers
  – Serial/parallel performance
  – Point-to-point/collective communications in MPI
    • Improvement of MPI, e.g. multiple endpoints (with Dr. Horikoshi)
• Large-scale simulations
  – NICAM-COCO (atmosphere-ocean)
    • El Niño simulations with a 3.5 km-0.1 deg model
  – Seism3D (seismic wave propagation, FDM)
    • Loop tiling for stencil computations in FDM
    • Space + time
    • Higher density of computation
74
Plans in 2018/2019 (2/3)
• New research towards the exascale/post-Moore era
  – Two proposals to the Japanese government are under review (2nd round) (next page)
  – Heterogeneous architectures
    • CPU, GPU, FPGA, quantum/neuromorphic/custom chips
  – Integration of simulations/data/learning
  – Mixed precision/approximate computing for lower energy consumption
    • JHPCN project in FY.2018
  – New systems: Oakbridge-II (March 2019), BDEC (Fall 2020)
75
• Accelerating Cambrian Explosion of Computing Principles and Systems
  – Grant-in-Aid for Scientific Research on Innovative Areas
  – 9+1 teams, 1.16 BJPY (FY.2018-2022) (2nd-round review)
  – Leading PI: Satoshi Matsuoka (Tokyo Tech -> RIKEN)
  – 4 teams of applications and algorithms
    • Nakajima (U.Tokyo): application development framework
    • Iwashita (Hokkaido): high-performance algorithms
    • Katagiri (Nagoya): low-power/low-accuracy computing, accuracy verification, AT
    • Shimokawabe (U.Tokyo): machine learning, DDA
• Innovative Method for Integration of Computational & Data Sciences in the Exascale Era
  – Grant-in-Aid for Scientific Research (S)
  – 0.20 BJPY (5 years, FY.2018-2022) (2nd-round review)
  – PI: Kengo Nakajima (U.Tokyo)
76
BDEC System (1/3)
• After July 2020
• 60+ PF, ~5.0 MW (w/o A/C), ~1,000 m2
  – External Nodes for Data Acquisition/Analysis (EXN)
  – Internal Nodes for CSE/Data Analysis (INN)
  – Shared File System (SFS)
• External nodes: EXN, "data" nodes
  – 5-10 PFLOPS
  – SFS, individual file cache system
  – Direct access to external networks, real-time acquisition
• Internal nodes: INN, "compute" nodes
  – 50+ PFLOPS, could be different from EXN
  – 1+ PB memory (includes NV), 5+ PB/sec
  – SFS, individual file cache system
BDEC System (2/3): 60+ PF, ~5.0 MW (w/o A/C), ~1,000 m2
[Diagram: Internal Nodes (INN) for compute (50+ PF, 1+ PB, 5+ PB/sec) and External Nodes (EXN) for data (5-10 PF), both connected to the Shared File System (SFS, 60+ PB, 1+ TB/sec) and a file cache system; EXN also connects to external resources and the external network.]
BDEC System (3/3)
• Architectures of EXN and INN could be different
  – EXN could include some GPU, FPGA, quantum, neuromorphic, etc.
• Network between EXN and INN
  – EXN and INN do not necessarily cooperate with each other
  – EXN is the very special and untraditional part
    • On-line data-driven approaches run on EXN
• Shared File System: SFS
  – 50+ PB, 1+ TB/sec
  – Both EXN and INN can access SFS directly
• Possible applications: data-driven approach
  – Atmosphere-ocean simulations with data assimilation
  – Earthquake simulations with data assimilation
    • O(10^3) points for measurement, MHz, more accurate UG model
  – Real-time simulations (e.g. flood, earthquakes, tsunami)
79
80
Real-time assimilation of seismic observation data and forecasting of ground shaking by fast simulation
• Observation networks: strong-motion network (~2,000 points), seismic-intensity network (~4,000 points), and lifeline operators (electricity, gas, railways, etc.; tens of thousands of points?), e.g. the Tokyo Gas ultra-high-density seismic disaster-prevention system (4,000 points)
• Seismic observation data from Japan's dense strong-motion/intensity networks and from private operators are collected over a high-speed network, and ground-motion simulations run continuously on a supercomputer to assimilate the shaking data for the Japanese islands.
• Once earthquake shaking is observed, the computation is accelerated to predict and warn of strong shaking and long-period ground motion at each location within a few seconds; data assimilation then continues and the forecasts (Forecast 1, Forecast 2, ..., Forecast n) are updated.
• By predicting the actual "strong shaking" rather than the "seismic intensity" value of the current Earthquake Early Warning, damage to buildings, etc. can be detected and linked to evacuation response.
• Goal: prediction of strong and long-period ground motion and disaster mitigation via data assimilation of dense seismic observations across the Japanese islands
• Current computation scale: K computer, 2,000 nodes, 4 hours
[Furumura, ERI/U.Tokyo]
Plans in 2018/2019 (3/3)
• Further development & library work
  – Parallel reordering (Kawai (RIKEN))
  – Multigrid solvers
    • JHPCN project in FY.2018
  – H-matrices
  – Mixed precision/approximate computing
• Student support
  – Parallel multigrid (Nomura)
  – ChainerMN (Tamura)
• Tutorials and classes
  – OpenFOAM
  – etc.
81
Parallel Reordering for ILU Preconditioners
ESSEX-II: Equipping Sparse Solvers for Exascale
• To support exascale systems with ILU preconditioners, we proposed a hierarchical parallelization of multi-coloring:
  – Step 1: separate the elements into several parts
  – Step 2: create and color a new graph
  – Step 3: scatter the coloring result
  – Step 4: color the elements in parallel, based on the colored areas
• The number of iterations was almost constant for every node count on a graphene model (500 million DoF)
• Environment
  – Oakleaf-FX (SPARC64 IXfx), 128-4,800 nodes
  – Block ICCG
  – Parallelizing 10-color AMC
  – Block size = 4, diagonal shifting = 100