Activities of IPCC at the Information Technology Center, The University of Tokyo
(based on the Showcase Presentation on May 5, 2018)
Kengo Nakajima
Lead, Supercomputing Research Division, Information Technology Center, The University of Tokyo, Japan & JCAHPC
Deputy Director, RIKEN R-CCS
Intel Parallel Computing Centers Asia Summit 2018, May 10-11, 2018, Chengdu, China
• Overview of IPCC at ITC/U.Tokyo
  – IPCC since 2014
• Achievements in FY.2017
• Future Perspective
2
Information Technology Center, The University of Tokyo (IPCC since 2014)
– Kengo Nakajima, PI
– Toshihiro Hanawa
– Takashi Shimokawabe
– Akihiro Ida
– Masatoshi Kawai (-> RIKEN)
– Eishi Arima
– Tetsuya Hoshino
– Yohei Miki
– Kenjiro Taura
– Masashi Imano
Earthquake Research Institute, The University of Tokyo
– Takashi Furumura– Tsuyoshi Ichimura– Kohei Fujita
FAU, Germany– Gerhard Wellein
Atmosphere & Ocean Research Institute, The University of Tokyo
– Masaki Satoh– Hiroyasu Hasumi
RIKEN Advanced Institute for Computational Science (AICS)
– Hisashi Yashiro
Japan Agency for Marine-Earth Science & Technology (JAMSTEC)
– Masao Kurogi
Research Organization for Information Science & Technology (RIST)
– Takashi Arakawa
JAMA (Japan Automobile Manufacturers Association)
Supercomputers in ITC/U.Tokyo: 2 big systems, 6-yr. cycle
4
[Figure: system deployment timeline, FY2011-FY2025]
– Yayoi: Hitachi SR16000/M1 (IBM Power-7), 54.9 TFLOPS, 11.2 TB
– T2K Tokyo: 140 TFLOPS, 31.3 TB
– Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx), 1.13 PFLOPS, 150 TB
– Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
– Reedbush (HPE, Broadwell + Pascal): 1.93 PFLOPS (Integrated Supercomputer System for Data Analyses & Scientific Simulations)
– Reedbush-L (HPE): 1.43 PFLOPS (Supercomputer System with Accelerators for Long-Term Executions)
– Oakforest-PACS (Fujitsu, Intel KNL): 25 PFLOPS, 919.3 TB (JCAHPC: Tsukuba, Tokyo)
– Oakbridge-II (Intel/AMD/P9, CPU only): 5+ PFLOPS
– BDEC System (Big Data & Extreme Computing): 50+ PFLOPS (?)
Now operating 2 (or 4) systems!!
• Oakleaf-FX (Fujitsu PRIMEHPC FX10)
  – 1.13 PF, commercial version of K, Apr.2012 – Mar.2018
• Oakbridge-FX (Fujitsu PRIMEHPC FX10)
  – 136.2 TF, for long-time use (up to 168 hr), Apr.2014 – Mar.2018
• Reedbush (HPE, Intel BDW + NVIDIA P100 (Pascal))
  – Integrated Supercomputer System for Data Analyses & Scientific Simulations
  – Jul.2016 – Jun.2020
  – Our first GPU system, DDN IME (Burst Buffer)
  – Reedbush-U: CPU only, 420 nodes, 508 TF (Jul.2016)
  – Reedbush-H: 120 nodes, 2 GPUs/node: 1.42 PF (Mar.2017)
  – Reedbush-L: 64 nodes, 4 GPUs/node: 1.43 PF (Oct.2017)
• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL))
  – JCAHPC (U.Tsukuba & U.Tokyo)
  – 25 PF, #9 in 50th TOP500 (Nov.2017) (#2 in Japan)
  – Omni-Path Architecture, DDN IME (Burst Buffer)
5
JPY (=Watt)/(Peak) GFLOPS rate: smaller is better (more efficient)

  System                                                      JPY/GFLOPS
  Oakleaf/Oakbridge-FX (Fujitsu PRIMEHPC FX10)                 125
  Reedbush-U (HPE, Intel BDW)                                  62.0
  Reedbush-H (HPE, Intel BDW + NVIDIA P100)                    17.1
  Oakforest-PACS (Fujitsu, Intel Xeon Phi/Knights Landing)     16.5
Oakforest-PACS: OFP
• Full operation started on December 1, 2016
• 8,208 Intel Xeon Phi (KNL) nodes, 25 PF peak performance
  – Fujitsu
• TOP500 #9 (#2 in Japan), HPCG #6 (#2) (Nov. 2017)
• JCAHPC: Joint Center for Advanced High Performance Computing
  – University of Tsukuba
  – University of Tokyo
• The system is installed at the Kashiwa-no-Ha (Leaf of Oak) Campus of U.Tokyo, which is between Tokyo and Tsukuba
  – http://jcahpc.jp
7
Software of Oakforest-PACS
• OS: Red Hat Enterprise Linux (login nodes); CentOS or McKernel (compute nodes, dynamically switchable)
  – McKernel: OS for many-core CPUs developed by RIKEN AICS
  – Ultra-lightweight OS compared with Linux; no background noise to user programs
  – Expected to be deployed on the post-K computer
• Compiler: GCC, Intel Compiler, XcalableMP
  – XcalableMP: parallel programming language developed by RIKEN AICS and the University of Tsukuba
  – Easy to develop high-performance parallel applications by adding directives to original code written in C or Fortran (see the sketch below)
• Applications: open-source software
  – ppOpen-HPC, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, and so on
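A minimal sketch of this directive-based style, assuming the standard XcalableMP directive set (nodes/template/distribute/align/loop); the program and array names are hypothetical and not taken from the OFP documentation:

      ! Minimal XcalableMP (Fortran) sketch: the serial loop below is
      ! distributed across nodes only by adding directives.
      program xmp_sketch
      implicit none
      integer, parameter :: n = 1000
      real(8) :: a(n), b(n)
      integer :: i
!$xmp nodes p(*)
!$xmp template t(n)
!$xmp distribute t(block) onto p
!$xmp align a(i) with t(i)
!$xmp align b(i) with t(i)
!$xmp loop on t(i)
      do i = 1, n
        b(i) = dble(i)
        a(i) = 2.0d0*b(i) + 1.0d0   ! each node computes only its own block
      enddo
      end program xmp_sketch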
Applications on OFP
• ARTED – Ab-initio Electron Dynamics
• Lattice QCD – Quantum Chromodynamics
• NICAM & COCO – Atmosphere & Ocean Coupling
• GAMERA/GHYDRA – Earthquake Simulations
• Seism3D – Seismic Wave Propagation
ppOpen-HPC: Overview (Primary Target of IPCC Activities)
• Application framework with automatic tuning (AT); "pp": post-peta-scale
• Five-year project (FY.2011-2015, since April 2011)
  – P.I.: Kengo Nakajima (ITC, The University of Tokyo)
  – Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST/CREST (Supervisor: Prof. Mitsuhisa Sato, RIKEN AICS)
• Target: Oakforest-PACS (original schedule: FY.2015); could be extended to various types of platforms
• Team with 7 institutes, >50 people (5 PDs) from various fields: co-design
• Open-source software: http://ppopenhpc.cc.u-tokyo.ac.jp/, English documents, MIT License
11
12
Featured Developments
• ppOpen-AT: AT language for loop optimization
• HACApK library for H-matrix computation in ppOpen-APPL/BEM (OpenMP/MPI hybrid version)
  – First open-source H-matrix library with OpenMP/MPI hybrid parallelization
• ppOpen-MATH/MP (coupler for multiphysics simulations, loose coupling of FEM & FDM)
• Sparse linear solvers
13
ESSEX-II: 2nd Phase of ppOpen-HPC, Japan-Germany Collaboration (FY.2016-2018)
http://blogs.fau.de/essex/
• JST-CREST (Japan) and DFG-SPPEXA (Germany)
• ESSEX: Equipping Sparse Solvers for Exascale (FY.2013-15)
• ESSEX + Tsukuba (Prof. Sakurai) + U.Tokyo (ppOpen-HPC) = ESSEX-II
  – Leading PI: Prof. Gerhard Wellein (U. Erlangen)
• Mission of our group: preconditioned iterative solvers for eigenvalue problems in quantum science
  – DLR (German Aerospace Research Center)
  – U. Wuppertal, U. Greifswald, Germany
14
15
Activities in FY.2016/2017 (1/4)
• Linear solvers/FEM matrix assembly in ppOpen-HPC
  – Done in FY.2016/2017; further optimization needed
  – Results shown for FEM matrix assembly
• 3D seismic simulations
  – Seism3D: FDM
    • Not yet done
  – GOJIRA/GAMERA/GHYDRA: FEM
    • Done in FY.2016/2017, excellent performance
    • HPC Asia 2018 Best Paper, IPDPS 2018
    • Not accepted for SC17 Gordon Bell
• ppOpen-MATH/MP (coupler)
  – Done in FY.2016/2017
  – El Niño simulations on the K computer
• Atmosphere-ocean coupling (NICAM-COCO)
  – Done: serial optimization in FY.2016, parallel optimization in FY.2017
  – Good scalability
16
Activities in FY.2016/2017 (2/4)
• Multigrid solver (ppOpen-MATH/MG)
  – Done; further optimization needed
• HACApK for H-matrices
  – 1st open-source OpenMP/MPI hybrid implementation
  – Done; further optimization needed
  – HPC Asia 2018, IPDPS 2018, ICCS 2018
• Other activities
  – Chebyshev filter diagonalization
    • Joint work with G. Wellein (FAU, Germany) et al. under the ESSEX-II project
    • Comparisons of OFP and Piz Daint
    • ISC-HPC 2018 Hans Meuer Award Finalist
  – Dynamic loop scheduling
    • Significant performance improvement (20+%)
    • Problems in strong scaling (MPI: point-to-point/collective communication)
    • ICPP 2017 (P2S2)
17
Activities in FY.2016/2017 (3/4)
• Other activities (cont.)
  – Algebraic multigrid method using near-kernel vectors
    • 3D solid mechanics applications
    • Preliminary evaluation on OFP (flat MPI)
    • SC17 Research Poster
  – ChainerMN
    • Deep learning framework for distributed parallel environments
    • Developed by PFN, Japan
    • Implementation on OFP
    • SIAM PP18 Poster
  – OpenFOAM
    • Widely used open-source CFD application
    • Pipelined algorithms
    • Development of an OpenMP/MPI hybrid version
  – Tutorials
    • OpenFOAM
    • Introduction to KNL
18
Publications in FY.2017 (1/2)
• K. Nakajima, T. Hanawa, Communication-Computation Overlapping with Dynamic Loop Scheduling for Preconditioned Parallel Iterative Solvers on Multicore/Manycore Clusters, IEEE Proceedings of the 10th International Workshop on Parallel Programming Models & Systems Software for High-End Computing (P2S2 2017), in conjunction with the 46th International Conference on Parallel Processing (ICPP 2017), Bristol, UK, 2017
• N. Nomura, K. Nakajima, A. Fujii, Robust SA-AMG Solver by Extraction of Near-Kernel Vectors, ACM Proceedings of SC17 (The International Conference for High Performance Computing, Networking, Storage and Analysis) (Research Poster), Denver, CO, USA, 2017
• M. Kawai, A. Ida, K. Nakajima, Hierarchical Parallelization of Multicoloring Algorithms for Block IC Preconditioners, IEEE Proceedings of the 19th International Conference on High Performance Computing and Communications (HPCC 2017), 138-145, Bangkok, Thailand, 2017
• K. Fujita, K. Katsushima, T. Ichimura, M. Horikoshi, K. Nakajima, M. Hori, L. Maddegedara, Wave Propagation Simulation of Complex Multi-Material Problems with Fast Low-Order Unstructured Finite-Element Meshing and Analysis, ACM Proceedings of HPC Asia 2018, Tokyo, Japan, 2018 (Best Paper Award)
• A. Ida, H. Nakashima, M. Kawai, Parallel Hierarchical Matrices with Block Low-rank Representation on Distributed Memory Computer Systems, ACM Proceedings of HPC Asia 2018, Tokyo, Japan, 2018
19
Publications in FY.2017 (2/2)
• M. Horikoshi, L. Meadows, T. Elken, P. Sivakumar, E. Mascarenhas, J. Erwin, D. Durnov, A. Sannikov, T. Hanawa, T. Boku, Scaling Collectives on Large Clusters using Intel(R) Architecture Processors and Fabric, Intel eXtreme Performance Users Group Workshop, in conjunction with HPC Asia 2018, Tokyo, Japan, 2018
• K. Tamura, T. Hanawa, Performance Evaluation of Large-Scale Deep Learning Framework ChainerMN on Oakforest-PACS, SIAM Conference on Parallel Processing for Scientific Computing (PP18), Tokyo, 2018
• T. Hoshino, A. Ida, T. Hanawa, K. Nakajima, Design of Parallel BEM Analyses Framework for SIMD Processors, Proceedings of the 2018 International Conference on Computational Science (ICCS 2018), 2018 (in press)
• T. Ichimura, K. Fujita, M. Horikoshi, L. Meadows, K. Nakajima et al., A Fast Scalable Implicit Solver with Concentrated Computation for Nonlinear Time-evolution Problems on Low-order Unstructured Finite Elements, Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), 2018 (in press)
• A. Ida, Lattice H-Matrices on Distributed-Memory Systems, IPDPS 2018, 2018 (in press)
• M. Kreutzer, D. Ernst, A.R. Bishop, H. Fehske, G. Hager, K. Nakajima, G. Wellein, Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs, Proceedings of ISC High Performance 2018 (Hans Meuer Award Finalist), 2018 (in press)
20
Major Invited Talks in FY.2017
• K. Nakajima, A. Ida, M. Kawai, T. Katagiri, Development of Large-Scale Scientific & Engineering Applications on Post-Peta/Exascale Systems using ppOpen-HPC, APSCIT 2017 Annual Meeting (Asia Pacific Society for Computing and Information Technology), Sapporo, Hokkaido, Japan, July 29, 2017
• K. Nakajima, Application Development Framework for Manycore Architectures on Post-Peta/Exascale Systems, ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, in conjunction with SC17 (The International Conference for High Performance Computing, Networking, Storage and Analysis), Denver, CO, USA, November 13, 2017
• K. Nakajima, Application Development Framework for Manycore Architectures - from Exascale to Post-Moore Era -, International Workshop for High Performance Numerical Algorithms and Applications (HPNAA), Sanya, China, January 8-12, 2018
• K. Nakajima, Computational Science and Engineering in Exascale and Post-Moore Era, JAMSTEC-DKRZ Workshop: Open Workshop: User Services at Earth Science HPC Centers, Yokohama, Japan, March 19, 2018
• Overview of IPCC at ITC/U.Tokyo
• Achievements in FY.2017
• Future Perspective
21
• NICAM-COCO
  – Yashiro, Kurogi, Arakawa, Nakajima
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
22
Atmosphere-Ocean Coupling on OFP by NICAM/COCO/ppOpen-MATH/MP
• A high-resolution global atmosphere-ocean coupled simulation by NICAM and COCO (ocean simulation) through ppOpen-MATH/MP has been achieved on the K computer.
  – ppOpen-MATH/MP is coupling software for models employing various discretization methods.
• An O(km)-mesh NICAM-COCO coupled simulation is planned on the Oakforest-PACS system.
  – A big challenge for optimization of the codes on the new Intel Xeon Phi processor
  – New insights for the understanding of global climate dynamics
[c/o M. Satoh (AORI/U.Tokyo) @ SC16]
Dataflow of ppOpen-MATH/MP
[M. Matsumoto et al. 2015]
25
Works in FY.2016
• NICAM
  – Original code (40K+ lines): MPI + automatic parallelization by the K compiler
  – OpenMP statements applied since November 2016
  – Problems
    • Fork-join overhead, load imbalance
    • MPI initialization overhead
    • Numerical instability
• COCO
  – OpenMP + MPI
  – Some of the operations were not parallelized
  – Generally, good improvement
• Coupling
  – 56 km NICAM + 0.25-deg COCO for 1 day: done (not stable)
  – 3.5 km NICAM + 0.10-deg COCO using 7,230 nodes of OFP (for 1 month in 3.5 days of computation)
26
Optimization of NICAM
56-km: 32 nodes, 640 processes; 28-km: 128 nodes, 2,560 processes (supported by Intel engineers)
[Figure: elapsed time (0-10 sec.) for code versions Feb.07-56km, Feb.07-28km, Feb.22-56km, Feb.10-28km]
27
Time measurement of 0.25° COCO (1440x1200x62, 60 nodes); columns: OLD, Feb. 9, Latest; each shown as 64 MPI (flat) / 8OMPx8MPI (sec.)
Code on Feb. 8, 2017: generally, codes with fewer threads are faster.
Main loop (960 steps):
  main:                        117.208 / 146.55   | 106.399 / 126.507 | 97.118 / 121.405
  tracer:                      55.145 / 64.34     | 53.311 / 57.369   | 50.301 / 59.403
  brclinic:                    12.56 / 19.074     | 12.552 / 17.571   | 12.449 / 18.026
  brtropic:                    13.666 / 19.895    | 10.9 / 17.549     | 11.591 / 17.267
  surface flux:                8.537 / 8.076      | 9.629 / 10.804    | 4.84 / 4.744
  vel diag. + vertical diff.:  14.833 / 17.129    | 14.529 / 14.469   | 11.749 / 13.34
  sea ice:                     4.152 / 6.772      | 4.097 / 6.847     | 4.547 / 6.76
  history out:                 2.781 / 0.088      | 0.085 / 0.073     | 0.088 / 0.075
  other:                       5.535 / 11.175     | 1.297 / 1.825     | 1.555 / 1.788
Initial setup (I/O):           53.431 / 28.309    | 45.365 / 25.282   | 43.879 / 24.479
Terminal treatment (I/O):      99.434 / 78.926    | 98.156 / 73.943   | 98.363 / 77.291
From MPI_INIT to MPI_FINALIZE: 270.125 / 253.796  | 249.928 / 225.736 | 239.37 / 223.177
Elapse time:                   305 / 260          | 284 / 232         | 274 / 229
Notes:
• By merging the code received from Horikoshi-san, COCO became faster.
• A virtual MPI rank was introduced.
• The OpenMP directives were refined.
28
Time measurement of 0.1° COCO; columns: 0.25° COCO (1440x1200x62), 60 nodes [64 MPI (flat) / 8OMPx8MPI]; 0.1° COCO (3600x3000x62), 375 nodes [64 MPI (flat) / 4OMPx16MPI] and 1125 nodes [8OMPx8MPI / 4OMPx16MPI] (sec.)
Main loop (960 steps):
  main:                        97.118 / 121.405  | 117.522 / 114.275  | 131.839 / 77.346
  tracer:                      50.301 / 59.403   | 50.444 / 49.215    | 60.758 / 20.603
  brclinic:                    12.449 / 18.026   | 13.087 / 14.614    | 18.257 / 4.950
  brtropic:                    11.591 / 17.267   | 12.431 / 17.975    | 20.176 / 16.996
  surface flux:                4.84 / 4.744      | 21.189 / 11.026    | 9.268 / 19.333
  vel diag. + vertical diff.:  11.749 / 13.34    | 12.230 / 11.852    | 13.643 / 6.804
  sea ice:                     4.547 / 6.76      | 5.088 / 7.087      | 7.592 / 6.264
  history out:                 0.088 / 0.075     | 0.235 / 0.148      | 0.172 / 0.323
  other:                       1.555 / 1.788     | 2.818 / 2.358      | 1.975 / 2.073
Initial setup (I/O):           43.879 / 24.479   | 202.445 / 177.477  | 170.517 / 190.236
Terminal treatment (I/O):      98.363 / 77.291   | 797.995 / 439.936  | 438.548 / 1053.745
From MPI_INIT to MPI_FINALIZE: 239.37 / 223.177  | 1117.971 / 731.690 | 740.907 / 1321.533
Elapse time:                   274 / 229         | 1365 / 751         | 752 / 1401
Notes:
• Hybrid is faster than flat MPI.
• It is only 1.48 times faster when 3 times as many nodes are used.
• For a 1-month calculation, it takes 0.246 wall-clock hours (14.7 minutes) using 375 nodes.
• Considering that NICAM takes 72 wall-clock hours, 1.28 (= 375 x 0.246 / 72) may be the ideal node number for COCO (though memory is insufficient) in the NICOCO simulation.
29
NICAM-COCO Coupling with Higher Resolution
• Results in FY.2017
  – Strong scaling: 224 km/56 km atmosphere + 1.0-deg ocean
  – One-year simulation for 14 km (NICAM) + 0.1 deg (COCO)
  – One-hour simulation for 3.5 km (NICAM) + 0.1 deg (COCO)
• Optimization & performance evaluation of the MPMD climate application (NICAM (atmosphere) + COCO (ocean)) on OFP with ppOpen-MATH/MP
El Niño Simulations (1/3), U.Tokyo, RIKEN
The mechanism of the abrupt termination of the 1997/1998 super El Niño has been revealed by atmosphere-ocean coupling simulations for the entire Earth using ppOpen-HPC on the K computer.
31
El Niño Simulations (2/3), U.Tokyo, RIKEN
The mechanism of the abrupt termination of the 1997/1998 super El Niño has been revealed by atmosphere-ocean coupling simulations for the entire Earth using ppOpen-HPC on the K computer.
32
El Niño Simulations (3/3), U.Tokyo, RIKEN
• A Madden-Julian Oscillation (MJO) event remotely accelerates ocean upwelling to abruptly terminate the 1997/1998 super El Niño
• NICAM (atmosphere)-COCO (ocean) coupling by ppOpen-MATH/MP
• Data assimilation + simulations
  – Integration of simulations + data analyses
• NICAM: 14 km + COCO: 0.25-1.00 deg on K
• Next target: 3.5 km + 0.10 deg (5B meshes) on OFP
(ppOpen-MATH/MP)
NICAM Strong Scaling in FY.2017
• Good scalability if the grid size is sufficiently large
[Figure: NICAM strong scaling. Y-axis: simulation years per wall-clock day (0.001-100, log scale); X-axis: number of MPI processes (1-100,000); series: 224 km and 56 km grids vs. ideal scaling. Data labels: 0.4, 1.1, 2.7, 4.1 and 0.01, 0.03, 0.12, 0.40, 1.01, 1.36 SYPD.]
Challenges in FY.2018: NICAM-COCO Coupling with Higher Resolutions
• 3.5 km (NICAM) + 0.1 deg (COCO) (5B meshes)
• I/O coupling
• NICAM/COCO/IO on the same CPU for efficient use of computational resources
  – Most of the computation is for the atmosphere (NICAM)
• NICAM-COCO
• Earthquake Simulations
  – Ichimura, Fujita, Furumura, Nakajima
  – HPC Asia 2018 (Best Paper), IPDPS 2018
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
36
Code by Prof. T. Ichimura & K. Fujita (ERI/U.Tokyo) for Earthquake Simulations
• Finite element method
  – Tetrahedral elements (2nd order)
  – Nonlinear/linear, dynamic/static solid mechanics
  – Mixed precision, EBE-based multigrid
• SC14 & SC15: Gordon Bell Finalist
• SC16: Best Poster Award; Best Paper Award (WACCPD 2016) (GPU)
• SC17: Best Poster Award; WACCPD 2017: Best Poster Award
• HPC Asia 2018: Best Paper Award
• IPDPS 2018 Paper
37
Earthquake Wave Propagation Analysis
[Figure: a) earthquake wave propagation (0 to -7 km depth); b) city response simulation of central Tokyo (Shinjuku, Ikebukuro, Shibuya, Shinbashi, Ueno, Tokyo Station); c) resident evacuation: two million agents evacuating to the nearest safe site, before and after the earthquake.]
Background for Designing the New Algorithm
• Wider SIMD in recent architectures
  – 128 bit: 2 in double precision (2 in single precision) on the K computer
  – 512 bit: 8 in double precision (16 in single precision) on OFP [post-K computer is also 512 bit]
• Smaller B/F ratio (memory transfer capability)
  – (5.3 PB/s) / (10.5 PFLOPS) = 0.5 for the K computer
  – (4 PB/s) / (25 PFLOPS) = 0.16 for Oakforest-PACS
• Random access and data transfer in sparse matrix-vector product kernels become the bottleneck
• How to use this wider SIMD in finite-element methods while circumventing memory transfer bottlenecks?
39
Design of New Algorithm: GAMERA
• Introduce a time-parallel concentrated-computation algorithm for solving Ku_i = f_i (i = 1, 2, ..., nt)
  – Predict solutions of future time steps by solving several time steps together
  – Reduces solver iterations by increasing compute cost per iteration
  – Total FLOP count is not changed much; however, a more efficient kernel can be used
40
[Figure: the kernel in the standard solver is a sparse matrix-vector product (SpMV) on a single vector u_i for the current time step; the kernel in the new solver is a generalized SpMV (GSpMV) applied to the block of vectors u_i, u_{i+1}, u_{i+2}, u_{i+3} (current and future time steps), which are contiguous in memory and therefore SIMD-efficient, with less random access.]
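A minimal Fortran sketch of the idea (CRS storage and all names are assumptions, not the GAMERA code): the same sparse matrix is applied to nv = 4 time-step vectors stored contiguously per unknown, so the innermost loop becomes contiguous and SIMD-friendly instead of purely random access.

      ! Standard SpMV: one vector, random access to u(item(k)).
      subroutine spmv(n, index, item, A, u, f)
        implicit none
        integer, intent(in)  :: n, index(0:n), item(:)
        real(8), intent(in)  :: A(:), u(n)
        real(8), intent(out) :: f(n)
        integer :: i, k
        do i = 1, n
          f(i) = 0.d0
          do k = index(i-1)+1, index(i)
            f(i) = f(i) + A(k)*u(item(k))      ! random access per nonzero
          enddo
        enddo
      end subroutine spmv

      ! GSpMV: nv time-step vectors per unknown, stored contiguously.
      subroutine gspmv(n, nv, index, item, A, u, f)
        implicit none
        integer, intent(in)  :: n, nv, index(0:n), item(:)
        real(8), intent(in)  :: A(:), u(nv,n)
        real(8), intent(out) :: f(nv,n)
        integer :: i, k, v
        do i = 1, n
          f(:,i) = 0.d0
          do k = index(i-1)+1, index(i)
!$omp simd
            do v = 1, nv
              f(v,i) = f(v,i) + A(k)*u(v,item(k))  ! contiguous, SIMD-efficient
            enddo
          enddo
        enddo
      end subroutine gspmv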
41
Standard solver algorithm:
1: set K
2: for each time step i = 1, ..., nt
3:   guess u_i using a standard predictor
4:   set f_i using the previous solution
5:   solve K u_i = f_i from this initial guess (iterative solver with the SpMV kernel)
6: end

Developed algorithm (in the case of 4 vectors):
1: set K and its blocked form for 4 vectors
2: for every 4 time steps
3:   set f_i, ..., f_{i+3} using the previous solutions
4:   guess u_i, ..., u_{i+3} using a standard predictor
5:   while (not converged) do
6:     guess u_{i+1}, ..., u_{i+3} using u_i
7:     refine the solutions u_i, ..., u_{i+3} from these initial guesses (iterative solver with the GSpMV kernel)
8:   end while
9: end
Performance of the New Algorithm (HPC Asia 2018 Best Paper)
• Tested on a linear wave propagation problem solved using a conjugate gradient solver with 3x3 block-diagonal preconditioning (double precision)
  – Number of iterations reduced from 9,068 to 3,541 by the developed method
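For reference, a minimal sketch of applying such a 3x3 block-diagonal preconditioner (names and storage layout are hypothetical, not the authors' code): each node has 3 DOF, and Minv stores the inverted 3x3 diagonal block of that node.

      ! z = M^{-1} r with a 3x3 block-diagonal (block Jacobi) preconditioner
      subroutine block_diag_precond(nnode, Minv, r, z)
        implicit none
        integer, intent(in)  :: nnode
        real(8), intent(in)  :: Minv(3,3,nnode), r(3,nnode)
        real(8), intent(out) :: z(3,nnode)
        integer :: i
!$omp parallel do
        do i = 1, nnode
          z(1,i) = Minv(1,1,i)*r(1,i) + Minv(1,2,i)*r(2,i) + Minv(1,3,i)*r(3,i)
          z(2,i) = Minv(2,1,i)*r(1,i) + Minv(2,2,i)*r(2,i) + Minv(2,3,i)*r(3,i)
          z(3,i) = Minv(3,1,i)*r(1,i) + Minv(3,2,i)*r(2,i) + Minv(3,3,i)*r(3,i)
        enddo
      end subroutine block_diag_precond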
Measured systems and total elapsed time:

  System                Configuration                              Peak DP FLOPS   Peak memory bandwidth                Without time-parallel   With time-parallel   Speedup
  Skylake-SP Xeon Gold  20 cores x 2 sockets x 2 nodes @ 2.4 GHz   3072 GFLOPS     512 GB/s                             326.2 s                 229.4 s              1.42x
  Haswell Xeon E5       12 cores x 2 sockets x 4 nodes @ 2.6 GHz   3993.6 GFLOPS   544 GB/s                             314.5 s                 224.8 s              1.40x
  Oakforest-PACS        68 cores x 2 nodes @ 1.4 GHz               6092.8 GFLOPS   980 GB/s (MCDRAM), 230 GB/s (DDR4)   233.0 s                 138.0 s              1.69x
  K computer            8 cores x 8 nodes @ 2.0 GHz                1024 GFLOPS     512 GB/s                             365.6 s                 244.3 s              1.50x
Weak Scaling on Oakforest-PACS (HPC Asia 2018 Best Paper)
• Scalability improved due to a reduced number of communication messages
  – 91.3% weak-scaling efficiency from 256 to 8,192 nodes
  – 1.55x speedup with the new algorithm on the full system
[Figure: total elapsed time vs. number of compute nodes (256, 512, 1024, 2048, 4096, 8192). With time-parallel: 153.3 s at 256 nodes up to 168.0 s at 8,192 nodes; without time-parallel: 229.2 s at 256 nodes up to 262.0 s at 8,192 nodes; intermediate values 151.6-162.2 s (with) and 230.8-255.3 s (without).]
Strong Scaling on Oakforest-PACS
• 1.5x to 1.6x speedup for all node counts tested
44
[Figure: strong scaling on Oakforest-PACS, 128-2,048 compute nodes; total elapsed time (16-512 s, log scale) with and without the time-parallel algorithm.]
OFP is better than K for weak scaling, but ...
45
[Figure: weak-scaling comparison. Left (OFP): total elapsed time for 256-8,192 nodes, as in the previous figure. Right (K computer): total elapsed time for 1,024-32,768 nodes; labeled values: 251.2 s [18.2%] (69.9%), 356.1 s [7.6%] (70.6%), 311.5 s [14.5%] (55.7%), 454.9 s [6.0%] (55.5%); 4.19 PF and 3.07 PF are noted on the chart.]
• NICAM-COCO
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
  – GMG: Geometric Multigrid
    • Nakajima
  – SA-AMG: Smoothed Aggregation Algebraic Multigrid
• H-Matrix
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
Optimization of Parallel Communication: CGA (Coarse Grid Aggregation) -> hCGA
[Figure: multigrid hierarchy from Level=1 (fine) through Level=m (coarse; mesh # per MPI process = 1); coarse-grid solver on a single core (further multigrid); original CGA vs. hCGA.]
Groundwater flow simulation with up to 4,096 nodes on Fujitsu FX10 (GMG-CG), up to 17,179,869,184 meshes (64^3 meshes/core)
48
[Figure: hCGA weak and strong scaling on FX10. One panel shows elapsed time (sec., 5.0-15.0) vs. core count (100-100,000) for CGA vs. hCGA, with an x1.61 improvement; the other shows parallel performance (%, 0-120) vs. core count (1,024-65,536) for Flat MPI cases C3 and C4, with an x6.27 improvement.]
Preliminary Results on OFP with 8,192 nodes, up to 34B DOF
Performance of the coarse-grid solver is not good.
49
[Figure: elapsed time (sec., 2.5-15.0) vs. node count (10-10,000) for Flat MPI: FX10, Flat MPI: FX10-hCGA, HB 4x16: OFP, HB 4x16: OFP-hCGA; and elapsed time (sec., 2.0-8.0) vs. node count (100-10,000) for HB 4x16: OFP, HB 4x16: OFP-hCGA, HB 8x8: OFP, HB 16x4: OFP.]
Preliminary Results on OFP with 8,192 nodes, up to 34B DOF
Performance of the coarse-grid solver is not good.
50
[Figure: elapsed time (sec., 2.0-8.0) vs. node count (100-10,000) on OFP for HB 4x16, HB 4x16-hCGA, HB 8x8, HB 16x4; and elapsed time (sec., 5.0-15.0) vs. core count (100-100,000) on FX10 for Flat MPI: C3, HB 4x4: C3, HB 8x2: C3, HB 16x1: C3.]
AM-hCGA: Adaptive Multilevel hCGA
• hCGA has only 2 layers; more hierarchical levels are needed for larger process counts
[Figure: three-layer hierarchy (1st, 2nd, 3rd layers).]
• NICAM-COCO
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
  – Ida, Hoshino
  – HPC Asia 2018, IPDPS 2018, ICCS 2018
• Dynamic Loop Scheduling
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
52
Approximation for dense matrices: ℋ-matrices reduce the O(N^2) cost to O(N log N).
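The saving comes from approximating admissible off-diagonal blocks by low-rank products; a standard sketch of the estimate (general ℋ-matrix reasoning, not taken from the HACApK sources):

  A_{ts} \approx U_t V_s^{T}, \qquad U_t \in \mathbb{R}^{m \times k},\; V_s \in \mathbb{R}^{n \times k},\; k \ll \min(m,n)

so storing and multiplying a block costs k(m+n) instead of mn; summed over the block cluster tree this gives O(N log N) memory and matrix-vector work in place of O(N^2).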
53
Enhancement of ℋACApK: Library for Simulations using BEM
• Programming model: MPI + OpenMP
• Available functions: ℋ-matrix construction, ℋ-matrix-vector multiplication, iterative linear solver
• Top 3 achievements in 2017
  – Porting to clusters equipped with KNL and GPUs
  – New framework design for SIMD vectorization
  – Improvement of ℋ-matrices for MPP
• Example analyses
• Future work (enhancement issues)
  – Complex ℋ-matrix arithmetic on clusters equipped with KNL and GPUs: matrix-matrix multiplication, LU factorisation, matrix inversion
  – Eigenvalue problems (application to deep learning)
Library Design of ℋACApK
• The user program (Main) calls the ℋACApK API (parallel ℋ-matrix generation, parallel ℋ-matrix linear solver); the ℋACApK API calls a user-defined function that computes the (i,j) element of the coefficient matrix.
• The user-defined function depends on the target integral equation and is a black box to the library. Does the SIMD directive below work well?

      real(8) function user_func(i,j,st_bemv)
        integer :: i,j
        type(hacapkInput) :: st_bemv
        ! calculate the (i,j) element of the coefficient matrix
      end function user_func

      subroutine hmat_generation(st_bemv)
        type(hacapkInput) :: st_bemv
!$omp parallel do
        do submat = 1, numSubmat
          do j = stl, enl
!$omp simd
            do i = str, enr
              a(i,j) = user_func(i,j,st_bemv)
            end do
          end do
        end do
      end subroutine hmat_generation
54
New Framework Design for SIMD Vectorization
• Two interfaces for vectorization: data access (set_args) and computation (vectorize_func). The framework automatically inserts the 'declare simd' directive (auto-transformation); users do not have to consider SIMD.
• The data-access loop is executed sequentially; the computation loop is obviously vectorizable.

      real(8) function vectorize_func(arg1,arg2,…)
!$omp declare simd(vectorize_func) &
!$omp simdlen(SIMDLENGTH) &
!$omp linear(ref(arg1, arg2, …))
        real(8), intent(in) :: arg1,arg2,…
        ! User-defined calculation
      end function vectorize_func

      subroutine set_args(i,j,st_bemv,arg1,arg2,...)
        integer, intent(in) :: i, j
        type(hacapkInput) :: st_bemv
        real(8), intent(out) :: arg1,arg2,...
        ! Data copy from structure to scalar arguments
      end subroutine set_args

      subroutine hmat_generation(st_bemv)
        type(hacapkInput) :: st_bemv
        real(8),dimension(SIMDLENGTH) :: ans
        real(8),dimension(SIMDLENGTH) :: arg1,arg2,...
        do j = stl, enl
          do i = str, enr, SIMDLENGTH
            ! data access: this loop is sequentially executed
            ii = 1
            do jj = i, min(i+SIMDLENGTH-1, enr)
              call set_args(jj,j,st_bemv,arg1(ii),arg2(ii),…)
              ii = ii+1
            end do
            ! computation: this loop is obviously vectorizable
!$omp simd
            do ii = 1, SIMDLENGTH
              ans(ii) = vectorize_func(arg1(ii),arg2(ii),…)
            end do
            ii = 1
            do jj = i, min(i+SIMDLENGTH-1, enr)
              a(jj,j) = ans(ii)
              ii = ii+1
            end do
          end do
        end do
      end subroutine hmat_generation

T. Hoshino et al., "Design of Parallel BEM Analyses Framework for SIMD Processors" (ICCS 2018)
55
Numerical Evaluations
• Test model of electrostatic field analysis: two cases, a perfectly conducting sphere and a dielectric sphere, each at 1 V and 0.25 m above the ground plane.
• The governing boundary integral equations involve surface integral operators over the domain Ω (including a double-layer potential D[u]); the user-defined functions depend on these integral equations.
[Figure: elapsed time (0-16 sec) for coefficient ℋ-matrix generation on BDW and KNL, original vs. new SIMD design, for the perfectly conducting and dielectric cases (the user-defined functions include branch divergence). Speedups: 2.2x (perfect, BDW), 1.9x (dielectric, BDW), 4.3x (perfect, KNL), 4.1x (dielectric, KNL).]

Evaluation environments:
• BDW: Intel Xeon E5-2695 v4, 18 cores
• KNL: Intel Xeon Phi 7250, 68 cores
• Compiler: Intel Compiler 18.0.1, -qopenmp -O3 -ipo -align array64byte, -xAVX2 (BDW) / -xMIC-AVX512 (KNL)

Performance comparison:
• BDW: approximately 2x speedup
• KNL: over 4x speedup
• For dense matrix generation, the new design achieved up to a 6.6x speedup
56
• NICAM-COCO
• Earthquake Simulations
• Chebyshev Filter Diagonalization
• Multigrid Method
• H-Matrix
• Dynamic Loop Scheduling
  – Nakajima
  – ICPP 2017 (P2S2)
• Matrix Assembly in FEM
• ChainerMN
• OpenFOAM
57
Communication-Computation Overlapping (CC-Overlapping): Static

      call MPI_Isend
      call MPI_Irecv
      do i= 1, Ninn
        (calculations on pure internal meshes)
      enddo
      call MPI_Waitall
      do i= Ninn+1, Nall
        (calculations on boundary meshes)
      enddo

• Good for stencil computations; not so effective for SpMV
• Mesh categories: pure internal meshes, internal meshes on boundaries (boundary meshes), external (HALO) meshes
Communication-Computation Overlapping + Dynamic Loop Scheduling: Dynamic CC-Overlapping

      call MPI_Isend
      call MPI_Irecv
      call MPI_Waitall
      do i= 1, Ninn
        (calculations on pure internal meshes)
      enddo
      do i= Ninn+1, Nall
        (calculations on boundary meshes)
      enddo

• Communication is handled by the master thread; the internal-mesh loop uses dynamic scheduling, the boundary-mesh loop uses static scheduling
• Mesh categories: pure internal meshes, internal meshes on boundaries (boundary meshes), external (HALO) meshes
60
Dynamic Loop Scheduling
• schedule(dynamic) and "!$omp master ... !$omp end master"

!$omp parallel private (neib,j,k,i,X1,X2,X3,WVAL1,WVAL2,WVAL3)
!$omp&         private (istart,inum,ii,ierr)
!$omp master                          Communication is done by the master thread (#0)
!C
!C-- Send & Recv.
      (…)
      call MPI_WAITALL (2*NEIBPETOT, req1, sta1, ierr)
!$omp end master
!C                                    The master thread can join computing of internal
!C-- Pure Internal Nodes              nodes after the completion of communication
!$omp do schedule (dynamic,200)       Chunk size = 200
      do j= 1, Ninn
      (…)
      enddo
!C
!C-- Boundary Nodes                   Computing for boundary nodes is done by all threads
!$omp do                              default: !$omp do schedule (static)
      do j= Ninn+1, N
      (…)
      enddo
!$omp end parallel
Ina, T., Asahi, Y., Idomura, Y., Development of optimization of stencil calculation on Tera-flops many-core architecture, IPSJ SIG Technical Reports 2015-HPC-152-10, 2015 (in Japanese)
61
  Code Name                  KNL                        BDW                        FX10
  Architecture               Intel Xeon Phi 7250        Intel Xeon E5-2695 v4      SPARC64 IXfx
                             (Knights Landing)          (Broadwell-EP)
  Frequency (GHz)            1.40                       2.10                       1.848
  Core # (Max Thread #)      68 (272)                   18 (18)                    16 (16)
  Peak Performance (GFLOPS)  3,046.4                    604.8                      236.5
  Memory (GB)                MCDRAM: 16, DDR4: 96       128                        32
  Memory Bandwidth
  (GB/sec, Stream Triad)     MCDRAM: 490, DDR4: 80.1    65.5                       64.7
  Out-of-Order               Y                          Y                          N
  System                     Oakforest-PACS             Reedbush-U                 Oakleaf-FX
62
  Code Name                       KNL                        BDW                        FX10
  Architecture                    Intel Xeon Phi 7250        Intel Xeon E5-2695 v4      SPARC64 IXfx
                                  (Knights Landing)          (Broadwell-EP)
  Frequency (GHz)                 1.40                       2.10                       1.848
  Core # (Max Thread #)           68 (272)                   18 (18)                    16 (16)
  Peak Performance (GFLOPS)/core  44.8                       33.6                       14.8
  Memory Bandwidth
  (GB/sec, Stream Triad)/core     MCDRAM: 7.21, DDR4: 1.24   3.64                       4.04
  Out-of-Order                    Y                          Y                          N
  Network                         Omni-Path Architecture     Mellanox EDR InfiniBand    Tofu 6D Torus
63
GeoFEM/Cube
• Parallel FEM code (& benchmarks)
• 3D static elastic linear (solid mechanics)
• Performance of parallel preconditioned iterative solvers
  – 3D tri-linear hexahedral elements
  – Block-diagonal LU + CG
  – Fortran90 + MPI + OpenMP
  – Distributed data structure
  – MPI, OpenMP, OpenMP/MPI hybrid
  – Block CRS format (see the sketch below)
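A minimal sketch of a matrix-vector product in such a 3x3 block CRS format (array names and layout are assumptions, not the GeoFEM source): D holds the diagonal 3x3 blocks, AL the off-diagonal 3x3 blocks in CRS order, indexL the row pointers, and itemL the column (node) indices.

      subroutine bcrs_matvec(n, indexL, itemL, AL, D, x, y)
        implicit none
        integer, intent(in)  :: n, indexL(0:n), itemL(:)
        real(8), intent(in)  :: AL(:,:,:), D(3,3,n), x(3,n)
        real(8), intent(out) :: y(3,n)
        integer :: i, k, jc
!$omp parallel do private(k,jc)
        do i = 1, n
          ! diagonal 3x3 block of node i
          y(:,i) = matmul(D(:,:,i), x(:,i))
          ! off-diagonal 3x3 blocks of row i
          do k = indexL(i-1)+1, indexL(i)
            jc = itemL(k)
            y(:,i) = y(:,i) + matmul(AL(:,:,k), x(:,jc))
          enddo
        enddo
      end subroutine bcrs_matvec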
[Figure: cubic FEM model with (Nx-1) x (Ny-1) x (Nz-1) tri-linear hexahedral elements (Nx x Ny x Nz nodes); boundary conditions Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin; uniform distributed force in the z-direction @ z=Zmax. Each element has 8 local nodes at (±1, ±1, ±1) in (ξ, η, ζ), and each node pair (i,j) contributes a 3x3 block a_ij to the element matrix (e = 1, ..., 8).]
Preliminary Results: Best Cases (3,840 cores, 368,640,000 DOF)
Improvement of CG from the original cases
[Figure: speed-up (%) over the original cases (-10% to +25%) for FX10: HB 16x1, BDW: HB 8x2, and KNL: HB 64x2 (2T).]
Preliminary Results: Original Cases (3,840 cores, 368,640,000 DOF)
Communication overhead by collective/point-to-point communications
[Figure: breakdown (0-100%) of Rest / Send-Recv / Allreduce; labeled communication fractions: 7.5% (FX10: HB 16x1), 4.0% (BDW: HB 8x2), 14.6% (KNL: HB 64x2 (2T)).]
66
  Features   Effect of Dynamic Scheduling   Optimum Chunk Size   Notes
  FX10       Medium                         100                  Memory throughput
  BDW        Small                          500+                 Low communication overhead; small number of threads
  KNL        Large                          300-500              Effects are significant for HB 64x2 and 128x1, where the loss of performance due to communications on the master thread is rather small
67
Target Problem
• 256^3 FEM nodes, 50,331,648 DOF
• Strong scaling
  – FX10: 2-1,024 nodes (32-16,384 cores)
  – BDW: 2-512 sockets, 1-256 nodes (32-8,192 cores); Reedbush-U has only 420 nodes of BDW
  – KNL: 4-256 nodes (256-16,384 cores)
• Parallel performance is plotted as speed-up vs. core count relative to the ideal line
  – 100%: on the ideal line
  – < 100%: below the ideal line
  – > 100%: above the ideal line
Strong Scaling: Parallel Performance (%), best case for each HB MxN, 50,331,648 DOF
Computation time of flat MPI at the minimum core count = 100%
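One consistent reading of this metric (my formulation; the slide states only the normalization): with C_0 the minimum core count and T_flat(C_0) the flat-MPI computation time there,

  \mathrm{Performance}(C) = \frac{C_0 \, T_{\mathrm{flat}}(C_0)}{C \, T(C)} \times 100\,\%

so a run on the ideal speed-up line gives 100%, runs below the line give less than 100%, and super-ideal runs exceed 100%.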
68
[Figure: strong-scaling parallel performance (%) vs. total core count.
FX10: 32-16,384 cores; Flat MPI: Original, HB 2x8: Csz=500, HB 4x4: Csz=100, HB 8x2: Csz=100, HB 16x1: Csz=500.
BDW: 32-8,192 cores; Flat MPI: Static, HB 2x8: Csz=500, HB 4x4: Csz=500, HB 8x2: Csz=100, HB 16x1: Csz=500.
KNL: 256-16,384 cores; Flat MPI: Original, HB 2x32: Csz=500, HB 4x16: Csz=100, HB 8x8: Csz=100, HB 16x4: Csz=500, HB 32x2: Csz=500, HB 64x1: Csz=500.]
BDW (Reedbush): InfiniBand EDR; KNL (OFP): Omni-Path Architecture. Problems with the performance of MPI on OPA.
Strong Scaling: Parallel Performance (%), effect of dynamic loop scheduling, 50,331,648 DOF
Computation time of flat MPI at the minimum core count = 100%
[Figure: parallel performance (%) vs. total core count for FX10: HB 16x1 (32-16,384 cores), BDW: HB 8x2 (32-8,192 cores), and KNL: HB 64x1 (1T) (256-16,384 cores), comparing Original, Static, Csz=100, and Csz=500.]
Effect of Dynamic Loop Scheduling with more than 8,192 cores
• FX10: 20%-40%
• BDW: 6%-10%
• KNL with HB 8x8 (1T): 20%-30%
• KNL with HB 64x1 (1T): 40%-50%
Summary
• Excellent performance improvement by CC-overlapping + dynamic loop scheduling
• OFP
  – Problems in strong scaling: general issues
  – Performance of MPI_Allreduce for many MPI processes
  – Improvement of Intel MPI
    • Collaboration with Intel engineers
    • Multiple endpoints
70
Preliminary Results of Multiple Endpoints
• Good for larger-scale problems
• Not good for strong scaling
71
• Overview of IPCC at ITC/U.Tokyo
• Achievements in FY.2017
• Future Perspective
72
Near-Future Plan of ITC/U.Tokyo
• Development of new research areas is needed
  – Mostly CSE (Computational Science & Engineering) so far
  – Data analysis, deep learning, artificial intelligence
    • Genome analysis, medical image recognition: already started with U.Tokyo Hospital and the Research Organization for Genome Medical Science
    • Real-time data analysis for the GOSAT satellite (2008-2014) on T2K
• Integration of CSE & data science
  – Users in atmosphere/ocean science, earthquake science/engineering, fluid/structure simulations, and materials science utilize observed/experimental data for validation of simulations
  – Development of methods & infrastructure for efficient utilization of "Big Data"
  – Data-driven approach
• BDEC (Big Data & Extreme Computing) with 50+ PF after Fall 2020
  – Reedbush is the prototype of the BDEC system
73
Plans in 2018/2019 (1/3)
• Further optimization of sparse linear solvers
  – Serial/parallel performance
  – Point-to-point/collective communications in MPI
    • Improvement of MPI, e.g. multiple endpoints (with Dr. Horikoshi)
• Large-scale simulations
  – NICAM-COCO (atmosphere-ocean)
    • El Niño simulations with a 3.5 km-0.1 deg model
  – Seism3D (seismic wave propagation, FDM)
    • Loop tiling for stencil computations in FDM
    • Space + time
    • Higher density of computation
74
Plans in 2018/2019 (2/3)
• New research towards the exascale/post-Moore era
  – Two proposals to the Japanese government are under review (2nd round) (next page)
  – Heterogeneous architectures
    • CPU, GPU, FPGA, quantum/neuromorphic/custom chips
  – Integration of simulations/data/learning
  – Mixed precision/approximate computing for lower energy consumption
    • JHPCN project in FY.2018
  – New systems: Oakbridge-II (March 2019), BDEC (Fall 2020)
75
• Accelerating Cambrian Explosion of Computing Principles and Systems
  – Grant-in-Aid for Scientific Research on Innovative Areas
  – 9+1 teams, 1.16 BJPY (FY.2018-2022) (2nd-round review)
  – Leading PI: Satoshi Matsuoka (Tokyo Tech -> RIKEN)
  – 4 teams of applications and algorithms
    • Nakajima (U.Tokyo): application development framework
    • Iwashita (Hokkaido): high-performance algorithms
    • Katagiri (Nagoya): low-power/low-accuracy computing, accuracy verification, AT
    • Shimokawabe (U.Tokyo): machine learning, DDA
• Innovative Method for Integration of Computational & Data Sciences in the Exascale Era
  – Grant-in-Aid for Scientific Research (S)
  – 0.20 BJPY (5 years, FY.2018-2022) (2nd-round review)
  – PI: Kengo Nakajima (U.Tokyo)
76
BDEC System (1/3)
• After July 2020
• 60+ PF, ~5.0 MW (w/o A/C), ~1,000 m2
  – External Nodes for Data Acquisition/Analysis (EXN)
  – Internal Nodes for CSE/Data Analysis (INN)
  – Shared File System (SFS)
• External nodes: EXN, "data" nodes
  – 5-10 PFLOPS
  – SFS, individual file cache system
  – Direct access to external networks, real-time acquisition
• Internal nodes: INN, "compute" nodes
  – 50+ PFLOPS, could be different from EXN
  – 1+ PB memory (includes NV), 5+ PB/sec
  – SFS, individual file cache system
BDEC System (2/3): 60+ PF, ~5.0 MW (w/o A/C), ~1,000 m2
[Diagram: Internal Nodes (INN) for compute (50+ PF, 1+ PB, 5+ PB/sec) and External Nodes (EXN) for data (5-10 PF), both connected to the Shared File System (SFS, 60+ PB, 1+ TB/sec) and a file cache system; EXN also connects to external resources and the external network.]
BDEC System (3/3)
• Architectures of EXN and INN could be different
  – EXN could include some GPU, FPGA, quantum, neuromorphic, etc.
• Network between EXN and INN
  – EXN and INN do not necessarily cooperate with each other
  – EXN is the very special and untraditional part
    • On-line data-driven approaches run on EXN
• Shared File System: SFS
  – 50+ PB, 1+ TB/sec
  – Both EXN and INN can access SFS directly
• Possible applications: data-driven approach
  – Atmosphere-ocean simulations with data assimilation
  – Earthquake simulations with data assimilation
    • O(10^3) points for measurement, MHz, more accurate UG model
  – Real-time simulations (e.g. flood, earthquakes, tsunami)
79
80
Real-time assimilation of seismic observation data and forecasting of ground shaking by fast simulation
• Observation networks: strong-motion network (~2,000 points), seismic-intensity network (~4,000 points), and lifeline operators (electricity, gas, railways, etc.; tens of thousands of points?), e.g. the Tokyo Gas ultra-high-density seismic disaster-prevention system (4,000 points)
• Seismic observation data from Japan's dense strong-motion/intensity networks and from private operators are collected over a high-speed network, and ground-motion simulations run continuously on a supercomputer to assimilate the shaking data for the Japanese islands.
• Once earthquake shaking is observed, the computation is accelerated to predict and warn of strong shaking and long-period ground motion at each location within a few seconds; data assimilation then continues and the forecasts (Forecast 1, Forecast 2, ..., Forecast n) are updated.
• By predicting the actual "strong shaking" rather than the "seismic intensity" value of the current Earthquake Early Warning, damage to buildings, etc. can be detected and linked to evacuation response.
• Goal: prediction of strong and long-period ground motion and disaster mitigation via data assimilation of dense seismic observations across the Japanese islands
• Current computation scale: K computer, 2,000 nodes, 4 hours
[Furumura, ERI/U.Tokyo]
Plans in 2018/2019 (3/3)
• Further development & library work
  – Parallel reordering (Kawai (RIKEN))
  – Multigrid solvers
    • JHPCN project in FY.2018
  – H-matrices
  – Mixed precision/approximate computing
• Student support
  – Parallel multigrid (Nomura)
  – ChainerMN (Tamura)
• Tutorials and classes
  – OpenFOAM
  – etc.
81
Parallel Reordering for ILU Preconditioners
ESSEX-II: Equipping Sparse Solvers for Exascale
• To support exascale systems with ILU preconditioners, we proposed a hierarchical parallelization of multi-coloring:
  – Step 1: separate the elements into several parts
  – Step 2: create and color a new graph
  – Step 3: scatter the coloring result
  – Step 4: color the elements in parallel, based on the colored areas
• The number of iterations was almost constant for every node count on a graphene model (500 million DoF)
• Environment
  – Oakleaf-FX (SPARC64 IXfx), 128-4,800 nodes
  – Block ICCG
  – Parallelizing 10-color AMC
  – Block size = 4, diagonal shifting = 100