Preliminary Implementation of GRAPES global model on Sunway TaihuLight. Zhiyan JIN (Numerical Weather Prediction Center, China Meteorological Administration); Wei Xue, Ping Xu, Hongsong Meng, Yongbin Jiang, Zhao Liu (Tsinghua University and National Supercomputing Center in Wuxi). 18th Workshop on HPC in Meteorology, ECMWF, 24-28 September 2018.
Transcript
Page 1: Preliminary Implementation of GRAPES global model Sunway ...

Preliminary Implementation of GRAPES global model on Sunway TaihuLight

Zhiyan JIN
Numerical Weather Prediction Center, China Meteorological Administration

Wei Xue, Ping Xu, Hongsong Meng, Yongbin Jiang, Zhao Liu
Tsinghua University and National Supercomputing Center in Wuxi

18th Workshop on HPC in Meteorology, ECMWF, 24-28 September 2018

Page 2: Preliminary Implementation of GRAPES global model Sunway ...

Outline

1. Introduction to GRAPES global model
2. Introduction to Sunway TaihuLight
3. Refactoring of critical computing kernels of GRAPES
– Semi-Lagrange interpolation
– Helmholtz solver
– Halo communication optimization
– Stencil kernel optimization
4. Optimizations of GRAPES with OpenACC*
5. Evaluation
6. Summary

Page 3: Preliminary Implementation of GRAPES global model Sunway ...

Introduction of GRAPES global model

Page 4: Preliminary Implementation of GRAPES global model Sunway ...

Meteorological Centers and Research Institutions in CMA Campus

CMA
• National Satellite Met. Center
• CMA Public Met. Service Center
• CMA Met. Observation Center
• National Met. Information Center
• National Climate Center
• CMA Weather Modification Center
• Chinese Academy of Met. Sciences
• 8 CMA Specialized Research Institutes
• National Meteorological Center
• CMA Numerical Weather Prediction Center

Mission: 0-10 day weather forecast

Page 5: Preliminary Implementation of GRAPES global model Sunway ...

History of CMA NWP Operational systems
From imported to self-developed core techniques/systems

Imported system:
• T63/T106 GFS; HLAFS + Typhoon
• T213 GFS + GEPS + Global Typhoon + SSI; HLAFS + Regional Typhoon; wave, dust, air quality
• T639 GFS; T639 GEPS; wave, dust, air quality

Self-developed GRAPES system:
• GRAPES project (2000/2001)
• GRAPES_Meso operation
• GRAPES_GFS operation
• GRAPES_GFS & GEPS; GRAPES-Meso & REPS; high-resolution wave, dust, air quality

Years marked on the timeline: 1995/1997, 2000/2001, 2002, 2006, 2008, 2016, 2018

Page 6: Preliminary Implementation of GRAPES global model Sunway ...

About GRAPES
GRAPES (Global/Regional Assimilation PrEdiction System)

• Model
– Fully compressible equations with shallow atmosphere approximation
– Regular lat/lon, Arakawa-C with V at poles
– Terrain-following Z, Charney-Phillips staggering
– 2TL-SISL time integration
– PRM scalar advection (conservation & monotonicity)

• Assimilation
– Unified 3/4DVAR framework
– Incremental analysis
– Digital filter, initialization in 3DVAR, weak constraint in 4DVAR

Page 7: Preliminary Implementation of GRAPES global model Sunway ...

GRAPES Dynamic Core

Fully compressible equations

Height‐based terrain‐following coordinate

Option of Hydrostatic and Non‐hydrostatic

Modified Arakawa C lat‐lon horizontal grid

Charney‐Phillips vertical grid

Off‐centered 2‐time‐level semi‐implicit semi‐Lagrangian (SISL) time‐stepping

3D vector form of SISL formulation

PRM for scalar advection 

Preconditioned GCR for Helmholtz Eq.

Spherical & polar effects of trajectory calculation

Polar filter

Page 8: Preliminary Implementation of GRAPES global model Sunway ...


Page 9: Preliminary Implementation of GRAPES global model Sunway ...

Physics package
• WRF physics for meso-scale application
• Physics for global forecast
– Radiation: RRTMG LW (V4.71) / SW (V3.61)
– Cumulus: Simplified Arakawa Schubert
– Microphysics: CMA two-moment microphysics
– Cloud: Prognostic
– Land surface: CoLM
– Gravity wave drag: Kim & Arakawa 1995; Lott & Miller 1997; Alpert 2004
– Small-scale orographic form drag: Beljaars, Brown & Wood (2004)

Page 10: Preliminary Implementation of GRAPES global model Sunway ...

Operational NWP Configs at CMA

Global: GRAPES_GFS/GEPS
– 25/50 km deterministic/ensemble (M30)
– 60 vertical levels (~3 hPa top)
– 10-day forecast twice daily
– 4DVAR, 100 km inner loop

East Asia: GRAPES_Meso/REPS
– 3+10/10 km deterministic/ensemble (M15)
– 50 vertical levels (~50 hPa top)
– 24-hour forecast (eight times/day) / 120-hour forecast (two times/day)
– 3DVAR

[Figure: model domains labelled 3 km, 10 km, 25 km, 50 km]

Page 11: Preliminary Implementation of GRAPES global model Sunway ...

ACC & RMSE of 500hPa height 5‐day forecast

Northern Hemisphere

Page 12: Preliminary Implementation of GRAPES global model Sunway ...

Comparison of precipitation forecast over China among ECMWF, JMA & CMA GRAPES_GFS, Meso

ETS score of 48h forecast of rain belt (series include: Forecaster)

Page 13: Preliminary Implementation of GRAPES global model Sunway ...

New CMA HPC
• 2 computers, peak performance: 8189.5 TFLOPS
• Parastor 300 storage: 23,088 TB
• Node/CPU: 3076 nodes, 98,432 cores
– Based on Intel Xeon Gold 6124 (2.66 GHz, 16 cores) processor
– 2 CPUs/node (16 cores/CPU)
• 100 Gb/s EDR InfiniBand interconnect
• RedHat Enterprise Linux Server V7.4
and
• 4 x Intel KNL 7250 (68 cores, 1.4 GHz) x 6, 73.1 TFLOPS
• 2 x Tesla P100 GPU x 24, 289.5 TFLOPS

Page 14: Preliminary Implementation of GRAPES global model Sunway ...

Heterogeneous and many-core architectures become mainstream

System | Launch date | Rpeak (PFLOPS) | Rmax (PFLOPS) | Power efficiency (MFLOPS/W) | Cores | Architecture
Summit | 2018-06 | 187.66 | 122.30 | 13889.05 | 2,282,544 | GPU acceleration
TaihuLight | 2016-06 | 125.44 | 93.02 | 6051.13 | 10,649,600 | Heterogeneous many-core
Sierra | 2018-06 | 119.19 | 71.61 | / | 1,572,480 | GPU acceleration
Tianhe-2A | 2018-06 | 100.68 | 61.44 | 3324.56 | 4,981,760 | Matrix2000 acceleration
ABCI | 2018-06 | 32.58 | 19.88 | / | 391,680 | GPU acceleration
Piz Daint | 2017-06 | 25.33 | 19.59 | 8622.36 | 361,760 | GPU acceleration
Titan | 2012-11 | 27.11 | 17.59 | 2142.77 | 560,640 | GPU acceleration
Sequoia | 2012-06 | 20.13 | 17.17 | 2176.58 | 1,572,864 | BQC

The potential of refactoring the current operational model has to be evaluated as early as possible.

Page 15: Preliminary Implementation of GRAPES global model Sunway ...

Target Platform: Sunway TaihuLight

Entire system
Peak performance: 125 PFlops
Linpack performance: 93 PFlops / 74.4%
Total memory: 1310.72 TB
Total memory bandwidth: 5591.45 TB/s
# nodes: 40,960
# cores: 10,649,600

Page 16: Preliminary Implementation of GRAPES global model Sunway ...

SW26010 Processor

[Figure: SW26010 processor. Four core groups (Core Group 0 to 3) are connected by a network on chip (NoC). Each core group contains one MPE, an 8*8 CPE mesh, a PPU, an iMC and memory. Inside the 8*8 CPE mesh, each computing core has registers and an LDM (SPM) and is attached to a row communication bus, a column communication bus, a control network, a data transfer network and a transfer agent (TA). Levels shown: memory level, LDM level, register level, computing level.]

Page 17: Preliminary Implementation of GRAPES global model Sunway ...


Page 18: Preliminary Implementation of GRAPES global model Sunway ...


SW26010 Direct Memory Access (DMA): 22.6 GB/s

Page 19: Preliminary Implementation of GRAPES global model Sunway ...


Global Load/Store (Gload/Gstore) 1.5 GB/s

Page 20: Preliminary Implementation of GRAPES global model Sunway ...

Register Communication of SW26010

[Figure: register communication between CPEs; labels: Put, Get R, Get C]

    // P2P test
    if (id % 2 == 0)
        while (1)
            putr(data, id + 1);
    else
        while (1)
            getr(&data);

Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017.

Latency: less than 11 cycles

Integrated Bandwidth: 637 GB/s

Page 21: Preliminary Implementation of GRAPES global model Sunway ...

Summary of SW26010 Processor

• Heterogeneous architecture

• Manual cache system (SPM)

• Direct memory access (DMA)

• Limited register communication

The different computing resources are fully controlled by the developer, at the cost of high development effort.

Page 22: Preliminary Implementation of GRAPES global model Sunway ...

NRCPC

Principal Programming Model on TaihuLight

MPI + X, where X is OpenACC* or Athread

• One MPI process runs on one management core (MPE)
• OpenACC* is a directive-based programming tool for SW26010
– Based on OpenACC 2.0
– Extensions for the architecture of SW26010
– Supported by the SWACC/SWAFORT compiler
– OpenACC* conducts data transfer between main memory and on-chip memory (SPM), and distributes the kernel workload across compute cores (CPEs)
• Athread is the threading library that manages threads on the compute cores (CPEs); it is used in the OpenACC* implementation

Page 23: Preliminary Implementation of GRAPES global model Sunway ...

Semi‐Lagrange Interpolation

Page 24: Preliminary Implementation of GRAPES global model Sunway ...

Semi‐Lagrange Interpolation

[Figure: a subdomain (Domain) surrounded by its Halo; a Grid Point and its Departure Point, displaced by u∆ and v∆]

If the departure point is outside the domain, a halo is needed for interpolation.
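The halo requirement can be illustrated in one dimension. This is a minimal sketch, not the GRAPES code: the function names, the uniform grid, the constant time step and the linear interpolation are all assumptions made for illustration.

```python
import math

def departure_index(i, u, dt, dx):
    """Fractional grid index of the departure point for arrival point i."""
    return i - u * dt / dx

def required_halo(i_start, i_end, winds, dt, dx, stencil=1):
    """Halo width (in cells) so that every departure point of the
    subdomain [i_start, i_end] can be interpolated from local data.
    stencil = points needed on each side (1 for linear interpolation)."""
    halo = 0
    for i, u in zip(range(i_start, i_end + 1), winds):
        base = math.floor(departure_index(i, u, dt, dx))
        lo = base - (stencil - 1)      # leftmost point used
        hi = base + stencil            # rightmost point used
        halo = max(halo, i_start - lo, hi - i_end)
    return halo

def interpolate(field, d):
    """Linear interpolation of field (index -> value) at fractional index d."""
    i0 = math.floor(d)
    w = d - i0
    return (1.0 - w) * field[i0] + w * field[i0 + 1]
```

With a displacement of 2.5 cells per step, a subdomain covering indices 10 to 19 needs a halo of 3 cells on the upwind side; stronger winds or longer time steps widen it.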

Page 25: Preliminary Implementation of GRAPES global model Sunway ...

Semi-Lagrange interpolation

[Figure: time of SL interpolation per MPI task; tasks near the poles need a huge halo, the others a normal halo]

• Heavy communication
• Big data arrays increase cache misses
• Performance at the poles is very poor

Page 26: Preliminary Implementation of GRAPES global model Sunway ...

Semi-Lagrange departure point interpolation

[Figure: MPI task partition for grid points and departure points (Task A, Task B), and thread partition within an MPI task (subdomains labelled S0, S1, S2, ...). Arrows between the tasks: "send departure point's location" and "send interp. results".]

The grid points and departure points in an MPI task are partitioned in both the horizontal and vertical dimensions into pieces small enough to fit into the 64 KB SPM of a CPE. Each subdomain has the small halo needed by interpolation.
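The partitioning rule above can be sketched as a search for a block that fits in the SPM. Only the 64 KB SPM size comes from the slide; the function names, the 8-byte values, the halo of 2 points and the halving strategy are assumptions for illustration.

```python
SPM_BYTES = 64 * 1024  # SPM size of one CPE, as stated on the slide

def block_bytes(bi, bj, bk, halo, n_fields, bytes_per_value=8):
    """Bytes needed for one block including its horizontal halo."""
    return (bi + 2 * halo) * (bj + 2 * halo) * bk * n_fields * bytes_per_value

def largest_block(ni, nj, nk, halo=2, n_fields=1, budget=SPM_BYTES):
    """Halve the largest extent until the block fits in the budget.
    Returns (bi, bj, bk): block extents without halo."""
    bi, bj, bk = ni, nj, nk
    while block_bytes(bi, bj, bk, halo, n_fields) > budget:
        if bi == bj == bk == 1:
            raise ValueError("even a single point does not fit")
        if bk >= bi and bk >= bj:
            bk = (bk + 1) // 2
        elif bi >= bj:
            bi = (bi + 1) // 2
        else:
            bj = (bj + 1) // 2
    return bi, bj, bk
```

For a 90 x 45 x 60 local domain and one double-precision field, this heuristic ends at a 12 x 23 x 15 block (51,840 bytes with halo). A real implementation would also weigh DMA chunk length and load balance.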

Page 27: Preliminary Implementation of GRAPES global model Sunway ...

Semi-Lagrange Interpolation

GRAPES global, 0.5°, DP, TaihuLight, subroutine "get_upstream_SETTLE", total time in 72 steps

MPI tasks | MPE (seconds) | CPEs (seconds) | Speedup
8x8 | 51.7 | 7.6 | 6.80
8x16 | 26.4 | 4.1 | 6.44
16x16 | 13.7 | 2.2 | 6.23
32x32 | 3.6 | 0.6 | 6.00
32x64 | 1.8 | 0.4 | 4.50
64x64 | 0.83 | 0.27 | 3.07

Page 28: Preliminary Implementation of GRAPES global model Sunway ...

Helmholtz Solver

Page 29: Preliminary Implementation of GRAPES global model Sunway ...

Characteristics of Helmholtz Eq.

• Math. model & matrix characteristics
– Large-scale non-symmetric linear equations for the globe
• 25 km horizontal resolution: 1440 x 720 x 36 = 37,324,800
– 19-diagonal coefficient matrix; after scaling with the diagonal element:
• C1 = 1.0
• C10/C15 ~ $10^{-1}$
• C2/C3 ~ $10^{-2}$
• C4/C5 ~ $10^{-3}$
• Others <= $10^{-5}$
– Poor distribution of eigenvalues
• 100 km horizontal resolution: max/min ~ $3 \times 10^{4}$

[Figure: stencil of the Helmholtz equation with coefficient labels 1, 2, 3, 4, 5, 10, 15]

Page 30: Preliminary Implementation of GRAPES global model Sunway ...

Improved pre‐GCR algorithm (IGCR)

Compare to baseline algorithm: k 10, -1 Allreduce, -1 SpMV, k BLAS1

• Preconditioning with the Restricted Additive Schwarz (RAS) domain decomposition scheme
– Only one overlapping layer is enough
– An additional halo update is introduced
• Improved GCR algorithm for strength reduction and communication reduction
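For readers unfamiliar with the baseline, the following is a minimal sketch of a textbook preconditioned GCR iteration on a small dense non-symmetric system. It is not the IGCR variant from the slide: a Jacobi (diagonal) preconditioner stands in for the RAS preconditioner, dense Python lists stand in for the 19-diagonal matrix, and all function names are illustrative.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

def jacobi(A):
    """Diagonal preconditioner: z = D^-1 r (stand-in for RAS)."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / d for ri, d in zip(r, diag)]

def pgcr(A, b, precond, tol=1e-10, max_iter=100):
    """Preconditioned GCR for A x = b, starting from x0 = 0."""
    x = [0.0] * len(b)
    r = list(b)                       # r0 = b - A x0
    p = precond(r)
    Ap = matvec(A, p)
    ps, Aps = [p], [Ap]               # search directions and their images
    for it in range(max_iter):
        alpha = dot(r, Ap) / dot(Ap, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        if dot(r, r) ** 0.5 < tol:    # the inner products are the Allreduce points
            return x, it + 1
        z = precond(r)
        Az = matvec(A, z)
        p, Ap = z, Az
        for pj, Apj in zip(ps, Aps):  # make A p orthogonal to every earlier A p_j
            beta = -dot(Az, Apj) / dot(Apj, Apj)
            p = axpy(beta, pj, p)
            Ap = axpy(beta, Apj, Ap)
        ps.append(p)
        Aps.append(Ap)
    return x, max_iter
```

Each iteration costs one matrix-vector product, one preconditioner application and several inner products. On a distributed machine the inner products become global reductions, which is the cost the improved and communication-avoiding variants aim to reduce.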

Page 31: Preliminary Implementation of GRAPES global model Sunway ...

Convergence improvement

[Figure: residual (log scale, 1.00E-13 to 1.00E-01) versus iterations (1 to 126) for 10 km GRAPES GFS with 64*64 process parallelism; series: original GCR, Restricted Additive Schwarz method]

• Convergence of the RAS method and the original method

Page 32: Preliminary Implementation of GRAPES global model Sunway ...

Sparse Matrix-vector Multiplication for SW26010 processor

• Chunk access of a k column
– Matrix: contiguous storage in (19, k, i, j) order and bulk access of the 19 nonzeros of each row of the matrix
– Vector: strided access following the geometry
• On the vector side, each core has to read 9 columns of data to compute one column

[Figure: (i, j, k) grid with k columns assigned to Core (0,0), Core (0,1), Core (1,0), Core (1,1)]
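The storage scheme can be sketched as a stencil-style SpMV in which the 19 coefficients of a row sit next to each other. The slides do not list the 19 stencil points; the offset list below (9 points in the same level plus 5 in each neighbouring level) is an assumption chosen so that the horizontal footprint is the 3 x 3 = 9 columns mentioned above. Out-of-range neighbours are simply skipped.

```python
# Hypothetical 19-point stencil: 9 points in level k, 5 in k-1, 5 in k+1.
OFFSETS = (
    [(di, dj, 0) for dj in (-1, 0, 1) for di in (-1, 0, 1)]
    + [(0, 0, dk) for dk in (-1, 1)]
    + [(di, dj, dk) for dk in (-1, 1)
       for (di, dj) in ((-1, 0), (1, 0), (0, -1), (0, 1))]
)

def spmv(coef, x, ni, nj, nk):
    """y = A x with coef[j][i][k][0..18]: the 19 nonzeros of a row are
    contiguous and k is the next-fastest index, as in (19, k, i, j) order."""
    y = [[[0.0] * nk for _ in range(ni)] for _ in range(nj)]
    for j in range(nj):
        for i in range(ni):
            for k in range(nk):
                row = coef[j][i][k]            # bulk access of 19 nonzeros
                acc = 0.0
                for c, (di, dj, dk) in zip(row, OFFSETS):
                    ii, jj, kk = i + di, j + dj, k + dk
                    if 0 <= ii < ni and 0 <= jj < nj and 0 <= kk < nk:
                        acc += c * x[jj][ii][kk]
                y[j][i][k] = acc
    return y

def columns_touched():
    """Distinct (di, dj) vector columns needed to compute one k column."""
    return len({(di, dj) for di, dj, _ in OFFSETS})
```

With this layout a core streams the matrix coefficients of its column in one contiguous run, while the vector values come from the column itself and its eight horizontal neighbours.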

Page 33: Preliminary Implementation of GRAPES global model Sunway ...

Sparse Matrix-vector Multiplication for SW26010 processor

• To compute each column, each core has to access 9 columns of data
• Collaborative data access by multiple cores reduces the total data access volume but introduces more synchronization
• Good trade-off: 2*2 cores share data with each other by register communication, so each core only needs to read 4 columns of data

[Figure: pairs of Core (0,0) and Core (0,1) exchanging column data]

Page 34: Preliminary Implementation of GRAPES global model Sunway ...

Sparse Matrix-vector Multiplication for SW26010 processor

• The more cores share data with each other, the less of x each core needs to read.
• Cores that share data with each other need synchronization; the more cores, the more expensive the synchronization.

Group | Average columns of data to read per core | Cores to synchronize
1*1 | 9 | 0
2*2 | 4 | 4
4*4 | 9/4 | 16
8*8 | 25/16 | 64
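The table values are consistent with a simple count: an n*n group of cores owns n*n columns and needs one extra column on every side, so it reads (n+2)^2 columns in total. The sketch below encodes that reading; the formula is inferred from the numbers in the table rather than stated on the slide, and the function names are illustrative.

```python
from fractions import Fraction

def columns_per_core(n):
    """Average vector columns read per core when an n*n group of cores
    shares data: (n + 2)^2 columns spread over n*n cores."""
    return Fraction((n + 2) ** 2, n * n)

def cores_to_synchronize(n):
    """Cores taking part in each synchronization (none when a core works alone)."""
    return 0 if n == 1 else n * n

for n in (1, 2, 4, 8):
    print(f"{n}*{n}: {columns_per_core(n)} columns per core, "
          f"{cores_to_synchronize(n)} cores to synchronize")
```

The read volume per core falls quickly from 1*1 to 2*2 and only slowly afterwards, while the synchronization group keeps growing, which is why the slide picks 2*2 as the trade-off.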

Page 35: Preliminary Implementation of GRAPES global model Sunway ...

Fine-grained parallel Incomplete LU factorization for SW26010 processor

• Example of 7-point ILU

[Figure: 8×8 CPE mesh, cores (0,0) to (7,7), each labelled with the step (1 to 15) at which it computes: core (0,0) at step 1, cores (0,1) and (1,0) at step 2, and so on along anti-diagonals, with core (0,7) at step 8 and core (7,7) at step 15]
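The step numbering in the figure matches a wavefront schedule. A minimal sketch, assuming each core (i, j) can start only after cores (i-1, j) and (i, j-1) have finished; the dependency pattern and the function name are assumptions read off the figure labels.

```python
def wavefront_schedule(rows=8, cols=8):
    """Map step number -> cores (i, j) active at that step.
    Core (i, j) runs at step i + j + 1, one step after its
    upper and left neighbours."""
    steps = {}
    for i in range(rows):
        for j in range(cols):
            steps.setdefault(i + j + 1, []).append((i, j))
    return steps

schedule = wavefront_schedule()
for step in sorted(schedule):
    print(step, len(schedule[step]), "active cores")
```

Only the middle steps keep a full anti-diagonal of 8 cores busy, so pipelining several independent planes through the mesh is the usual way to raise utilization.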

Page 36: Preliminary Implementation of GRAPES global model Sunway ...

Fine-grained parallel Incomplete LU factorization for SW26010 processor

• How to handle diagonal communication on SW26010?
– Reduction in communication path graph
– Diagonal 2D partition

[Figure: two 4×4 grids of cores (0,0) to (3,3); in the diagonal 2D partition the cores are arranged over Row 0 to Row 3 and Column 0 to Column 6]

Page 37: Preliminary Implementation of GRAPES global model Sunway ...

Communication-Avoiding GCR for Sunway TaihuLight

Algorithm: communication-avoiding preconditioned generalized conjugate residual method

$x_0,\ r_0 = b - A x_0,\ z_0 = M^{-1} r_0,\ p_0 = z_0$
while not converged do
  Calculate $V = [p_0,\ M^{-1}A p_0,\ (M^{-1}A)^2 p_0,\ \dots,\ (M^{-1}A)^s p_0,\ z_0,\ M^{-1}A z_0,\ \dots,\ (M^{-1}A)^{s-1} z_0]$
  Calculate $G_1 = V^T A^T A V,\ G_2 = V^T M^T A V$
  Let $m_0 = [0_{s+1}, 1, 0_{s-1}]^T,\ n_0 = [1, 0_{2s}]^T,\ l_0 = [0_{2s+1}]^T$
  for $k = 0 : s-1$ do
    $\alpha_k = m_k^T G_2 n_k / n_k^T G_1 n_k$
    $l_{k+1} = l_k + \alpha_k n_k$
    $m_{k+1} = m_k - \alpha_k T n_k$
    $\beta_{kj} = -m_{k+1}^T G_1 n_j / n_j^T G_1 n_j$
    $n_{k+1} = m_{k+1} + \sum_{j=0}^{k} \beta_{kj} n_j$
  end for
  $z_s = V m_s,\ p_s = V n_s,\ x_s = x_0 + V l_s$
  $z_0 = z_s,\ p_0 = p_s$
end while

Page 38: Preliminary Implementation of GRAPES global model Sunway ...

Communication-Avoiding GCR for Sunway TaihuLight

[Figure: Performance for 10 km GFS case, s = 6. Time (s, axis 0 to 200) versus parallelism (32*32, 64*32, 64*64, 128*64, 128*128, 256*128); series: original GCR, CA-GCR]

• Some computing kernels still have room for optimization.
• If the scale continues to expand, CA-GCR may beat the original GCR.

Page 39: Preliminary Implementation of GRAPES global model Sunway ...

Halo Communication Optimization

Page 40: Preliminary Implementation of GRAPES global model Sunway ...

Halo communication optimization

• Array transpose
– Domain partition on the i and j dimensions
– (i, k, j) order -> (k, i, j) order for DMA-friendly data access
– The conversion can be done at the beginning and at the end of the loop code to minimize memory access overhead.
• Assign tasks to a subset of the CPEs by column
– Simultaneous access by all 64 cores over-provisions the memory subsystem of SW26010 when the access pattern is good
– Access by fewer cores means larger chunks per access, and thus better bandwidth utilization
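Why the (k, i, j) order is DMA friendly can be seen from the memory offsets. The sketch below mimics Fortran column-major storage with flat Python lists; the function names are illustrative and the sizes in the usage note are arbitrary.

```python
def offset_ikj(i, k, j, ni, nk):
    """0-based flat offset of A(i,k,j) in Fortran (column-major) order."""
    return i + ni * (k + nk * j)

def offset_kij(k, i, j, ni, nk):
    """0-based flat offset of A_c(k,i,j) in Fortran (column-major) order."""
    return k + nk * (i + ni * j)

def transpose_ikj_to_kij(a, ni, nk, nj):
    """Return a new flat array holding the same data in (k,i,j) order."""
    out = [0.0] * (ni * nk * nj)
    for j in range(nj):
        for i in range(ni):
            for k in range(nk):
                out[offset_kij(k, i, j, ni, nk)] = a[offset_ikj(i, k, j, ni, nk)]
    return out

def k_stride(offset_fn, ni, nk):
    """Distance in elements between levels k and k+1 of one column.
    The first two arguments are swapped deliberately so the same call
    works for both layouts: all indices other than k are zero."""
    if offset_fn is offset_ikj:
        return offset_fn(0, 1, 0, ni, nk) - offset_fn(0, 0, 0, ni, nk)
    return offset_fn(1, 0, 0, ni, nk) - offset_fn(0, 0, 0, ni, nk)
```

In (i, k, j) order the levels of one column are ni elements apart, so fetching a column needs nk small strided transfers. After the transpose they are adjacent, and one column is a single contiguous chunk of nk elements.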

Page 41: Preliminary Implementation of GRAPES global model Sunway ...

Halo communication optimization

• Frequent halo communications may hurt performance
• Solution:
– Offload the data packing to the CPE cluster
– Overlap data packing with communication

[Figure: timeline. Spawn Athread routine -> MPI Send/Recv -> Athread join. The packing is done by the CPEs and overlaps with the MPI communication done by the MPE and the network.]
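The spawn / communicate / join pattern can be sketched with a worker thread standing in for the CPE cluster. This only illustrates the control flow: `send` is a hypothetical stand-in for MPI Send/Recv, the function names are made up, and Python threads do not give the real hardware overlap.

```python
from concurrent.futures import ThreadPoolExecutor

def pack(field, indices):
    """Gather halo values into a contiguous send buffer (the data packing)."""
    return [field[i] for i in indices]

def exchange_all(fields, indices, send):
    """Pack buffer n+1 on the worker while buffer n is being sent."""
    if not fields:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(pack, fields[0], indices)            # spawn
        for nxt in range(1, len(fields) + 1):
            buf = future.result()                                 # join
            if nxt < len(fields):
                future = pool.submit(pack, fields[nxt], indices)  # spawn next
            results.append(send(buf))                             # overlaps packing
    return results
```

The order of operations matters: the next packing job is submitted before the current buffer is sent, so the two can proceed at the same time.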

Page 42: Preliminary Implementation of GRAPES global model Sunway ...
Page 43: Preliminary Implementation of GRAPES global model Sunway ...

Stencil‐like kernel Optimization

Page 44: Preliminary Implementation of GRAPES global model Sunway ...

Stencil kernel optimization (x‐axis stencil case)

• Neighbouring CPEs can share the dependent data by register communication rather than reading it from memory

[Figure: an XYZ domain distributed over Core (0,0), Core (0,1), Core (0,2), Core (0,3), ..., Core (7,7); in the XOZ plane, Core (0,j-1), Core (0,j) and Core (0,j+1) hold x[i-1], x[i] and x[i+1]]
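A minimal sketch of the idea for a 3-point stencil along x: every core loads only its own chunk once, and the two edge values it is missing are handed over by its neighbours. The list slicing stands in for the DMA load and the neighbour lookups stand in for register communication; the function name and zero boundary values are assumptions.

```python
def stencil_with_neighbour_exchange(x, n_cores):
    """y[i] = x[i-1] + x[i] + x[i+1], with zero values outside the array."""
    n = len(x)
    if n % n_cores != 0:
        raise ValueError("array length must be a multiple of n_cores")
    size = n // n_cores
    # One load per core: no core reads its neighbours' data from memory.
    chunks = [x[c * size:(c + 1) * size] for c in range(n_cores)]
    y = []
    for c, mine in enumerate(chunks):
        # Values received from the neighbouring cores.
        left = chunks[c - 1][-1] if c > 0 else 0.0
        right = chunks[c + 1][0] if c < n_cores - 1 else 0.0
        padded = [left] + mine + [right]
        y.extend(padded[i - 1] + padded[i] + padded[i + 1]
                 for i in range(1, size + 1))
    return y
```

Without the exchange, each core would have to load its chunk plus one element on each side, so boundary elements would be read from memory twice.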

Page 45: Preliminary Implementation of GRAPES global model Sunway ...

NRCPC

Exploring thread‐level parallelism of GRAPES with OpenACC*

Page 46: Preliminary Implementation of GRAPES global model Sunway ...

OpenACC* optimization on TaihuLight

[Figure: Tuning tiling configuration. Time in seconds (axis 0.00E+00 to 5.00E-02) for: no acc, (k:1), (k:2), (k:4), (k:8), (k:16), (k:32), all (60); annotated speedups X7 and X13]

    !$acc parallel loop &
    !$acc copyout(c) &
    !$acc collapse(2) tile(k:16)
    DO j = 1, 45
      DO k = 1, 60
        DO i = 1, 90
          c(i,k,j) = 0.
        END DO
      END DO
    END DO
    !$acc end parallel loop
    ….
    !$acc parallel loop copyin(A) &
    !$acc copyout(b) &
    !$acc collapse(2) tile(k:16)
    DO j = 1, 45
      DO k = 1, 60
        DO i = 1, 90
          b(i,k,j) = A(i,k,j)
        END DO
      END DO
    END DO
    !$acc end parallel loop
    ….
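Why the tile size matters can be estimated with simple bookkeeping. The sketch below is a rough model under stated assumptions, not the behaviour of the SWACC compiler: it assumes one work item per (j, k-tile) pair after collapse(2), round-robin distribution over 64 CPEs, and one 8-byte array tile of 90 x tile_k values held in SPM. The loop bounds come from the code above; the function name is illustrative.

```python
import math

def tile_work(nj=45, nk=60, ni=90, tile_k=16, n_cpes=64, bytes_per_value=8):
    """Work items, scheduling rounds and tile size for tile(k:tile_k)."""
    k_tiles = math.ceil(nk / tile_k)
    items = nj * k_tiles                         # one item per (j, k-tile) pair
    rounds = math.ceil(items / n_cpes)           # passes over the CPE cluster
    tile_bytes = ni * tile_k * bytes_per_value   # one array tile in SPM
    return {"items": items, "rounds": rounds, "tile_bytes": tile_bytes}

for t in (1, 16, 60):
    print(t, tile_work(tile_k=t))
```

In this model tile(k:1) yields 2700 tiny transfers of 720 bytes, while taking all 60 levels yields only 45 work items, fewer than the 64 CPEs. Intermediate tile sizes balance transfer size against parallelism, which is the trade-off the tuning chart explores.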

Page 47: Preliminary Implementation of GRAPES global model Sunway ...

OpenACC* optimization on Taihu Light

    start_time = mpi_wtime()
    !$acc parallel loop packin(its,ite,jts,jte,kts,kte,dc025,cp25) &
    !$acc copyin(u,v,pi,d2k,th,thref,thv,fv,zsy,rkrf,dy) &
    !$acc copyout(vl,vn) local(i,k,j,t,zdpdz,uv) &
    !$acc collapse(2) tile(k:10) annotate(entire(dy,d2k,rkrf,fv))
    DO j=jts,jte
      DO k=kts,kte
        DO i=its,ite
          uv=dc025*(u(i,k,j-1)+u(i-1,k,j-1)+u(i,k,j)+u(i-1,k,j))
          t=(pi(i,k,j)-pi(i,k,j-1))/dy(j)
          zdpdz=t+zsy(i,k,j)*((((pi(i,k+1,j)-pi(i,k,j))/d2k(k)+(pi(i,k,j)-pi(i,k-1,j))/d2k(k-1))+ &
                ((pi(i,k+1,j-1)-pi(i,k,j-1))/d2k(k)+(pi(i,k,j-1)-pi(i,k-1,j-1))/d2k(k-1)))*dc025)
          vl(i,k,j)=-fv(j)*uv-cp25*(thref(i,k,j-1)+thref(i,k+1,j-1)+thref(i,k,j)+thref(i,k+1,j))*zdpdz
          vn(i,k,j)=-(th(i,k,j-1)+th(i,k+1,j-1)+th(i,k,j) +th(i,k+1,j))*cp25*zdpdz &
                -cp25*(thv(i,k,j-1)+thv(i,k+1,j-1)+thv(i,k,j)+thv(i,k+1,j))*zdpdz-rkrf(k)*v(i,k,j)*3.
        ENDDO
      ENDDO
    ENDDO
    !$acc end parallel loop
    end_time = mpi_wtime()
    if ( myprcid == 0 ) write(*,*) 'vl,vn use ', end_time - start_time

[Figure: time in seconds (axis 0.000 to 0.020) and thread speedup (axis 0 to 14) for six configurations; speedup labels: 1.0, 6.5, 5.7, 8.4, 10.3, 11.5]

Page 48: Preliminary Implementation of GRAPES global model Sunway ...

OpenACC* optimization on TaihuLight

[Figure: time in seconds (axis 0 to 0.8) for total, transpose and ice1d; series: skylake, acc, no acc]

    ! Transpose A(i,k,j) -> A_c(k,i,j)
    !$acc parallel loop &
    !$acc copy(A_c,…14 3D arrays) &
    !$acc copyin(20 3D arrays) &
    !$acc collapse(2) tile(i:1)
    DO j=1, 45
      DO i=1, 90
        CALL ice1D(A_c(:,i,j)… 34 1D arrays and some others)
      ENDDO
    ENDDO
    !$acc end parallel loop
    ! Transpose A_c(k,i,j) -> A(i,k,j)

    ! ice1d: 1500 lines
    subroutine ice1d(A…)
      real :: A(1:60),……
      do k=kts, kte
        …
      end do
      do k=kts,kte
        ……
      end do
      …
    end

Page 49: Preliminary Implementation of GRAPES global model Sunway ...

Test Results

• Dynamic core of GRAPES global model in double precision
– 0.5° resolution
– 0.25° resolution
• Physics of GRAPES in single precision
– 0.5° resolution

Page 50: Preliminary Implementation of GRAPES global model Sunway ...

GRAPES-GLB Dynamic core 0.5°, Double Precision

[Figure: time (axis 0 to 1800) for MPE, CPEs and Intel, and thread speedup (axis 0 to 6), versus MPI task layout 8x8, 8x16, 16x16, 16x32, 32x32, 32x64, 64x64]

GRAPES global model, 0.5°, no physics, 72 steps
MPE: TaihuLight, no thread parallel processing
CPEs: TaihuLight with CPEs parallel processing
Intel: Node/CPU: Intel Xeon Gold 6124 (2.66 GHz, 16 cores) processor, 2 CPUs/node (16 cores/CPU), 100 Gb/s EDR InfiniBand interconnect
thrd spdp = MPE/CPEs

Page 51: Preliminary Implementation of GRAPES global model Sunway ...

6 routines of dynamic core

[Figure: six panels, one per routine (Helmholtz solver, Departure points, SL interp uvwthpi, SL interp trace substance, PRM, Linear & non-linear), each showing MPE and CPEs time and thread speedup for MPI task layouts 8x8, 8x16, 16x16, 16x32. Thread speedup labels by panel, in order of appearance: 21.98, 19.46, 17.14, 12.49; 8.94, 8.72, 8.30, 6.77; 2.73, 3.31, 3.21, 3.12; 15.94, 12.96, 12.11, 8.45; 6.79, 6.44, 6.52, 6.45; 3.89, 3.82, 3.50, 3.10]

GRAPES global model, 0.5°, no physics, 72 steps
MPE: TaihuLight, only on MPE
CPEs: TaihuLight, on MPE & CPEs

Page 52: Preliminary Implementation of GRAPES global model Sunway ...

GRAPES-GLB Dynamic core 0.25°, Double Precision

[Figure: time (axis 0 to 2500) for MPE, CPEs and Intel, and thread speedup (axis 0 to 7), versus MPI task layout 16x16, 16x32, 32x32, 32x64, 64x64]

GRAPES global model, 0.25°, no physics, 72 steps
MPE: TaihuLight, no thread parallel processing
CPEs: TaihuLight with CPEs parallel processing
Intel: Node/CPU: Intel Xeon Gold 6124 (2.66 GHz, 16 cores) processor, 2 CPUs/node (16 cores/CPU), 100 Gb/s EDR InfiniBand interconnect
thrd spdp = MPE/CPEs

Page 53: Preliminary Implementation of GRAPES global model Sunway ...

6 routines of dynamic core

[Figure: six panels (Helmholtz solver, Departure points, SL interp uvwthpi, SL interp trace substances, PRM, Linear & non-linear), each showing MPE and CPEs time and thread speedup (axis 0 to 35)]

GRAPES global model, 0.25°, no physics, 72 steps
MPE: TaihuLight, no thread parallel processing
CPEs: TaihuLight with CPEs parallel processing

Page 54: Preliminary Implementation of GRAPES global model Sunway ...

3 routines in physics package

[Figure: six panels, Phy_prep, Microphysics and Phy_post_back at 0.5° (x axis 64, 128, 256, 512, 1024, 2048, 4096) and at 0.25° (x axis 256, 512, 1024, 2048, 4096), each showing MPE and CPEs time and thread speedup (axis 0 to 8)]

GRAPES global model, no physics, 72 steps
MPE: TaihuLight, only on MPE
CPEs: TaihuLight, on MPE & CPEs

Page 55: Preliminary Implementation of GRAPES global model Sunway ...

Summary
• The current swGRAPES dynamic core has performance comparable to the Pi system (Intel Skylake processors), suggesting that the refactored GRAPES has the potential to use heterogeneous many-core platforms.
• Fine-grained parallel algorithms for SpMV and ILU make better use of the SW26010 processor, encouraging future many-core-oriented algorithm design.
• Overlapping computation and communication fits the heterogeneous supercomputing system well.
• OpenACC* gives moderate performance improvement and better portability for regular stencil-like computation loops. The key to performance is the best utilization of memory bandwidth.
• Programming with Athread can achieve better performance thanks to full control of the cache, data access and communication, but it lacks portability.
• The work is far from finished, and further improvements are required.

Page 56: Preliminary Implementation of GRAPES global model Sunway ...

Future Work

• GCR algorithm
– Geometric multigrid preconditioning
– Two communication optimization algorithms: pipelined GCR and Chebyshev methods
• Further optimization on Sunway TaihuLight
• Improving portability by using OpenACC* enhancements, high-performance libraries, and code generation frameworks

Page 57: Preliminary Implementation of GRAPES global model Sunway ...

THANK YOU
and questions?

