Managed by Triad National Security, LLC for the U.S. Department of Energy’s NNSA

Los Alamos National Laboratory

Porting mini-apps to ARM HPC systems

Brian J Gravelle, Dave Nystrom

September 2019, ARM Research Summit

(unclassified)

LA-UR-19-29059

10/1/2019 | Los Alamos National Laboratory

Introduction

• x86 and Power dominate HPC CPU market

• ARM is a new alternative with potential for
  • Low-power systems
  • Customization through multiple chip makers
  • High levels of parallelism


Introduction

How will old codes work on new systems?

Are the performance issues significantly different?


Systems Used

• Compare Intel Skylake to Marvell ThunderX2

                     Skylake Gold 6152   ThunderX2-B1
Cores                44                  56
Threads per core     2                   4
Clock                2.1GHz              2.0GHz
L1 data cache        32K                 32K
L2 cache             1024K               256K
L3 cache             30976K              32768K
Memory Controllers   6                   8
SIMD instructions    Up to 512 bit       128 bit NEON

*This work used systems funded by the Computational Systems and Software Environments (CSSE) subprogram of LANL’s ASC program NNSA/DOE


Measurement Methodology

• TAU
  • Performance measurement
  • Sampling
  • Profiling

• Caliper
  • Performance measurement
  • Instrumentation

• PAPI
  • Hardware counter interface


Measurements of Interest

• Frontend and Backend Stalls
• Cache performance
• SIMD instruction use
• Energy


Mini-apps used

• SNAP
  • Mini-app for PARTISN
  • Computation based on 6D neutral particle transport
  • 3D structured spatial mesh
  • 3D velocity space uses 2 angle coordinates and an energy coordinate
  • Fortran, MPI, OpenMP


SNAP Problem size

• Problem Size
  • 3D mesh 270x40x64

• Skylake
  • MPI+OpenMP threading
  • 40 ranks with 2 threads each

• ThunderX2
  • MPI+OpenMP threading
  • 40 ranks with 4 threads each


SNAP Analysis (Intel 2t ARM 4t)

[Figure: SNAP Frontend and Backend Stalls. Bar groups: Frontend Stalls, Backend Stalls. Series: Skylake, ThunderX2. Y-axis: stall count, ticks from 0E+00 to roughly 1E+13.]


SNAP Analysis (Intel 2t ARM 4t)


SNAP Analysis (Intel 2t)

[Figure: Time and Instructions vs Vector types for dim3_sweep. X-axis: Type of Floating Point operations (FLOP count, No Vec, SSE, AVX 2, AVX 512). Y-axis: Count of FP INS or Cycles used in dim3_sweep, 0.00E+00 to 1.60E+13. Series: FP INS count, ideal FP ins, Cycle count.]


SNAP Analysis (ARM 4t)

• ARM SIMD comparison – solve time

No SIMD   NEON    Speedup
84.7s     66.1s   1.28x
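The speedup above is the ratio of the two solve times. A minimal check in Python, using only the numbers reported on this slide (the function and variable names are ours, chosen for illustration):

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


# SNAP solve times reported for ThunderX2 with 4 threads per core.
no_simd_seconds = 84.7
neon_seconds = 66.1

neon_speedup = speedup(no_simd_seconds, neon_seconds)
print(f"{neon_speedup:.2f}x")  # prints 1.28x
```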


SNAP Analysis

• Vector instructions provide significant improvement to Skylake performance

• The improvement from NEON on ThunderX2 is smaller (1.28x on solve time)


SNAP Energy Results

[Figure: SNAP energy results. Bar groups: Intel 1 thread, Intel 2 threads, ARM 1 thread, ARM 4 threads. Series: Package, DRAM, Total. Y-axis: 0.00E+00 to 4.00E+05 (units not given).]


SNAP Energy Results

[Figure: SNAP energy results. Bar groups: Intel 1 thread, Intel 2 threads, ARM 1 thread, ARM 4 threads, Intel NoSIMD 2 threads. Series: Package, DRAM, Total. Y-axis: 0.00E+00 to 5.00E+06 (units not given).]


Mini-apps used

• XSBench
  • “mini-app representing a key computational kernel of the Monte Carlo neutronics application OpenMC”
  • Mostly tabular lookup based on randomly generated energy values
  • C, OpenMP


XSBench Problem size

• Problem Size
  • Standard size reactor with 5E6 particles

• Skylake
  • OpenMP threading
  • 88 threads

• ThunderX2
  • OpenMP threading
  • Ranging from 56 to 448 threads
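These thread counts can be related to the core and threads-per-core figures in the Systems Used table. The sketch below is plain arithmetic on those numbers. The step of 56 between ThunderX2 runs is an assumption taken from the thread-count axis of the hardware counter plots later in the deck, and the dictionary layout and names are ours.

```python
# Core and SMT figures copied from the Systems Used table.
systems = {
    "Skylake Gold 6152": {"cores": 44, "threads_per_core": 2},
    "ThunderX2-B1": {"cores": 56, "threads_per_core": 4},
}


def hardware_threads(spec: dict) -> int:
    """Number of hardware threads the node exposes (cores x threads per core)."""
    return spec["cores"] * spec["threads_per_core"]


skylake_hw = hardware_threads(systems["Skylake Gold 6152"])  # 88
thunderx2_hw = hardware_threads(systems["ThunderX2-B1"])     # 224

# Assumed sweep: 56 to 448 threads in steps of 56 (one step per core count).
thunderx2_thread_counts = list(range(56, 449, 56))

for n in thunderx2_thread_counts:
    per_core = n // systems["ThunderX2-B1"]["cores"]
    note = "" if n <= thunderx2_hw else " (more software threads than hardware threads)"
    print(f"{n:4d} threads = {per_core} per core{note}")
```

Under these assumptions the 88 Skylake threads fill every hardware thread, while the ThunderX2 sweep goes from one thread per core up to twice the 224 hardware threads.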


XSBench Analysis

[Figure: Frontend and Backend Stalls. Bar groups: Frontend Stalls, Backend Stalls. Series: Skylake (history), Skylake (event), ThunderX2 (history), ThunderX2 (event). Y-axis: stall count, 0.00E+00 to 3.00E+12.]


XSBench Analysis


XSBench Analysis

[Figure: Cache Performance. Bar groups: L1 Miss Rate, L2 Miss Rate. Series: Skylake (history), Skylake (event), ThunderX2 (history), ThunderX2 (event). Y-axis: 0% to 100%.]


XSBench Improvements

• Event Based XSBench
  • There is a nuclide array with each element storing random energies and other values
  • The main loop iterates over this array
  • For each nuclide it looks up data based primarily on the energy value
  • Lookups are distributed to threads
  • Randomly ordered energy values prevent locality in the lookups



• To improve cache locality, sort the nuclide array before performing lookups

• Optimized kernel in distribution sorts based on the energy and the material

• Our version sorts only based on the energy
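To illustrate why sorting by energy helps, the sketch below measures how far consecutive lookups jump within a sorted energy grid, first in random order and then after sorting. It is not XSBench code: the grid size, the number of lookups, and all names are hypothetical, and Python cannot show the cache effect itself, only the access pattern that produces it.

```python
import bisect
import random


def grid_index(energy_grid, energy):
    """Index of the grid interval that contains `energy` (clamped at 0)."""
    return max(0, bisect.bisect_right(energy_grid, energy) - 1)


def mean_index_jump(energy_grid, energies):
    """Average distance, in grid entries, between consecutive lookups."""
    indices = [grid_index(energy_grid, e) for e in energies]
    jumps = [abs(b - a) for a, b in zip(indices, indices[1:])]
    return sum(jumps) / len(jumps)


rng = random.Random(0)
# Hypothetical sizes, chosen only to keep the example fast.
energy_grid = sorted(rng.random() for _ in range(1000))
energies = [rng.random() for _ in range(10000)]

random_order_jump = mean_index_jump(energy_grid, energies)
sorted_order_jump = mean_index_jump(energy_grid, sorted(energies))

print(f"random order: {random_order_jump:.1f} grid entries per step")
print(f"sorted order: {sorted_order_jump:.3f} grid entries per step")
```

With sorted energies, consecutive lookups land on the same or a neighbouring grid entry, so data loaded for one lookup is likely still in cache for the next. This sketch sorts on energy only, as in the authors' version. The distributed kernel also sorts on material, which is not modelled here.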



• “Base” versions are the default (event or history)
• “K1” – optimized event-based in distribution
• “K2” – event-based version optimized by us


KNL Experiment

[Figure, two panels: "KNL (DRAM) Speedup compared to thread counts" and "KNL (HBM) Speedup compared to thread counts". X-axis: thread counts (22, 44, 88, 132, 176, 220, 264, 272, 340, 408). Y-axis: Lookups/s, 0.E+00 to about 2.E+07. Series: Base (History), Base (Event), k1 event, k2 event.]



Results

[Figure, two panels: "Skylake Speed compared to thread count" and "ARM Speed compared to thread count". X-axis: thread counts. Y-axis: Lookups/s, 0.E+00 to about 3.E+07.]



ARM HW Counter comparison

[Figure, two panels: "L1 Data Cache misses" (0.00E+00 to 8.00E+10) and "L2 Data Cache misses" (0.00E+00 to 6.00E+10). X-axis: thread counts (56, 112, 168, 224, 280, 336, 392, 448). Series: Event Base, Event K1, Event K2.]


XSBench Energy Results

[Figure: Lookups per Joule (Higher is better). Bar groups: Event Base, Event K1, Event K2. Series: ThunderX2, Skylake. Y-axis: 0.00E+00 to 8.00E+03.]
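The lookups-per-Joule metric divides the work completed by the energy consumed during the run. A minimal sketch of that calculation follows. The input values are hypothetical placeholders, not measurements from these slides, and the function name is ours.

```python
def lookups_per_joule(total_lookups: float, energy_joules: float) -> float:
    """Energy efficiency of a run: higher means more lookups per unit of energy."""
    if energy_joules <= 0:
        raise ValueError("energy_joules must be positive")
    return total_lookups / energy_joules


# Hypothetical inputs for illustration only (not taken from the results above).
example_lookups = 1.0e9
example_energy_joules = 2.5e5

example_efficiency = lookups_per_joule(example_lookups, example_energy_joules)
print(example_efficiency)  # prints 4000.0
```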


Conclusion

• Getting mini-apps (and presumably the full applications on which they are based) up and running on ARM is easy.

• Performance bottlenecks are for the most part similar; i.e. memory bound vs compute bound does not change.

• Some specific issues may differ; in these cases frontend stalls were higher than backend on the ThunderX2.


Questions?

Over 70 years at the forefront of supercomputing
