Managed by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Los Alamos National Laboratory

Porting mini-apps to ARM HPC systems

Brian J Gravelle, Dave Nystrom

September 2019, ARM Research Summit

(unclassified)

LA-UR-19-29059


10/1/2019 | Los Alamos National Laboratory

Introduction

• x86 and Power dominate the HPC CPU market

• ARM is a new alternative with potential for
  • Low-power systems
  • Customization through multiple chip makers
  • High levels of parallelism


Introduction

How will old codes work on new systems?

Are the performance issues significantly different?


Systems Used

• Compare Intel Skylake to Marvell ThunderX2

                      Skylake Gold 6152   ThunderX2-B1
Cores                 44                  56
Threads per core      2                   4
Clock                 2.1GHz              2.0GHz
L1 data cache         32K                 32K
L2 cache              1024K               256K
L3 cache              30976K              32768K
Memory Controllers    6                   8
SIMD instructions     Up to 512 bit       128 bit NEON

*This work used systems funded by the Computational Systems and Software Environments (CSSE) subprogram of LANL’s ASC program NNSA/DOE


Measurement Methodology

• TAU
  • Performance measurement
  • Sampling
  • Profiling

• Caliper
  • Performance measurement
  • Instrumentation

• PAPI
  • Hardware counter interface


Measurements of Interest

• Frontend and Backend Stalls
• Cache performance
• SIMD instruction use
• Energy
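The measurements listed above are usually reported as ratios derived from raw hardware counter totals. The following is a minimal sketch, assuming the conventional definitions (miss rate as misses over accesses, stall fraction as stall cycles over total cycles). The dictionary keys and the numbers are hypothetical placeholders chosen for illustration; they are not PAPI event names and not measurements from this work.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derive_metrics(counters):
    """Turn raw counter totals into ratios.

    `counters` is a plain dict. Its keys are hypothetical labels invented
    for this sketch, not real PAPI or vendor event names.
    """
    return {
        "l1_miss_rate": ratio(counters["l1_misses"], counters["l1_accesses"]),
        "l2_miss_rate": ratio(counters["l2_misses"], counters["l2_accesses"]),
        "frontend_stall_fraction": ratio(
            counters["frontend_stall_cycles"], counters["cycles"]
        ),
        "backend_stall_fraction": ratio(
            counters["backend_stall_cycles"], counters["cycles"]
        ),
    }


# Made-up totals, purely for illustration.
example = {
    "cycles": 1000,
    "frontend_stall_cycles": 100,
    "backend_stall_cycles": 400,
    "l1_accesses": 500,
    "l1_misses": 50,
    "l2_accesses": 50,
    "l2_misses": 25,
}
metrics = derive_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Normalizing by accesses or cycles makes the two systems easier to compare than raw totals, since the runs use different thread counts.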


Mini-apps used

• SNAP
  • Mini-app for PARTISN
  • Computation based on 6D neutral particle transport
  • 3D structured spatial mesh
  • 3D velocity space uses 2 angle coordinates and an energy coordinate
  • Fortran, MPI, OpenMP


SNAP Problem size

• Problem Size
  • 3D mesh 270x40x64

• Skylake
  • MPI+OpenMP threading
  • 40 ranks with 2 threads each

• ThunderX2
  • MPI+OpenMP threading
  • 40 ranks with 4 threads each


SNAP Analysis (Intel 2t ARM 4t)

[Figure: SNAP Frontend and Backend Stalls. Categories: Frontend Stalls, Backend Stalls. Series: Skylake, ThunderX2. Vertical axis from 0E+00 to about 1E+13.]


SNAP Analysis (Intel 2t ARM 4t)


SNAP Analysis (Intel 2t)

[Figure: Time and Instructions vs Vector types for dim3_sweep. Horizontal axis: Type of Floating Point operations (FLOP count, No Vec, SSE, AVX 2, AVX 512). Vertical axis: Count of FP INS or Cycles used in dim3_sweep, from 0.00E+00 to 1.60E+13. Series: FP INS count, ideal FP ins, Cycle count.]


SNAP Analysis (ARM 4t)

• ARM SIMD comparison – solve time

No SIMD    NEON     Speedup
84.7s      66.1s    1.28x
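The speedup column follows directly from the two solve times. As a small worked check (the function name is just a label for this sketch), the ratio of the No SIMD time to the NEON time reproduces the reported figure:

```python
def speedup(baseline_seconds, optimized_seconds):
    """Speedup of the optimized run relative to the baseline run."""
    return baseline_seconds / optimized_seconds


# Solve times reported on the slide for ThunderX2 (4 threads per core).
no_simd_s = 84.7
neon_s = 66.1

neon_speedup = speedup(no_simd_s, neon_s)
time_saved_fraction = 1.0 - neon_s / no_simd_s

print(f"NEON speedup: {neon_speedup:.2f}x")
print(f"Solve time reduced by {time_saved_fraction:.1%}")
```

In other words, enabling NEON shortens the solve by roughly a fifth on this problem.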


SNAP Analysis

• Vector instructions significantly improve Skylake performance

• ThunderX2 benefits much less: NEON gives only a 1.28x solve-time speedup


SNAP Energy Results

[Figure: SNAP energy. Categories: Intel 1 thread, Intel 2 threads, ARM 1 thread, ARM 4 threads. Series: Package, DRAM, Total. Vertical axis from 0.00E+00 to 4.00E+05.]


SNAP Energy Results

[Figure: SNAP energy including the no-SIMD run. Categories: Intel 1 thread, Intel 2 threads, ARM 1 thread, ARM 4 threads, Intel NoSIMD 2 threads. Series: Package, DRAM, Total. Vertical axis from 0.00E+00 to 5.00E+06.]


Mini-apps used

• XSBench
  • “mini-app representing a key computational kernel of the Monte Carlo neutronics application OpenMC”
  • Mostly tabular lookup based on randomly generated energy values
  • C, OpenMP


XSBench Problem size

• Problem Size
  • Standard size reactor with 5E6 particles

• Skylake
  • OpenMP threading
  • 88 threads

• ThunderX2
  • OpenMP threading
  • Ranging from 56 to 448 threads
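These thread counts can be related to the core and threads-per-core figures in the Systems Used table. The sketch below is an illustration only: it assumes the ThunderX2 sweep advances in steps of one thread per core (56, 112, ..., 448), and it treats any count above cores times threads per core as oversubscribed. Both points are inferences rather than statements from the slides.

```python
# Core and threads-per-core figures from the "Systems Used" table.
SYSTEMS = {
    "Skylake": {"cores": 44, "threads_per_core": 2},
    "ThunderX2": {"cores": 56, "threads_per_core": 4},
}


def hardware_threads(system):
    """Number of hardware threads: cores times threads per core."""
    spec = SYSTEMS[system]
    return spec["cores"] * spec["threads_per_core"]


def thread_sweep(system, max_threads):
    """Thread counts in steps of one thread per core, up to max_threads."""
    cores = SYSTEMS[system]["cores"]
    return list(range(cores, max_threads + 1, cores))


skylake_hw = hardware_threads("Skylake")      # 44 * 2
thunderx2_hw = hardware_threads("ThunderX2")  # 56 * 4
sweep = thread_sweep("ThunderX2", 448)
oversubscribed = [n for n in sweep if n > thunderx2_hw]

print("Skylake hardware threads:", skylake_hw)
print("ThunderX2 hardware threads:", thunderx2_hw)
print("ThunderX2 sweep:", sweep)
print("Counts beyond hardware threads:", oversubscribed)
```

Under these assumptions the Skylake run uses every hardware thread (88), while the upper half of the ThunderX2 sweep runs more software threads than the 224 hardware threads available.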


XSBench Analysis

[Figure: Frontend and Backend Stalls. Categories: Frontend Stalls, Backend Stalls. Series: Skylake (history), Skylake (event), ThunderX2 (history), ThunderX2 (event). Vertical axis from 0.00E+00 to 3.00E+12.]


XSBench Analysis


XSBench Analysis

[Figure: Cache Performance. Categories: L1 Miss Rate, L2 Miss Rate. Series: Skylake (history), Skylake (event), ThunderX2 (history), ThunderX2 (event). Vertical axis from 0% to 100%.]


XSBench Improvements

• Event Based XSBench
  • There is a nuclide array with each element storing random energies and other values
  • The main loop iterates over this array
  • For each nuclide it looks up data based primarily on the energy value
  • Lookups are distributed to threads
  • Randomly ordered energy values prevent locality in the lookups


XSBench Improvements

• To improve cache locality, sort the nuclide array before performing lookups

• Optimized kernel in distribution sorts based on the energy and the material

• Our version sorts only based on the energy
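The effect of sorting on locality can be illustrated with a toy model. This is a sketch under stated assumptions, not XSBench code: it is written in Python rather than C, the energy grid and lookup energies are synthetic, and the "total jump" between consecutively visited table indices is only a rough stand-in for cache locality. All names here are hypothetical.

```python
import bisect
import random


def grid_index(energy_grid, energy):
    """Index of the last grid point <= energy, found by binary search."""
    return max(bisect.bisect_right(energy_grid, energy) - 1, 0)


def total_jump(energy_grid, lookup_energies):
    """Sum of index distances between consecutive lookups.

    Smaller values mean successive lookups touch nearby table entries,
    which is used here as a crude proxy for cache locality.
    """
    indices = [grid_index(energy_grid, e) for e in lookup_energies]
    return sum(abs(b - a) for a, b in zip(indices, indices[1:]))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
energy_grid = sorted(rng.random() for _ in range(10_000))  # synthetic table
lookups = [rng.random() for _ in range(1_000)]             # random energies

unsorted_jump = total_jump(energy_grid, lookups)
sorted_jump = total_jump(energy_grid, sorted(lookups))     # sort by energy only

print("total jump, random order:", unsorted_jump)
print("total jump, sorted by energy:", sorted_jump)
```

With the lookups sorted by energy, the visited indices only move forward through the table, so the total jump can never exceed the table length, whereas random ordering hops across the whole table on nearly every lookup.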


XSBench Improvements

• “Base” versions are the default (event or history)
• “K1” – optimized event-based kernel included in the distribution
• “K2” – event-based version optimized by us


KNL Experiment

[Figure: KNL (DRAM) Speedup compared to thread counts. Horizontal axis: Thread Counts (22, 44, 88, 132, 176, 220, 264, 272, 340, 408). Vertical axis: Lookups/s, from 0.E+00 to about 2.E+07. Series: Base (History), Base (Event), k1 event, k2 event.]

[Figure: KNL (HBM) Speedup compared to thread counts. Horizontal axis: Thread counts (22, 44, 88, 132, 176, 220, 264, 272, 340, 408). Vertical axis: Lookups/s, from 0.E+00 to about 2.E+07. Series: Base (History), Base (Event), k1 event, k2 event.]


Results

[Figure: Skylake Speed compared to thread count. Horizontal axis: thread counts. Vertical axis: Lookups/s, from 0.E+00 to about 3.E+07. Series: Base (History), Base (Event), k1 event, k2 event.]

[Figure: ARM Speed compared to thread count. Horizontal axis: thread counts. Vertical axis: Lookups/s, from 0.E+00 to about 3.E+07. Series: Base (History), Base (Event), k1 event, k2 event.]


ARM HW Counter comparison

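[Figure: L1 Data Cache misses. Horizontal axis: thread counts (56, 112, 168, 224, 280, 336, 392, 448). Vertical axis from 0.00E+00 to 8.00E+10. Series: Event Base, Event K1, Event K2.]

[Figure: L2 Data Cache misses. Horizontal axis: thread counts (56, 112, 168, 224, 280, 336, 392, 448). Vertical axis from 0.00E+00 to 6.00E+10. Series: Event Base, Event K1, Event K2.]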


XSBench Energy Results

[Figure: Lookups per Joule (Higher is better). Categories: Event Base, Event K1, Event K2. Series: ThunderX2, Skylake. Vertical axis from 0.00E+00 to 8.00E+03.]
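Lookups per Joule combines throughput and energy into a single efficiency figure. The sketch below shows the arithmetic with made-up inputs. The slides do not say how energy was aggregated, so which energy domains are summed (for example package plus DRAM) is left to the caller, and all function names and values are hypothetical.

```python
def lookups_per_joule(total_lookups, energy_joules):
    """Efficiency as work done per unit of energy consumed."""
    return total_lookups / energy_joules


def lookups_per_joule_from_rate(lookups_per_second, average_watts):
    """Same metric from throughput and average power (1 W = 1 J/s)."""
    return lookups_per_second / average_watts


# Hypothetical run: 1e7 lookups in 10 s at an average draw of 200 W.
total_lookups = 1.0e7
runtime_s = 10.0
average_watts = 200.0
energy_joules = average_watts * runtime_s

from_totals = lookups_per_joule(total_lookups, energy_joules)
from_rates = lookups_per_joule_from_rate(total_lookups / runtime_s, average_watts)

print("lookups per joule (from totals):", from_totals)
print("lookups per joule (from rates):", from_rates)
```

Because the two forms are equivalent, a system with lower lookups/s can still score higher on this metric if its average power draw is proportionally lower.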


Conclusion

• Getting mini-apps (and presumably the full applications on which they are based) up and running on ARM is easy.

• Performance bottlenecks are for the most part similar; i.e., whether a code is memory bound or compute bound does not change.

• Some specific issues may differ; in these cases frontend stalls were higher than backend stalls on the ThunderX2.


Questions?

Over 70 years at the forefront of supercomputing

